Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

378 update submission processing to use polars data validator #394

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

jcadam14
Copy link
Contributor

@jcadam14 jcadam14 commented Sep 5, 2024

Closes #378

Changes the filing-api in the following ways:

  • When the content is uploaded, use python csv to get the row count since chunking generators in data-validator no longer provide final counts of rows in file
  • Use generator from validator to get findings dataframe, and collate into final dataframe
  • Updated to generate a file path that's either local or s3://, which the data-validator can use with fsspec to correctly use either way
  • Calls new df_to_download to write out s3, the data validator handles the writing
  • Updated build_validation_results to use the collated findings dataframe to build out the json for the frontend, ValidationResults are no longer used in the filing-api
  • Updated the pytests
  • Updated boto3 to stick to version 1.34 since fsspec and s3fs (which is used in the data-validator) don't currently support botocore 1.35. When the data-validator is separated, we can switch back to the latest boto3

@jcadam14 jcadam14 self-assigned this Sep 5, 2024
@jcadam14 jcadam14 linked an issue Sep 5, 2024 that may be closed by this pull request
Copy link

github-actions bot commented Sep 5, 2024

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  src/sbl_filing_api
  config.py
  src/sbl_filing_api/routers
  filing.py
  src/sbl_filing_api/services
  submission_processor.py 64
Project Total  

This report was generated by python-coverage-comment-action

@jcadam14
Copy link
Contributor Author

jcadam14 commented Nov 7, 2024

This was tested on the devpub cluster as well as pytests and docker-compose. I had to merge this with the counters PR (#473) since the database and all that had been updated already with that work. So on devpub, it's a combo of these two. But once one of them is approved I'll do the work to merge them in whatever other PR is left open

Comment on lines 118 to 119
except RuntimeError as re:
log.exception("The file is malformed.", re)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be needed, log.exception contains exc_info=True, which extracts the exception from the stack.

@@ -14,11 +14,12 @@ asyncpg = "^0.29.0"
regtech-api-commons = {git = "https://github.com/cfpb/regtech-api-commons.git"}
regtech-data-validator = {git = "https://github.com/cfpb/regtech-data-validator.git"}
regtech-regex = {git = "https://github.com/cfpb/regtech-regex.git"}
boto3 = "~1.34.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the downgrade? fsspec related?

Copy link
Contributor Author

@jcadam14 jcadam14 Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, with the boto3 and fsspec tie in in the data-validator. But if I remove that to have the download send back a df, this dependency goes away. Though I'd need to rethink reading in the csv. Right now the filing api just sends over a path and the data validator decides what to do based on path type. s3fs needs the 1.34 version of boto3

pyproject.toml Outdated
alembic = "^1.13.3"
async-lru = "^2.0.4"
ujson = "^5.10.0"
psutil = "^6.0.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope!

)
report_path = generate_file_path(period_code, lei, f"{submission.id}_report")

df_to_download(final_df, report_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can worry about this later, but validator should have no knowledge about how to store results / what to do with the results. the wrapper (in this case the filing API) should be the one that deals with uploading the results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think originally I was trying to use a more streaming approach. I don't mind updating all this back to have the filing api know. I had a weird mix of interim solutions going on so this is a carry over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update submission processing to use polars data-validator
2 participants