
Add failed-import batch archiving to aid debugging #24

Open · wants to merge 1 commit into main

Conversation

jessemortenson (Contributor) commented:

My goal here is to make it easier to debug failed imports that occur under realtime processing. This code zips up the jurisdiction data directory that was being processed when an import fails and puts it in the archive/ path of the realtime processing S3 bucket. It includes the name of the archive zip in the log message, so that whoever is debugging can download and examine the data that was being imported at the time.

This is kind of yet another hack on top of hacks, but hopefully at least just a logging thing and not something that further complicates the data flow here.
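For reference, here is a minimal sketch of the archiving step as I think of it conceptually. The bucket name, function name, and paths are placeholders rather than the exact code in this PR; it assumes boto3 plus the standard library.

import logging
import os
import shutil

import boto3  # assumed dependency for S3 access

logger = logging.getLogger(__name__)

REALTIME_BUCKET = "openstates-realtime-bucket"  # placeholder name


def archive_failed_batch(datadir: str, jurisdiction_id: str) -> str:
    """Zip the jurisdiction data directory that was being imported and
    upload it to the archive/ path of the realtime processing bucket."""
    zip_path = shutil.make_archive(f"/tmp/{jurisdiction_id}-failed-import", "zip", datadir)
    key = f"archive/{os.path.basename(zip_path)}"
    boto3.client("s3").upload_file(zip_path, REALTIME_BUCKET, key)
    # Include the archive name in the error log so whoever is debugging can pull it down
    logger.error(
        "Import failed for %s; batch data archived to s3://%s/%s",
        jurisdiction_id, REALTIME_BUCKET, key,
    )
    return key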

Thinking through this, I think a more reasonable overall process might be something like the following. I didn't implement this because it would be more work, and work spanning into openstates-core. But consider this a little mini-EP tagged onto this PR for your feedback to inform future work.

  1. Scraper is yielding scraped entities
  2. The process that receives those entities and saves them to JSON keeps accumulating them up to a certain increment (15 minutes? 200 entities?)
  3. Once that increment is met, that process consolidates the data in the increment into a couple of parquet files (one per entity type), then uploads those to S3 along with an SQS message identifying them (see the sketch after this list)
  4. The realtime lambda receives the message and processes the full increment as one batch
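A rough sketch of what steps 2-3 might look like, assuming pandas/pyarrow for the parquet writes and boto3 for the S3/SQS hand-off; the bucket, queue URL, and function name are all hypothetical:

import json

import boto3
import pandas as pd  # assumes pandas + pyarrow are available for parquet output

BUCKET = "openstates-realtime-bucket"  # hypothetical
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/.../realtime-batches"  # hypothetical


def flush_increment(entities_by_type: dict, batch_id: str) -> None:
    """Consolidate one increment into parquet files (one per entity type),
    upload them to S3, and send a single SQS message identifying the batch."""
    keys = []
    for entity_type, records in entities_by_type.items():
        local_path = f"/tmp/{batch_id}-{entity_type}.parquet"
        pd.DataFrame.from_records(records).to_parquet(local_path)
        key = f"batches/{batch_id}/{entity_type}.parquet"
        boto3.client("s3").upload_file(local_path, BUCKET, key)
        keys.append(key)
    # One message per increment, instead of one S3 object + message per entity
    boto3.client("sqs").send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"batch_id": batch_id, "bucket": BUCKET, "keys": keys}),
    )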

I think that process would help a few things:

  1. Reduce S3 API costs, because we'd be uploading 1-3 parquet files per batch instead of thousands of JSON files all the time
  2. An error in processing in the lambda could result in the same parquet files being simply moved to an archive location
  3. Reduce the odds that multiple lambda executions are processing files from the same jurisdiction at the same time, and increase the odds that concurrent executions are instead each working on their own jurisdiction. I don't know for sure but I suspect this will reduce a few errors.

Now of course I think our original idea of the big SQL Mesh transformation engine is even better than the above, but the above is less work than that and probably still a significant improvement.

alexobaseki (Contributor) left a comment:


LGTM! The ideas for improvement also look solid. I have a question around the conditional logic, but it's just for clarification and testing to be sure it works as expected.

sqs_delete_fetched_messages = os.environ.get("SQS_DELETE_FETCHED_MESSAGES", True)
if (
    sqs_delete_fetched_messages is not False
    and sqs_delete_fetched_messages.lower() != "true"
    and sqs_delete_fetched_messages != "1"

I don't follow this logic. Doesn't sqs_delete_fetched_messages.lower() != "true" and sqs_delete_fetched_messages != "1" mean that sqs_delete_fetched_messages is probably "false" or "0"? Shouldn't that be sqs_delete_fetched_messages = False then?
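For what it's worth, one unambiguous way to handle the flag would be to normalize the env var to a lowercase string and compare it against a set of falsy values. This is just a sketch, not the PR's code; it assumes the default should be to delete fetched messages:

import os

# Default to deleting fetched messages; "false", "0", and "no" (any case) disable it.
raw = os.environ.get("SQS_DELETE_FETCHED_MESSAGES", "true")
sqs_delete_fetched_messages = str(raw).strip().lower() not in ("false", "0", "no")

if sqs_delete_fetched_messages:
    # delete the messages that were just fetched
    ...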
