add a formal post-processing step #20
Earlier discussion: #13
I think I like the general idea of this, and thanks for taking my use case into account. A few thoughts:
Worth discussing here -- there are some things right now that we're both doing separately -- pipelines for bill version downloading, plaintext extraction, summary extraction, member linking, etc. I think it's probably scope creep to talk about doing that here, but if we're talking about a pipeline anyway, maybe not. On one hand I don't want to turn each bill scrape into a multi-hour, 20 GB of RAM affair, but on the other hand that work's getting done someplace =P. A couple of thoughts about the specific issue we want to solve: For bill actions, we want:
FWIW (and we've kind of discussed this elsewhere) many consumers are trying to translate these actions into some kind of bill chart of standardized steps anyway, so maybe we should just formalize this in the metadata (see the sketch after this comment). For member disambiguation we want:
I think moving this to some kind of post-process step and out of the bill scraper code makes sense here, but I am wondering if divorcing it from the step where you have easy access to the scraped HTML is going to bite us in code complexity. There will still be cases where you'll need the scraped data to do the classification, so it's happening in two separate spots. Issues:
I'll think on this some more in the coming days, but initially it seems like a pretty good idea.
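To make the "standardized steps" idea above concrete, here is a minimal sketch of what a rule-based action classifier might look like. The patterns and classification names are illustrative assumptions, not the project's actual taxonomy:

```python
import re

# Illustrative patterns only -- the real taxonomy and per-state rules
# would presumably live in metadata, not be hardcoded like this.
_ACTION_RULES = [
    (re.compile(r"introduced", re.I), ["introduction"]),
    (re.compile(r"first read", re.I), ["reading-1"]),
    (re.compile(r"referred to (the )?committee", re.I), ["referral-committee"]),
    (re.compile(r"signed by (the )?governor", re.I), ["executive-signature"]),
]

def classify_action(description):
    """Map a raw action string to zero or more standardized classifiers."""
    classifications = []
    for pattern, classes in _ACTION_RULES:
        if pattern.search(description):
            classifications.extend(classes)
    return classifications
```

With something like this formalized in the metadata, a consumer could build a bill-progress chart from the classifications alone, without parsing each state's raw action text.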
There are a few things all moving along that I'd put together under the umbrella of post-processing.
I have a draft PR in progress and will put that up soon, but I wanted to kick the conversation off a bit earlier to take concerns into account and try to figure out what needs to be handled.
The basic idea here is to add a formal post-processing step that occurs as part of the scrape, but that can be re-run independently of a scrape when improvements are made.
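As a rough sketch of that shape (all names here are hypothetical placeholders, not existing code), the key is a single entry point that both the scrape and a standalone re-run command can call:

```python
# Hypothetical registry of postprocess functions; each takes the scraped
# data for one jurisdiction/session and returns it (possibly modified).
POSTPROCESSORS = []

def postprocessor(fn):
    """Decorator that registers a function as a postprocess step."""
    POSTPROCESSORS.append(fn)
    return fn

def run_postprocessors(data):
    """Apply every registered step in order. The scrape would call this
    as its final phase; a standalone command could reload stored JSON
    (or database objects) and call the exact same function to re-run
    improved steps without re-scraping."""
    for fn in POSTPROCESSORS:
        data = fn(data)
    return data
```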
Use cases:
The way things work right now, these things can either be part of the scrape (in which case they are reflected in the JSON) or part of the import process (in which case they happen as the JSON is transformed into database objects).
This presents a few issues:
So what I'm proposing is essentially a new phase that runs after the scrape.
I have some ideas about how it would run, and plan to sketch them out and experiment a bit with the pros/cons of each, but wanted to get initial feedback, concerns, things to keep in mind, etc.
But high level, I think it'd wind up looking something like:
The only real magic would be the postprocess functions, which could be written in a way where they can run against either the JSON data or database objects; this would make it possible to have a generic command to reprocess any or all of the data (e.g. re-run the action classifier on KS 2021). A rough sketch of what such a dual-mode function might look like follows.
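A minimal sketch, assuming the scraped JSON dicts and the ORM objects expose the same field names (and reusing the hypothetical classify_action from the earlier sketch):

```python
def _get(obj, field):
    """Read a field from either a scraped dict or a database object."""
    return obj[field] if isinstance(obj, dict) else getattr(obj, field)

def _set(obj, field, value):
    """Write a field on either a scraped dict or a database object."""
    if isinstance(obj, dict):
        obj[field] = value
    else:
        setattr(obj, field, value)

def postprocess_actions(bill):
    """Reclassify every action on a bill, whether the bill came from
    scraped JSON or from the database."""
    for action in _get(bill, "actions"):
        _set(action, "classification",
             classify_action(_get(action, "description")))
    return bill
```

A generic reprocess command would then just iterate the bills for a given jurisdiction/session, from whichever store, and apply each registered function.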
A few thoughts on backwards compatibility (aka how to do this without blowing everything up):