add a formal post-processing step #20

Open · jamesturk opened this issue Apr 1, 2021 · 2 comments
jamesturk commented Apr 1, 2021

There are a few things all moving along that I'd put together under the umbrella of post-processing.

I have a draft PR in progress and will go ahead and put that up soon but wanted to perhaps kick the conversation off a bit earlier to take concerns into account and try to figure out what needs to be handled.

The basic idea here is to add a formal post-processing step that occurs as part of the scrape, but that can be re-run independently of a scrape when improvements are made.

Use cases:

The way things work right now, these things are either part of the scrape (in which case they are reflected in the JSON) or part of the import process (in which case they are done as the JSON is being transformed into database objects).

This presents a few issues:

  • Neither of these is easy to re-run in its current form, so we wind up leaving old data unresolved or writing custom scripts to go back and fix old data. I think this will become even more important as we consider paths towards more incremental scraping: a bill that isn't re-scraped won't get updated in the same way as others if its actions are better categorized, etc.
  • Developing on these requires repeatedly running a scrape with --fast or --import as appropriate. (This can particularly be a challenge in states where fast mode isn't that fast.)
  • Scraper code is forced to reinvent the wheel and gets littered with classification logic, which might only get worse as we take on more of this kind of work, like effective date handling.
  • Also worth mentioning is that @showerst doesn't use anything beyond the scrape step, so pushing that work into import as-is would mean a regression from his point of view.

So what I'm proposing is essentially a new phase that runs after the scrape.

I have some ideas about how it would run, and plan to sketch them out and experiment a bit with pros/cons of each but wanted to get initial feedback, concerns, things to keep in mind, etc.

But high level, I think it'd wind up looking something like:

  • Scrape step would output JSON as it does now.
  • Postprocess step would run various functions to augment the JSON; the result of all of these steps would be an augmented JSON file.
  • Import/etc. would work as it does now.

The only real magic would be in the postprocess functions, which could be written so that they can run against either the JSON data or database objects. That would make it possible to have a generic command to reprocess any/all of the data (e.g. re-run the action classifier on KS 2021).
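
To make that concrete, here's a very rough sketch of the kind of function I have in mind -- the names and the pattern-table approach are purely illustrative, not a proposed API:

```python
import re

# Illustrative classification table; real rules would live per jurisdiction.
ACTION_PATTERNS = [
    (re.compile(r"\bpassed\b", re.I), "passage"),
    (re.compile(r"\bvetoed\b", re.I), "executive-veto"),
]


def categorize_actions(bill: dict) -> dict:
    """Add classifications to a bill's actions.

    `bill` is just a plain dict, so the same function could run against the
    scraped JSON output or a database object serialized to the same shape --
    which is what would let a generic command reprocess old data.
    """
    for action in bill.get("actions", []):
        classifications = set(action.get("classification", []))
        for pattern, category in ACTION_PATTERNS:
            if pattern.search(action.get("description", "")):
                classifications.add(category)
        action["classification"] = sorted(classifications)
    return bill
```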

A few thoughts on backwards compatibility (aka how to do this without blowing everything up):

  • By default this step would run automatically, so the JSON output would have any/all augmentations that are expected. Hopefully that would make it pretty painless to keep using the JSON as-is.
  • I figure they could be turned on jurisdiction-by-jurisdiction, so existing scrapers could keep their action classifiers inline until they're factored out into this new location/format.
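
A purely hypothetical sketch of that opt-in (nothing named `postprocessors` exists today, and the base class import is approximate):

```python
from openstates.scrape import State


class Kansas(State):
    # ... existing scraper configuration stays as-is ...

    # hypothetical: opting this jurisdiction into the new postprocess step;
    # jurisdictions that don't set this keep their inline classifiers for now
    postprocessors = ["categorize_actions"]
```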
jamesturk (author) commented:

Earlier discussion: #13

showerst commented Apr 1, 2021

I think I like the general idea of this, and thanks for taking my use case into account.

A few thoughts:

  1. Perhaps we could have these postprocess steps built as discrete tasks. They all (?) run by default, but can be overridden at scrape time by some argument list, env var, or whatever. I'm envisioning each jurisdiction having either a single file with all the post-process classes, or a directory where each one gets its own file (see the sketch after this list). For example:
     poetry run os-update nc bills --scrape postprocessors=ActionClassifier,SponsorLinker,VoterLinker
  2. Then a 'save' hook fires upon completion so the augmented data can be emitted to a file, the standard database importer, an SQS queue, whatever. I'm doing this now by overriding the save handler, but we could probably trivially plugin-ize this.
  3. Alternately, the tasks could be run against an existing postgres DB or a path of JSON files, without doing a scrape.
  4. I think it would be important that these run directly as the objects are saved, rather than after a full scrape. This comes back to incremental scraping in really slow states.
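
To sketch what I mean by discrete tasks plus a pluggable save hook (every name here is invented for illustration -- this isn't a concrete API proposal, just the shape of the thing):

```python
# All names invented for illustration; nothing like this exists in the codebase yet.
class PostProcessor:
    """One discrete postprocess task, selectable by name at scrape time."""

    def process(self, obj: dict) -> dict:
        raise NotImplementedError


class ActionClassifier(PostProcessor):
    def process(self, obj: dict) -> dict:
        # ... classify obj["actions"] in place ...
        return obj


class SponsorLinker(PostProcessor):
    def process(self, obj: dict) -> dict:
        # ... resolve sponsor names to person IDs ...
        return obj


def save(obj: dict, postprocessors: list[PostProcessor], emit) -> None:
    """Run the enabled postprocessors as each object is saved, then hand the
    augmented object to whatever sink is plugged in (a JSON file on disk,
    the standard database importer, an SQS queue, ...)."""
    for task in postprocessors:
        obj = task.process(obj)
    emit(obj)
```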

Worth discussing here -- there are some things right now that we're both doing separately -- pipelines for bill version downloading, plaintext-ing, summary extraction, member linking, etc. I think it's probably scope creep to talk about doing that here, but if we're talking about a pipeline anyway, maybe not. On one hand I don't want to turn each bill scrape into a multi-hour, 20-gigs-of-RAM affair, but on the other hand that work's getting done someplace =P.

Couple thoughts about the specific issue we want to solve:

For bill actions, we want:

  1. Action Classification
  2. (in some jurisdictions) Actor
  3. Changes to the bill's "state" -- Passed Origin Chamber, Dead, Vetoed, Effective Dates, Citations, etc. In some cases you need to look at multiple actions to derive the state, which is a weakness of the existing code. I ran into this recently in Oklahoma: both chambers have to override a governor's veto for it to be a true veto override, and those overrides happen as separate actions. This is also true for tracking 'enrolled'.
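
A rough sketch of what I mean by deriving that kind of status from the whole action list (the field names and classification strings here are illustrative, not a spec):

```python
# Illustrative only: derive bill-level status by looking across all actions,
# e.g. only treat a veto as overridden once *both* chambers have acted.
def derive_status(actions: list[dict]) -> list[str]:
    status = []
    override_chambers = {
        action.get("chamber")
        for action in actions
        if "veto-override-passage" in action.get("classification", [])
    }
    if {"upper", "lower"} <= override_chambers:
        status.append("veto-overridden")
    return status
```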

FWIW (and we've kind of discussed this elsewhere) many consumers are trying to translate these actions into some kind of bill chart of standardized steps anyway, so maybe we should just formalize this in the metadata.

For member disambiguation we want:

  1. To check a (name, chamber, jurisdiction, session_date) tuple against some kind of alias -> ID mapping -- a database, a flat file from openstates-people, whatever.
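
The lookup itself could be pretty simple -- something like the sketch below, where the data shape is invented and would really be loaded from openstates-people or a database:

```python
from datetime import date

# Invented data shape for illustration:
# (name, chamber, jurisdiction) -> [(start, end, person_id), ...]
ALIASES = {
    ("Smith", "upper", "nc"): [
        (date(2019, 1, 1), date(2022, 12, 31), "ocd-person/0000..."),
    ],
}


def resolve_person(name: str, chamber: str, jurisdiction: str, session_date: date):
    """Return the person ID whose alias was active on session_date, or None."""
    for start, end, person_id in ALIASES.get((name, chamber, jurisdiction), []):
        if start <= session_date <= end:
            return person_id
    return None
```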

I think moving this to some kind of post-process step and out of the bill scraper code makes sense here, but I am wondering if divorcing it from the step where you have easy access to the scraped HTML is going to bite us in code complexity. There will still be cases where you'll need the scraped data to do the classification, so it would be happening in two separate spots.

Issues:

  1. Are there any states where the action classification data comes from data that's not currently getting saved in the bill object? I'm thinking about states like Arizona here; I recall that you have to check for the existence of a bunch of keys in a downloaded JSON to determine the state of things, and add fake actions that way. I think there are a few states where we get the actor or classification with CSS queries, because the chambers have different background colors or something, but I'm not 100% sure on that.
  2. There are some goofy state-specific things, like that recent one in Michigan where we're determining the actor based on the upper/lower case of the action, but only if some HTML element is blank. I suppose that's no different separated out into a different file that says "classify these actions if the actor is null," but that might require some hacks to the model constraints, to allow emitting 'broken' bills that (hopefully) get fixed by later steps.

I'll think on this some more in the coming days, but initially it seems like a pretty good idea.
