Standardize bulk data file access #17
Comments
Thanks for sharing this idea here. Let's take one step back and discuss the problem we're trying to solve. There are a few issues I see with this technical implementation, but understanding the reasons for these decisions and the desired outcome would help us plan something that could meet these needs and others. A couple of questions to hopefully address:
I can see a few options for improvements coming out of this, such as providing a JSON manifest or getting us back to more predictable file names for the zip files. Glad to work with you to find a solution here.
I’ve been using the v3 API, but quickly hit a 429 status code. That’s a separate issue, and one that could be easily fixed, but I’d also argue the API serves a different use case entirely: smaller, targeted requests. My primary use case is parsing the older data, and it’s much more efficient to pull down bulk data at rest and then run an expensive ETL process; that same data can be tagged and reused transactionally. I could build retry policies, construct semaphores to rate-limit client access, paginate the data, and then process it, but that seems expensive and prone to concurrency issues. Secondarily, consistent and predictable access to the bulk data seems necessary to break the data into consumable chunks that are easier to read, understand, and program against. I see the API as a great tool for pulling smaller updates when caches are invalidated or similar, outside of the bulk ETL use case.
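For context, this is roughly the client-side machinery the API path requires; a minimal sketch only, with the endpoint, parameter names, and omitted API-key handling all treated as assumptions:

```python
# Minimal sketch of 429 handling against the v3 API; endpoint, parameter
# names, and auth handling are illustrative assumptions.
import time
import requests

def get_with_backoff(url, params=None, headers=None, max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, params=params, headers=headers)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After when the server sends a numeric value,
        # otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else delay
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("gave up after repeated 429 responses")

# e.g. page through one jurisdiction's bills (API key header omitted):
page_one = get_with_backoff(
    "https://v3.openstates.org/bills",
    params={"jurisdiction": "Montana", "page": 1},
)
```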
An alternative to FTP could be HTTPS, but a client would need to be built in order to extract the data. I have the US Census TIGER FTP server in mind: download all the data you need, legacy included, and then process it. I imagine state legislative data rarely changes after the session expires.
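To illustrate the TIGER-style workflow being described, here is a minimal sketch that recursively mirrors one state's folder; the host and directory layout are placeholders, since no such Open States FTP server exists today:

```python
# Sketch only: recursively mirror one state's folder from a hypothetical
# bulk-data FTP server. Host and directory layout are placeholders.
import os
from ftplib import FTP, error_perm

def mirror(ftp, remote_dir, local_dir):
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    for name in ftp.nlst():
        if name in (".", ".."):
            continue
        remote_path = f"{remote_dir.rstrip('/')}/{name}"
        local_path = os.path.join(local_dir, name)
        try:
            ftp.cwd(remote_path)              # succeeds -> directory, recurse
            mirror(ftp, remote_path, local_path)
        except error_perm:                    # fails -> plain file, download it
            with open(local_path, "wb") as fh:
                ftp.retrbinary(f"RETR {remote_path}", fh.write)

with FTP("ftp.example.org") as ftp:           # placeholder host
    ftp.login()                               # anonymous login
    mirror(ftp, "/bulk/MT", "MT")             # everything for Montana
```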
A list of which zips are available and consistent naming conventions are missing, from what I can tell, unless I overlooked something.
openstates.org/data has the zip files I’m referencing - have you tried using those?
Yes, that's what I was looking at.
"PostgreSQL Data Dump" has the bulk data I want, but requires a bit of setup to access... I could setup Docker and access the database through the container. "Legacy Data Dumps" is the data and format I'd like (at a quick glance), but it looks like it's stale and won't be updated. |
Can you elaborate a bit on these two things:
If the links didn't have the random hashes on the end, would the data.openstates.org/csv/ files be close to what you want?
Sure!

A.) For HTTPS, I would need to write a library/application that downloads the files from each resource path. That's fine, but what's missing is something like a manifest/index that tells me what's in each directory/path/route. If I'm trying to walk through all the available data, I need to know that the resource at "~/MT/116/2/bills.json" exists; otherwise I'd have to keep traversing until I hit a 404. With FTP, you can download just the chunk/slice you need using an FTP client (e.g., FileZilla) - I can request all files for Montana. You could still run into use cases for manifests/indexes with FTP, but at least you can download an entire folder recursively and then process it. HTTPS works, if that's easier to manage.

B.) JSON would be preferable to CSV; that would be great. Right now, I'd have to scrape the DOM, collect all the links, ignore the checksum hashes, and handle the existing naming conventions in order to extract metadata and determine what's available in bulk, programmatically (sketched below).

So, given A and B, maybe we can adjust the request to: Bulk Data API w/ JSON resources? My thought was that an FTP file dump seems like an easier/more direct solution to bulk data, but either does the job. Edit: FTP protocols allow for the
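A rough sketch of the status-quo workaround described in B.): scrape the download page for zip links and strip the trailing hashes. The page URL comes from the thread; the link and filename pattern is an assumption:

```python
# Sketch of the current workaround: scrape the data page for zip links and
# strip the trailing hash. The filename pattern here is an assumption.
import re
from html.parser import HTMLParser
from urllib.request import urlopen

class ZipLinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.endswith(".zip"):
                self.links.append(href)

html = urlopen("https://openstates.org/data/").read().decode("utf-8")
collector = ZipLinkCollector()
collector.feed(html)

for href in collector.links:
    filename = href.rsplit("/", 1)[-1]
    # Assumed convention: <identifier>_<hex-hash>.zip -> drop the hash part.
    stable_key = re.sub(r"_[0-9a-f]{6,}\.zip$", ".zip", filename)
    print(stable_key, "<-", href)
```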
I agree & think a manifest that contains links to all of the files would be doable here, and the main addition to what we already have (the manifest could list both JSON & CSV versions). |
Would something like this work?
Or if you wanted to return the raw JSON... I suppose that's what you had in mind, reading back. |
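For concreteness, a minimal sketch of what writing out such a static manifest could look like; the field names and URLs are illustrative assumptions, not an agreed-upon format:

```python
# Illustrative only: one possible shape for a static bulk-data manifest.
# Field names and URLs are assumptions for discussion, not an existing format.
import json

manifest = [
    {
        "jurisdiction": "Montana",
        "session": "2021",
        "csv": "https://data.openstates.org/csv/MT_2021.zip",    # hash-free name assumed
        "json": "https://data.openstates.org/json/MT_2021.zip",  # hypothetical JSON dump
    },
    # ... one entry per jurisdiction/session ...
]

with open("bulk-manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```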
Yes, I think writing out a static file would be the preferred way to handle this, since it doesn't change frequently enough IMO to warrant an API. Would you need much more than this?
Sorry for the delayed response. I think the session number would implicitly cover the legislature number, and the state would be covered by jurisdiction. Session classification would be nice, if it isn't provided in the files themselves (e.g., regular, special_or_extraordinary, interim). Would there be one file with all of the bulk data (i.e., bill_abstracts, bill_actions, bill_sources, etc.)? If so, that should be enough!
Right now, bulk data file naming and structure have no consistency across states, nor is there a manifest of what's available. I'd propose, should it prove viable, the following file structure, accessible via an FTP server:
FTP provides a standard protocol for easy manual and programmatic access, as well as authentication/authorization. Consistent file naming and structure promote data organization and allow users/systems to consume the data predictably. Other data formats could be placed in the same directory using the same naming conventions.
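As an illustration of the predictability argument, assuming a layout along the lines of the ~/MT/116/2/bills.json example from the earlier comment (the base host and exact path template are placeholders, and the same idea applies whether the files are served over FTP or HTTPS), a consumer could construct paths without any discovery step:

```python
# Sketch: with a consistent layout, clients can build download paths directly.
# The base URL and path template are assumptions borrowed from the discussion.
import requests

BASE = "https://data.example.org/bulk"        # placeholder host

def bill_file_url(state, legislature, session):
    # e.g. MT/116/2/bills.json, per the layout floated earlier in the thread
    return f"{BASE}/{state}/{legislature}/{session}/bills.json"

resp = requests.get(bill_file_url("MT", 116, 2))
resp.raise_for_status()
bills = resp.json()
```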