
Standardize bulk data file access #17

Open
mattheiler opened this issue Mar 16, 2021 · 13 comments

@mattheiler

mattheiler commented Mar 16, 2021

Right now, bulk data file naming and structure are inconsistent across states, and there is no manifest describing what's available. Should it prove viable, I'd propose the following file structure, accessible via an FTP server:

/<state>
  /<legislature>
    /<session>
      bill_abstracts.csv
      bill_actions.csv
      bill_sources.csv
      bill_sponsorships.csv
      bill_version_links.csv
      bill_versions.csv
      vote_counts.csv
      vote_people.csv
      vote_sources.csv
      votes.csv

FTP provides a standard protocol for easy manual and programmatic access, plus authentication/authorization. Consistent file naming and structure keeps the data organized and lets users and systems consume it predictably. Other data formats could be placed in the same directories, using the same naming conventions.
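To illustrate the benefit of a predictable layout: a client could construct every file URL for a session directly, with no scraping or directory listing. This is a minimal sketch assuming the hypothetical layout above; the host name and numbering are placeholders, not a real Open States endpoint.

```python
# Hypothetical base host; the /<state>/<legislature>/<session> layout
# mirrors the directory structure proposed above.
BASE = "ftp://bulk.example.org"

FILES = [
    "bill_abstracts.csv", "bill_actions.csv", "bill_sources.csv",
    "bill_sponsorships.csv", "bill_version_links.csv",
    "bill_versions.csv", "vote_counts.csv", "vote_people.csv",
    "vote_sources.csv", "votes.csv",
]

def session_urls(state: str, legislature: int, session: int) -> list[str]:
    """Build the full URL for every bulk file in one session."""
    prefix = f"{BASE}/{state}/{legislature}/{session}"
    return [f"{prefix}/{name}" for name in FILES]

urls = session_urls("MT", 66, 1)
```

Because the paths are deterministic, no manifest lookup or 404 probing is needed to enumerate a session's files.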

@jamesturk
Member

Thanks for sharing this idea here.

Let's take one step back and discuss the problem we're trying to solve. There are a few issues I see with this technical implementation, but understanding the reasons for these decisions and the desired outcome would help us plan something that could meet these needs and others.

A couple of questions to hopefully address:

  • It'd be good to understand the needs/use case here. We have an API for programmatic access, and bulk data isn't updated as regularly (monthly or weekly instead of continuously), so it'd be good to understand what use cases we aren't currently accounting for. Can you elaborate on how you came to want this?
  • Is there a reason you request FTP instead of HTTPS access? I can't see us offering FTP but if there are things you're trying to accomplish that way I'm open to finding ways to handle them.
  • The directory layout within the individual zip files offered is uniform and in a very similar format to what you suggest (split apart into per-session zip files). I'm trying to think of times when downloading the zip wouldn't suffice, but I'm coming up short. Can you elaborate on the use cases for having the bare CSVs instead of a similar structure within the zip files?

I can see a few options for improvements coming out of this such as providing a JSON manifest or getting us back to more predictable file names for the zip files. Glad to work to try to find a solution here.

@mattheiler
Author

I’ve been using the v3 API, but quickly hit a 429 status code. Now that’s a separate issue, and one that could be easily fixed, but I would also imagine it solves a different use case entirely: smaller, targeted data. My primary use case is parsing the older data. It’s much more efficient to pull down bulk data at rest and then run an expensive ETL process. That same data can be tagged and reused, transactionally. I can build retry policies, construct semaphores to rate limit client access, paginate the data, and then process it, but that seems expensive and prone to concurrency issues. Secondarily, consistent and predictable access to the bulk data seems necessary to break down the data into consumable chunks - easier to read/understand and program for. I see the API as a great tool to get smaller updates when caches are invalidated or otherwise, outside of the bulk ETL process use case.
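The retry machinery described above (handling 429 responses with backoff before an ETL run) can be sketched in a few lines. This is an illustrative stand-in, not the actual v3 API client: `fake_fetch` simulates an endpoint that rate-limits the first two calls.

```python
import time

def with_retries(fetch, max_attempts=5, base_delay=0.01):
    """Call fetch(); on an HTTP-429-style response, back off exponentially and retry."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return status, body

# Simulated endpoint: rate-limits the first two calls, then succeeds.
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return (429, None) if calls["n"] <= 2 else (200, "bill data")

status, body = with_retries(fake_fetch)
```

Multiply this by pagination and concurrency control and the point stands: a one-time bulk download is far cheaper for historical ETL than paging through a rate-limited API.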

@mattheiler
Author

An alternative to FTP could be HTTPS, but a client would need to be built in order to extract the data. I have the US Census Tiger FTP server in mind: download all the data you need, legacy included, and then process it. I imagine state legislative data rarely changes after the session expires.

@mattheiler
Author

A listing of what zips are available and consistent naming conventions are both missing, from what I can tell, unless I overlooked something.

@jamesturk
Member

jamesturk commented Mar 17, 2021 via email

@mattheiler
Author

mattheiler commented Mar 17, 2021

Yes, that's what I was looking at.

https://data.openstates.org/csv/latest/MN_2017s1_csv_1WmTzoKh4tCRQRBBIR6DDI.zip
"Bill & Vote CSV Data" is the data I want, but the links are inconsistent and CSV isn't the best format for my use case.

"PostgreSQL Data Dump" has the bulk data I want, but requires a bit of setup to access... I could set up Docker and access the database through the container.

"Legacy Data Dumps" is the data and format I'd like (at a quick glance), but it looks like it's stale and won't be updated.

@jamesturk
Member

Can you elaborate a bit on these two things:

  • the need for a custom client for HTTPS vs. FTP: I'm not sure I understand this part yet either. In both cases you'd need a hostname and a list of files to fetch, or a traversal policy, no?
  • re: CSV is not the best format for your use case: Your proposal as I understood it was for CSV. Are you looking for JSON instead/as well? I have experimented with that here: https://openstates.org/data/session-json/ but haven't yet committed to it long-term.

If the links didn't have the random hashes on the end would the data.openstates.org/csv/ stuff be close to what you want?

@mattheiler
Author

mattheiler commented Mar 17, 2021

Sure!

A.) For HTTPS, I would need to write a library/application that downloads the files from each resource path. That's fine, but what's missing is something like a manifest/index that tells me what's in the directory/path/route. If I'm trying to walk through all the available data, I need to know that the resource at "~/MT/116/2/bills.json" is available, or I'd have to keep traversing until I hit a 404. With FTP, you can download just the chunk/slice you need using an FTP client (e.g., FileZilla): I can request all files for Montana. You could still run into use cases for manifests/indexes with FTP, but at least you can download an entire folder recursively and then process it. HTTPS works, if that's easier to manage.

B.) JSON would be preferable to CSV; that would be great. Right now, I'd have to scrape the DOM to get all the links, ignore the checksum hashes, and handle the following conventions in order to extract metadata and determine what's available in bulk, programmatically:

https://data.openstates.org/json/latest/ia_2017-2018_json_6DPvA516TSfYZac9ZMiGgL.zip
https://data.openstates.org/json/latest/in_2018ss1_json_4hvwnIwvxrPipjyLnUEK4w.zip
https://data.openstates.org/json/latest/ky_2017RS_json_16dEpSSGPNrp8rRgJESd0s.zip
https://data.openstates.org/json/latest/ky_2019SS_json_1Kz72iU6XG1BxDWraUu8WF.zip
https://data.openstates.org/json/latest/la_2018%201st%20Extraordinary%20Session_json_Ne5BCiGPXvgHhTbE2OjRq.zip

So, I guess given A and B, we can maybe adjust the request to: a Bulk Data API with JSON resources? My thought was that an FTP file dump seemed like an easier/more direct solution to bulk data, but either does the job.

Edit: the FTP protocol allows for the ls (LIST) command, which would negate the need for a file directory manifest/index.
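The scraping step in B can be sketched as a filename parser. The pattern below is reverse-engineered from the five example links above, so it's an assumption that all links follow it; the hash segment is matched and discarded.

```python
import re
from urllib.parse import unquote, urlparse

# <state>_<session label>_<format>_<opaque hash>.zip, inferred from
# the example links; may not cover every file on the site.
PATTERN = re.compile(r"^([a-z]{2})_(.+)_(json|csv)_([0-9A-Za-z]+)\.zip$")

def parse_link(url: str):
    """Extract state, session, and format from a bulk-download URL, or None."""
    name = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    m = PATTERN.match(name)
    if not m:
        return None
    state, session, fmt, _hash = m.groups()  # checksum hash is ignored
    return {"state": state, "session": session, "format": fmt}

info = parse_link(
    "https://data.openstates.org/json/latest/ia_2017-2018_json_6DPvA516TSfYZac9ZMiGgL.zip"
)
```

Note the percent-decoding needed for session labels containing spaces (the Louisiana example), which is exactly the kind of special-casing a manifest would make unnecessary.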

@jamesturk
Member

I agree and think a manifest containing links to all of the files would be doable here, and would be the main addition to what we already have (the manifest could list both JSON & CSV versions).

@mattheiler
Author

mattheiler commented Mar 26, 2021

Would something like this work?

swagger: "2.0"
info:
  version: "1.0.0"
  title: "Open States Bulk Data API"
paths:
  /sessions:
    get:
      produces:
      - "application/json"
      parameters:
      - name: "state"
        in: "query"
        type: "string"
        enum:
        - "AL"
        - "AK"
        - "AS"
        - ".."
        - "WY"
      - name: "legislature"
        in: "query"
        type: "integer"
      - name: "number"
        in: "query"
        type: "integer"
        enum:
        - 1
        - 2
      - name: "classification"
        in: "query"
        type: "string"
        enum:
        - "regular"
        - "special_or_extraordinary"
        - "interim"
      responses:
        "200":
          description: "success"
          schema:
            type: "array"
            items:
              $ref: "#/definitions/Session"
definitions:
  Session:
    type: "object"
    properties:
      state:
        type: "string"
      legislature:
        type: "integer"
      number:
        type: "integer"
      classification:
        type: "string"
        enum:
        - "regular"
        - "special_or_extraordinary"
        - "interim"
      files:
        type: "array"
        items:
          $ref: "#/definitions/SessionFile"
  SessionFile:
    type: "object"
    properties:
      url:
        type: "string"
      name:
        type: "string"
      type:
        type: "string"
        enum:
        - "bill_abstracts"
        - "bill_actions"
        - "bill_sources"
        - "bill_sponsorships"
        - "bill_version_links"
        - "bill_versions"
        - "bills"
        - "vote_counts"
        - "vote_people"
        - "vote_sources"
        - "votes"

@mattheiler
Author

Or if you wanted to return the raw JSON... I suppose that's what you had in mind, reading back.

@jamesturk
Member

Yes, I think writing out a static file would be the preferred way to handle this since it doesn't change frequently enough IMO to warrant an API.

Would you need much more than this:

{
  "files": [
    {
      "jurisdiction": "ocd-jurisdiction/country:us/state:nc",
      "session": "2019",
      "type": "JSON Bulk Download",
      "updated_at": "2020-01-01T12:34:56Z",
      "url": "https://data.openstates.org/path/to/file"
    },
    {
      "jurisdiction": "ocd-jurisdiction/country:us/state:nc",
      "session": "2020",
      "type": "JSON Bulk Download",
      "updated_at": "2020-01-01T12:34:56Z",
      "url": "https://data.openstates.org/path/to/file"
    }
  ]
}
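Consuming that static manifest would be a few lines of client code. A minimal sketch, assuming the manifest shape shown above (`urls_for` is a hypothetical helper name, and the manifest is inlined here rather than fetched over HTTPS):

```python
import json

# Inlined copy of the proposed manifest; in practice this would be
# fetched from a static URL on data.openstates.org.
MANIFEST = json.loads("""
{
  "files": [
    {"jurisdiction": "ocd-jurisdiction/country:us/state:nc",
     "session": "2019", "type": "JSON Bulk Download",
     "updated_at": "2020-01-01T12:34:56Z",
     "url": "https://data.openstates.org/path/to/file"},
    {"jurisdiction": "ocd-jurisdiction/country:us/state:nc",
     "session": "2020", "type": "JSON Bulk Download",
     "updated_at": "2020-01-01T12:34:56Z",
     "url": "https://data.openstates.org/path/to/file"}
  ]
}
""")

def urls_for(manifest, jurisdiction, file_type="JSON Bulk Download"):
    """Return (session, url) pairs for one jurisdiction and file type."""
    return [
        (f["session"], f["url"])
        for f in manifest["files"]
        if f["jurisdiction"] == jurisdiction and f["type"] == file_type
    ]

pairs = urls_for(MANIFEST, "ocd-jurisdiction/country:us/state:nc")
```

A flat list like this keeps the manifest trivially filterable by jurisdiction, session, or format without any server-side API.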

@mattheiler
Author

Sorry for the delayed response. I think session number would implicitly cover legislature number. State would be covered by jurisdiction. Session classification would be nice, if not provided in the files (e.g., regular, special_or_extraordinary, interim). Would there be one file with all of the bulk data (i.e., bill_abstracts, bill_actions, bill_sources, etc.)? If so, that should be enough!
