Where are you getting the mappings data? #39

Open
NickCrews opened this issue May 25, 2023 · 6 comments

@NickCrews

Hi!

I'm working on a port of this to Rust. I'm trying to decide where to source the schema mappings. Possible options I've found are:

  • The mappings.json file in this repo
  • The CSVs from https://github.com/dwillis/fech-sources

I found that FastFEC chose to use this repo as their upstream.

Could you explain what you see as the pros/cons of each of these?

So far, what I see are:

  • The CSVs don't seem to be complete; e.g., they don't contain a schema for F3N forms
  • The CSVs don't contain type info

Are there other considerations I'm missing?

I would love for there to be one complete and accurate listing of schemas, so that the wide range of parsers wouldn't each have to duplicate this effort. Any idea what would be required to make that happen?

CC @esonderegger @dwillis @freedmand

@esonderegger (Owner)

If I had one regret with respect to how this library is set up, it would be that I wish I had used a git submodule to have the CSV files from https://github.com/dwillis/fech-sources be the source of mapping data.

At the time, the Senate was doing its filings via paper and those senate filings were important to my employer. So I spent a fair amount of time extending the js/json file that @chriszs had built for fec-parse (which also uses fech-sources as its upstream source of data). I found the json easy to hand-edit, but it's been a task long on my to-do list to get the mappings I added back into fech-sources.

The most definitive source of data for these mappings is the xls/xlsx files the FEC hosts here. Click to expand "Electronically filed reports" and "Paper filed reports", then click to download the "File formats, header file and metadata" files. I also highly recommend reading through those files before embarking on writing a parser, as they will be a huge help in understanding how the filings are structured.

To answer your two questions:

  • The CSVs are complete for filings like the F3: the F3N, F3A, and F3T all use the same mappings. I believe N stands for "new", A stands for "amended", and T stands for "terminate". (For anyone who knows for sure, please correct me if I'm wrong on these.) For example, if a committee is filing an F3 for the first time in a given cycle, it's an F3N. If they need to amend it, it's an F3A. If the committee is shutting down, its last F3 filing will be an F3T.
  • I wouldn't recommend using my types.json file for anything, as it's nowhere near complete. I've started working on a script to merge mappings from the json back into the CSVs, and in that script I've worked on bringing in type information from the FEC's xls files. It would then be up to the maintainers of the parsing libraries for each language to map those type codes onto native types. For example, the FEC uses AMT-12 for fields that are normally monetary values, and NUM-8 for dates. One potential minefield to warn you about: fields marked A-1 could be boolean, but could also have more options. I don't think there's any way to programmatically know, and I'd recommend keeping those as strings/chars. (A sketch of both of these points follows this list.)
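
Since your port is in Rust, here's a rough sketch of both points: stripping the N/A/T suffix so F3N, F3A, and F3T all resolve to the F3 mappings, and coarsely mapping FEC type codes to dtypes. The Dtype enum and the exact set of codes handled are illustrative assumptions on my part, not an authoritative mapping:

    /// Coarse dtypes a parser might map FEC type codes onto.
    /// Illustrative only; the real set of codes is in the FEC's xls files.
    #[derive(Debug, PartialEq)]
    enum Dtype {
        String,
        Money,
        Date,
    }

    /// F3N, F3A, and F3T all share the F3 mappings, so strip the
    /// new/amended/terminate suffix (at most once) before schema lookup.
    fn base_form_type(form_type: &str) -> &str {
        form_type
            .strip_suffix(|c: char| matches!(c, 'N' | 'A' | 'T'))
            .unwrap_or(form_type)
    }

    /// Map an FEC type code to a coarse dtype. A-1 fields are kept as
    /// strings because they are not reliably boolean.
    fn dtype_for(fec_type: &str) -> Dtype {
        match fec_type {
            t if t.starts_with("AMT") => Dtype::Money, // e.g. AMT-12
            "NUM-8" => Dtype::Date,                    // dates are NUM-8
            _ => Dtype::String,                        // A-1, A/N-*, etc.
        }
    }

    fn main() {
        assert_eq!(base_form_type("F3N"), "F3");
        assert_eq!(base_form_type("F3"), "F3");
        assert_eq!(dtype_for("AMT-12"), Dtype::Money);
        assert_eq!(dtype_for("NUM-8"), Dtype::Date);
        assert_eq!(dtype_for("A-1"), Dtype::String);
    }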

What I see as the pros/cons of each are:

  • The mappings.json file has some mappings for either super-old versions or obscure filings that don't exist in the CSVs. If your goal is to write a script that ingests every filing going back to the beginning (1997?) and gets something out of each one, I'd go this route.
  • The CSVs are likely to be the first to be updated if the FEC ever comes out with a new version. If you want to parse filings from present day going forward, I'd recommend using the CSVs as your data source.

One additional reason to use the CSVs as your data source: if/when you find issues in the CSVs and make PRs into fech-sources to fix them, they will benefit everyone, as we are all downstream from them.

Good luck! Please don't hesitate to reach out if you have any questions.

@chriszs commented May 27, 2023

I spent some time a year ago trying to reproduce the CSVs from the original source Excel files, or rather, to merge the two and create JSON Schemas with typing information. This won't come as a surprise to you, but I found that both sources are fairly dirty: the CSVs are sometimes incorrect, there are a ton of records once you multiply the number of fields by the number of versions, JSON Schema has a lot of depth, and it's a difficult task. Some of that work was the basis of the draft PR you used as the basis for your fech-sources contribution. A contractor for the FEC was slowly working on a similar project in their fecfile-validate project, but with a slightly different scope (just current filings). FastFEC uses a version of the two .json files, including mappings.json, which I converted from Derek's original Ruby and then Evan and I improved over time. That's as close to a clean source as you'll find, though it originally derives from the CSVs.

@chriszs commented May 27, 2023

Oh, also Evan is correct about F3s. There's a PDF technical manual somewhere on the FEC site which details some of this; if I can find it again, I'll link to it.

@NickCrews (Author)

Thank you both so much for this. Oh man I just got overwhelmed ;)

I've been looking into JSONSchema for a while, and I've concluded that it is overkill for what we need. Still, I had a few thoughts that I wanted to write down.

JSONSchema musings

fecfile-validate looks as canonical as you can get. It looks like they are sourcing their schemas from the .xls files you mentioned above, but it seems they also don't trust those .xls files and have to hand-edit them.

@chriszs by "current filings" do you mean fecfile-validate only supports filing versions 8.3+? That wouldn't be adequate for my (and I bet others') needs. I doubt the FEC will be motivated to support older versions, so we would need to supplement this.

@chriszs mentions the combinatorial explosion, but I think we could get around this by re-using sub-schemas. Am I missing something there? Still, I'm not sure we need the full power of JSONSchema, and therefore I'm not sure it's worth bringing in that complication. Am I right that all we need on top of the mappings are the dtypes that should get parsed? For example, I don't think we need the full

"form_type": {
            "title": "FORM TYPE",
            "description": "",
            "const": "SA11C",
            "examples": [
                "SA11C"
            ],
            "fec_spec": {
                "FIELD_DESCRIPTION": "FORM TYPE",
                "TYPE": "A/N-8",
                "REQUIRED": "X (error)",
                "SAMPLE_DATA": "SA11C",
                "VALUE_REFERENCE": "SA11C Only",
                "RULE_REFERENCE": null,
                "FIELD_FORM_ASSOCIATION": null
            }
        }

that JSONSchema provides.
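
For contrast, here's a hypothetical sketch of the slimmer representation I have in mind: just an ordered list of column names, each with a coarse dtype. (Rust, since that's what feco3 is in; every field name and the dtype set here are made up for illustration, not taken from any real mapping file.)

    /// Coarse dtypes; illustrative only.
    #[derive(Debug)]
    enum Dtype {
        String,
        Money,
        Date,
    }

    /// A slim schema entry: column name plus the dtype to parse it as.
    /// Everything else in the JSON Schema above (titles, samples, rule
    /// references) is dropped.
    struct FieldSpec {
        name: &'static str,
        dtype: Dtype,
    }

    /// The first few columns of a hypothetical SA11C mapping.
    const SA11C_FIELDS: &[FieldSpec] = &[
        FieldSpec { name: "form_type", dtype: Dtype::String },
        FieldSpec { name: "filer_committee_id_number", dtype: Dtype::String },
        FieldSpec { name: "contribution_date", dtype: Dtype::Date },
        FieldSpec { name: "contribution_amount", dtype: Dtype::Money },
    ];

    fn main() {
        for field in SA11C_FIELDS {
            println!("{} -> {:?}", field.name, field.dtype);
        }
    }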

Path Forward

OK, it sounds like updating fech-sources is what both of you are most supportive of, and I think that would work just fine for me. Adding types would be great, but even just having the mappings be complete and functional would be fine. I think the todos would be:

  [ ] merge dwillis/fech-sources#11
  [ ] figure out dwillis/fech-sources#12
  [ ] use the migration scripts @esonderegger wrote to bring the .json stuff back in, and add types. I can help here, @esonderegger, if you point me in the right direction.

CC @mjtravers from fecfile-validate, if you have any thoughts on how we could team up at all.

@freedmand

Hi! Firstly: sorry I haven't been able to find time to get to your PR in FastFEC. (Though I have validated there is no perf difference in your version, I did notice some diffs that I'm still working through to figure out.) We will be focusing more on FEC at The Post later this year, though I'm hoping to find time sooner. But I do want to chime in here to say:

  1. Great to have you working on this! I'm very curious to know what your goals/motivation are for developing this generally. I would love to be able to collaborate as effectively as possible. It's a small world of folks tackling these problems, and you've cc'd a good chunk of them. If you ever want to hop on a call to discuss any of this, I'd be game to find some time.

  2. FastFEC very much comes from translating fecfile to C, and it is downstream of all the work you've identified. It seems like you've already mostly uncovered this lineage, in addition to how the mapping files have been handed down. At some point, cleaning all this up and having a centralized source for these that's community- and/or FEC-maintained would be wonderful. I think the filings <8.3 are handled decently well by the current mappings files; at least they have been for our purposes at The Post, loading many historic filings into a centralized/searchable database. But it would be worth investigating further; there are surely possible improvements.

  3. @chriszs indeed has put some time into trying to standardize the typings in a unified way. He can correct me, but I think re: combinatorial expansion, there are just very minute differences between each version that would be hard to capture in a nice way, even with reusing sub-schemas. It's a painstaking process generally, as the source xls files mentioned above are not always perfect. And despite this, filings themselves have various errors too, so parsers may need several layers of tolerance baked in to handle weirdness; filings can be very messy, e.g. missing columns, shifted columns, inconsistent formats, missing characters, etc. (A sketch of one simple tolerance layer follows this list.)
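
To make that last point concrete, here is one minimal example of such a tolerance layer, in Rust. The policy shown (pad short rows with empty strings, truncate long ones) is an assumption of mine and just one possible approach; real parsers will need more nuance, e.g. for shifted columns:

    /// Force a parsed row to the column count the schema expects:
    /// pad short rows with empty strings, truncate long ones.
    /// This is one simple tolerance policy, not a complete solution;
    /// shifted columns, for example, need smarter handling.
    fn normalize_row(mut row: Vec<String>, expected_cols: usize) -> Vec<String> {
        if row.len() < expected_cols {
            row.resize(expected_cols, String::new()); // missing columns
        } else {
            row.truncate(expected_cols); // extra columns
        }
        row
    }

    fn main() {
        let short = vec!["SA11C".to_string(), "C00123456".to_string()];
        let fixed = normalize_row(short, 4);
        assert_eq!(fixed.len(), 4);
        assert_eq!(fixed[2], "");
    }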

Looking forward to seeing what you come up with. And thanks for organizing this discussion.

@chriszs commented May 27, 2023

Yes, my design heavily uses sub-schemas.

Yes, there are a lot of edge cases.

Correct that fecfile-validate only seems interested in the current version.

I think a plan that focuses on improving the CSVs and converting from there sounds reasonable.
