Skip to content

Latest commit

 

History

History
33 lines (23 loc) · 1.91 KB

README.md

File metadata and controls

33 lines (23 loc) · 1.91 KB

Crump

Known Vulnerabilities

A parser for the Virginia State Corporation Commission's business entity records, which are provided as a single, enormous fixed-width file. Named for Beverley T. Crump, the first member of the State Corporation Commission.

Crump retrieves the current SCC records (updated weekly) and turns them into CSV and JSON. Alternately, it can improve the quality of the data (formatting dates, ZIP codes, replacing internal status codes with human-readable translations, etc.), atomize the data into millions of individual JSON files, or create Elasticsearch-compatible bulk API data.

The most recent copy of the raw SCC data can be found at https://s3.amazonaws.com/virginia-business/current.zip.

Usage

usage: crump [-h] [-a] [-i file.txt] [-o output_dir] [-t] [-d] [-e] [-m]

optional arguments:
  -h, --help            show this help message and exit
  -a, --atomize         generate millions of per-record JSON files
  -i file.txt, --input file.txt
                        raw SCC data (default: cisbemon.txt)
  -o output_dir, --output output_dir
                        directory for JSON and CSV
  -t, --transform       format properly date, ZIP, etc. fields
  -d, --download        download the data file, if missing
  -e, --elasticsearch   create Elasticsearch bulk API data
  -m, --map             generate Elasticsearch index map

For general purposes, ./crump -td is probably the best way to invoke Crump. This will download the current data file and transform the data to make it adhere to basic data quality norms.

License

Released under the MIT License.