How can we store static ACS data outside of the database? #44
After doing another data update for 2015 ACS data, I'm really interested in making this happen. I'd like to figure out a way to keep Census Reporter running for longer by not paying ~$40/mo in PostgreSQL data storage.
What are some example queries made against this data?
@migurski it is mostly just filtering, either for an explicit list of geographies and/or a containment query (geographies of type X contained by Y). The API supports selecting one or more specified columns from diverse tables (instead of just "all columns in the table"), but that's a pretty specialized use case, probably rarely used. For our profile pages, we gather data from many tables. We could consider preprocessing those, especially because API users have expressed interest in being able to query the whole set of profile page figures for geographies rather than having to do the aggregation themselves. I'm pretty sure that's it. @iandees am I forgetting anything?
The way I like to describe it is that it's a big huge spreadsheet with thousands of columns (columns are grouped into tables of related columns – these are only loosely related to DBMS columns/tables) and tens of thousands of rows. In Census-speak, the columns are things like "Population" and "Median Household Income" and rows are geographies like Milwaukee, WI or the ZCTA 24450. Each cell in this spreadsheet has an estimate and a measurement of error (aka uncertainty).
Like Joe said, the API supports taking chunks out of this gigantic spreadsheet by letting the user specify a comma-separated list of geographies and a comma-separated list of columns. So you might send a request for A, B, C across X, Y, Z (where A, B, C are columns and X, Y, Z are geographies).
Like Joe said, the majority of the queries come from our profile pages, which make several requests like my example here, do some math to put the data together, and show a page like this.
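For concreteness, a request of that shape against the existing API might look roughly like the sketch below; the table and geography IDs are just examples, and the response layout (data keyed by geoid, then by table ID, with estimate/error values per column) is an assumption for illustration rather than a documented contract.

```python
import json
from urllib.request import urlopen

# Illustrative only: one table of related columns (B01001) for two geographies.
url = ("https://api.censusreporter.org/1.0/data/show/latest"
       "?table_ids=B01001&geo_ids=04000US06,05000US06001")

with urlopen(url) as resp:
    payload = json.loads(resp.read().decode("utf-8"))

# Assuming the response keys data by geoid, then by table ID.
for geoid, tables in payload.get("data", {}).items():
    print(geoid, list(tables))
```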
I guess you could PUT an enormous slab of data into S3, and then bake a number of b-tree indexes into it that would support range queries. I've never done anything like this; it'd be like network-mapping a SQLite file, I suppose, so maybe someone has done that? Seeking into the indexes will also be tricky; I expect that each one would be pretty large. What's querying the DB? Something on EC2? Do you need to retrieve the full row each time, or could this behave more like a column-oriented DB? Also, for the spatial queries, I guess you'd want a GIN-style inverted index mapping some kind of tile to each geography.
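The core of that idea, once an index has said where a record lives, is just an HTTP Range request against the object. A minimal sketch, assuming a hypothetical bucket URL and an already-known offset and length:

```python
from urllib.request import Request, urlopen

# Hypothetical object holding the full data slab; the offset and length for a
# given GEOID would come from an index baked alongside it.
DATA_URL = "https://example-bucket.s3.amazonaws.com/acs2015_1yr/data.csv"

def fetch_record(offset: int, length: int) -> bytes:
    """Pull a single record out of the slab with an HTTP Range request."""
    rng = f"bytes={offset}-{offset + length - 1}"
    req = Request(DATA_URL, headers={"Range": rng})
    with urlopen(req) as resp:  # S3 answers range requests with 206 Partial Content
        return resp.read()
```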
If something on EC2 is doing the querying, what about a mounted EBS volume containing database files? SQLite or even Postgres could work.
The current setup is one EC2 instance querying an RDS instance. The 300GB EBS volume attached to the RDS instance is currently the biggest expense we have and what I want to get rid of :).
Ah okay, that makes sense. I'd love to play with this; it's an interesting problem I toyed with back when S3 first came out, when I wanted to see whether there was a way to interact with indexed data client-side. Would you ever search on values in the table other than geography, or is it really just ID lookup?
It's only ever an ID lookup. There's currently no functionality to do things like filtering on data values.
The universe of GEOIDs for ACS looks to be 7,845; any reason not to keep each of these in separate S3 objects so they can be requested individually? Then all that's left is a spatial index.
…or should this process work for the much larger 5-year estimates as well?
There are many, many more in the 5-year, but splitting into files per geoid does make sense for the kinds of queries people would tend to do (focused on geographies).
Right, makes sense. I'm planning to pick at a basic b-tree for GEOIDs as well as a quadtree-based spatial index for geographies. The spatial index would require something like GEOS or Shapely client-side to work.
You can probably leave out geospatial indexing for now. We can pre-generate vector tiles for all of TIGER.
FWIW, this is basically how we built census.ire.org. EDIT: I guess this page is closer to the mark for the JSON-per-geoid thing: http://census.ire.org/docs/javascript-library.html
Nice! Also similar to how I whacked together census-tools for the 2000 decennial: https://github.com/migurski/census-tools
I took a swing at this on the flight; here's a gist: https://gist.github.com/migurski/659cafc98779fadb5c1e79533206372d
The files on disk are encoded Latin-1 like most Census stuff, and the scripts use Python 3 for the unicode goodness with no 3rd-party libs. The index file is a simple 3-column CSV: the GEOID, byte offset, and byte length of each record. There's no particular reason the records have to be CSV, by the way. I've tested this locally with all 757MB of the 2015 ACS 1-year and it's pretty speedy at retrieving records.
I have a proper git repo for this locally.
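Given the 3-column index format described above (GEOID, byte offset, byte length), a local lookup amounts to a dict and a seek. A rough sketch, with placeholder file names rather than the ones in the gist:

```python
import csv

def load_index(path: str) -> dict:
    """Map each GEOID to its (byte offset, byte length) in the data file."""
    with open(path, newline="", encoding="latin-1") as f:
        return {geoid: (int(offset), int(length))
                for geoid, offset, length in csv.reader(f)}

def get_record(data_path: str, index: dict, geoid: str) -> bytes:
    """Read one record straight out of the big file without scanning it."""
    offset, length = index[geoid]
    with open(data_path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Placeholder file names, not the ones used in the gist:
# index = load_index("acs2015_1yr_index.csv")
# record = get_record("acs2015_1yr_data.csv", index, "04000US06")
```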
I wrote a small Flask app to serve this index, and posted it to Heroku with data living in two places. Responses include remote data access times in headers.
With the indexes and data living on Benzene (openstreetmap.us), retrieving two geographies takes 3.75 total seconds:

> curl -si 'https://census-remote-index.herokuapp.com/?geo_ids=05000US06001,04000US06' | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.769186 0.005368
X-Data-Times: 1.981628 0.996756
Content-Length: 333663
Content-Type: text/plain
ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89

With the indexes and data living on S3, retrieving two geographies takes 0.45 total seconds:

> curl -si 'https://census-remote-index.herokuapp.com/?geo_ids=05000US06001,04000US06' --compress | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.073364 0.004485
X-Data-Times: 0.260381 0.115801
Content-Length: 333663
Content-Type: text/plain
ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89

With the indexes and data living locally:

> curl -si 'http://127.0.0.1:5000/?geo_ids=05000US06001,04000US06' | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.028033 0.004092
X-Data-Times: 0.060889 0.025159
Content-Length: 333663
Content-Type: text/plain
ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89

Stuff to think about: gzipping records on S3? Storing them in JSON or GeoJSON with geometries? Is it fast enough?
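For reference, the shape of a Flask app like that might be roughly the following. This is a sketch rather than the actual demo code: the bucket URL is a placeholder, and the index is assumed to be held in memory, whereas in the demo above the indexes live remotely alongside the data.

```python
import time
from urllib.request import Request, urlopen

from flask import Flask, Response, request

app = Flask(__name__)

# Placeholder data location and an in-memory index of geoid -> (offset, length).
DATA_URL = "https://example-bucket.s3.amazonaws.com/acs2015_1yr/data.csv"
INDEX = {}  # populated from the 3-column CSV index at startup

@app.route("/")
def lookup():
    records, index_times, data_times = [], [], []
    for geoid in request.args.get("geo_ids", "").split(","):
        t0 = time.time()
        offset, length = INDEX[geoid]          # index lookup
        index_times.append(time.time() - t0)

        t0 = time.time()
        rng = f"bytes={offset}-{offset + length - 1}"
        with urlopen(Request(DATA_URL, headers={"Range": rng})) as resp:
            records.append(resp.read().decode("latin-1"))
        data_times.append(time.time() - t0)

    out = Response("".join(records), mimetype="text/plain")
    out.headers["X-Index-Times"] = " ".join(f"{t:.6f}" for t in index_times)
    out.headers["X-Data-Times"] = " ".join(f"{t:.6f}" for t in data_times)
    return out
```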
Thanks for doing this, @migurski. This is really great. I will take a closer look tomorrow, but this seems like a great start. Some first thoughts:
More tomorrow.
Interestingly (to me), the index lookup times are only about 20% of the total, so it might not be a huge savings to pre-slice the data by geography? The data this demo is hosting has about 150KB per geography, but its columns are not marked in any sensible way. I suppose that a JSON blob per record, perhaps gzipped, might make more sense.
I tried the gzipped JSON blob idea. Example is posted on S3 here. It's 249KB gzipped on S3 and 1.1MB expanded in the browser. Timing the fetch, decompress, and parse steps shows each taking about 0.25s, for a total of just under 0.75s on my laptop over wireless.
(Note that embed.censusreporter.org is a CloudFront distribution that points to the S3 bucket.) So this is quite a bit slower, but it's easier to understand and leaves room in the JSON for other metadata about the geography.
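Fetching one of those per-geography blobs would amount to something like the sketch below; the URL pattern is hypothetical, and it assumes the object is stored as gzipped bytes rather than being transparently decompressed for the client.

```python
import gzip
import json
from urllib.request import urlopen

# Hypothetical key layout: one gzipped JSON object per geoid behind the
# embed.censusreporter.org CloudFront distribution.
url = "https://embed.censusreporter.org/1.0/data/profiles/04000US06.json.gz"

with urlopen(url) as resp:
    raw = resp.read()                              # gzipped bytes as stored on S3
blob = json.loads(gzip.decompress(raw).decode("utf-8"))
print(len(blob))                                   # number of top-level keys in the profile
```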
@iandees What's the plan for the full text search feature in this model?
We can still have a database, but it doesn't need to be 300+ GB if we can put the bulk data on S3.
Problem: we have 100s of GB of ACS data sitting around in a database that's running 24/7 and a very small amount of it is queried.
Idea: Could we store data in S3 in such a way that we could do a range request for data without having to query the database?