
How can we store static ACS data outside of the database? #44

Open
iandees opened this issue May 14, 2016 · 23 comments

@iandees
Member

iandees commented May 14, 2016

Problem: we have 100s of GB of ACS data sitting around in a database that's running 24/7 and a very small amount of it is queried.

Idea: Could we store data in S3 in such a way that we could do a range request for data without having to query the database?
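
A minimal sketch of that idea, assuming the data already lives in a hypothetical public bucket and we already know the byte offset and length of the record we want (producing those offsets is the indexing question discussed below):

import requests

# Sketch only: pull one record's bytes out of a big object on S3 with an HTTP
# Range request. The URL, offset, and length here are hypothetical placeholders.
url = "https://example-bucket.s3.amazonaws.com/acs/data.csv"
offset, length = 1_048_576, 4_096   # byte position and size of the record we want

resp = requests.get(url, headers={"Range": "bytes=%d-%d" % (offset, offset + length - 1)})
assert resp.status_code == 206      # S3 answers 206 Partial Content for range requests
print(resp.content[:80])            # just the requested slice, not the whole object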

@iandees
Member Author

iandees commented Sep 20, 2016

After doing another data update for 2015 ACS data, I'm really interested in making this happen. I'd like to figure out a way to keep Census Reporter running for longer by not paying ~$40/mo in PostgreSQL data storage.

/cc @migurski @jedsundwall

@migurski

What are some example queries made against this data?

@JoeGermuska
Member

@migurski it is mostly just filtering: either an explicit list of geographies, a containment query (geographies of type X contained by Y), or both.

The API supports selecting one or more specified columns from diverse tables (instead of just "all columns in the table"), but that's a pretty specialized use case, probably rarely used.

For our profile pages, we gather data from many tables. We could consider preprocessing those, especially because API users have expressed interest in being able to query the whole set of profile page figures for geographies rather than having to do the aggregation themselves.

I'm pretty sure that's it. @iandees am I forgetting anything?

@iandees
Member Author

iandees commented Sep 21, 2016

The way I like to describe it is that it's one huge spreadsheet with thousands of columns and tens of thousands of rows (the columns are grouped into tables of related columns, which are only loosely related to DBMS columns/tables). In Census-speak, the columns are things like "Population" and "Median Household Income", and the rows are geographies like Milwaukee, WI or the ZCTA 24450. Each cell in this spreadsheet has an estimate and a measurement of error (aka uncertainty).

       A       B       C
X    e / u   e / u   e / u
Y    e / u   e / u   e / u
Z    e / u   e / u   e / u

(where A, B, C are columns and X, Y, Z are geographies)

Like Joe said, the API supports taking chunks out of this gigantic spreadsheet by letting the user specify a comma-separated list of geographies and a comma-separated list of columns. So if you sent ?geo_ids=X,Z&table_ids=B, you'd get the highlighted data:

       A       B       C
X            e / u
Y
Z            e / u

Like Joe said, the majority of the queries come from our profile pages, which make several requests like my example here, do some math to put the data together, and show a page like this.
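
For illustration, the slicing behavior in miniature, as a toy Python model with invented values (the real data is far too large to hold in a dict like this):

# Toy model of the "big spreadsheet": rows are geographies, columns are Census
# columns, and each cell is an (estimate, error) pair. All values are invented.
spreadsheet = {
    "X": {"A": (1200, 75), "B": (53000, 2100), "C": (410, 20)},
    "Y": {"A": (880, 60),  "B": (48000, 1900), "C": (370, 30)},
    "Z": {"A": (2100, 90), "B": (61000, 2500), "C": (450, 20)},
}

def slice_spreadsheet(geo_ids, column_ids):
    # Return only the requested geographies and columns, like ?geo_ids=X,Z&table_ids=B
    return {g: {c: spreadsheet[g][c] for c in column_ids} for g in geo_ids}

print(slice_spreadsheet(["X", "Z"], ["B"]))
# {'X': {'B': (53000, 2100)}, 'Z': {'B': (61000, 2500)}}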

@migurski

I guess you could PUT an enormous slab of data into S3, and then bake a number of b-tree indexes into it that would support range queries. I’ve never done anything like this — it’d be like network-mapping a SQLite file, I suppose, so maybe someone has done that? Seeking into the indexes will also be tricky; I expect each one would be pretty large. What's querying the DB? Something on EC2?

Do you need to retrieve the full row each time, or could this behave more like a column-oriented DB?

Also, for the spatial queries, I guess you'd want a GIN-style inverted index mapping some kind of tile to each geography.

@migurski

If something on EC2 is doing the querying, what about a mounted EBS volume containing database files? SQLite or even Postgres could work.

@iandees
Member Author

iandees commented Sep 21, 2016

The current setup is one EC2 instance querying an RDS instance. The 300GB EBS volume attached to the RDS instance is currently the biggest expense we have and what I want to get rid of :).

@migurski

migurski commented Sep 21, 2016

Ah okay, that makes sense. I’d love to play with this; it’s an interesting problem that I toyed with on S3 back when it first came out and I wanted to see if there was a way to interact with indexed data client-side. Would you ever search on values in the table other than geography, or is it really just ID lookup?

@iandees
Member Author

iandees commented Sep 21, 2016

It's only ever an ID lookup. There's currently no functionality to do things like WHERE population > 50000, only WHERE geo_id IN ('04055', '04023').

@migurski

migurski commented Sep 21, 2016

The universe of GEOIDs for ACS looks to be 7,845 — any reason to not keep each of these in separate S3 objects so they can be requested individually? Then all that's left is a spatial index.

@migurski

…or should this process also work for the much larger 5-year estimates?

@iandees
Member Author

iandees commented Sep 21, 2016

There are many, many more in the 5-year, but splitting into files per geoid does make sense for the kinds of queries people would tend to do (focused on geographies).

@migurski

Right, makes sense. I’m planning to pick at a basic b-tree for GEOIDs as well as a quadtree-based spatial index for geographies. It would require something like GEOS or Shapely client-side to work.
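
A rough sketch of how that could fit together, assuming Shapely is available client-side; the tiling helper and index layout here are invented for illustration, not taken from any existing repo:

import math
from shapely.geometry import Point, shape   # shape() turns GeoJSON into geometries

def tile_key(lon, lat, zoom=10):
    # Standard web-mercator (slippy map) tile for a point; the tiles form a quadtree.
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return (zoom, x, y)

def build_tile_index(geographies, zoom=10):
    # Inverted index: tile -> GEOIDs whose bounding box touches that tile.
    index = {}
    for geo_id, geom in geographies.items():
        minx, miny, maxx, maxy = geom.bounds
        _, x0, y0 = tile_key(minx, maxy, zoom)   # tile y grows southward
        _, x1, y1 = tile_key(maxx, miny, zoom)
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                index.setdefault((zoom, x, y), []).append(geo_id)
    return index

def geographies_containing(lon, lat, index, geographies, zoom=10):
    # Coarse candidate lookup via the tile index, then an exact check with Shapely.
    candidates = index.get(tile_key(lon, lat, zoom), [])
    point = Point(lon, lat)
    return [g for g in candidates if geographies[g].contains(point)]

# geographies would come from TIGER GeoJSON, e.g.:
# geographies = {f["properties"]["geoid"]: shape(f["geometry"]) for f in features}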

@iandees
Member Author

iandees commented Sep 21, 2016

You can probably leave out geospatial indexing for now. We can pre-generate vector tiles for all of TIGER.

@JoeGermuska
Member

JoeGermuska commented Sep 21, 2016

FWIW, this is basically how we built census.ire.org.
http://census.ire.org/data/bulkdata.html

EDIT: I guess this page is closer to the mark for the JSON-per-geoid thing http://census.ire.org/docs/javascript-library.html

@migurski

Nice! Also similar to how I whacked together census-tools for the 2000 decennial: https://github.com/migurski/census-tools

@migurski

migurski commented Sep 22, 2016

I took a swing at this on the flight; here's a gist: https://gist.github.com/migurski/659cafc98779fadb5c1e79533206372d

The files on disk are encoded Latin-1 like most Census stuff, and the scripts use Python 3 for the unicode goodness, with no 3rd-party libs. The index file is a simple 3-column CSV: the GEOID, byte offset, and byte length of each record. There's no particular reason the records have to be CSV, by the way. I've tested this locally with all 757MB of 2015 ACS 1-year data and it's pretty speedy at retrieving records.

I have a proper git repo for this locally.
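
The lookup path that describes is roughly the following (a sketch, not the gist's code; the URLs are hypothetical placeholders for wherever the index and data end up):

import csv
import io
import requests

INDEX_URL = "https://example-bucket.s3.amazonaws.com/acs2015_1yr/index.csv"   # hypothetical
DATA_URL = "https://example-bucket.s3.amazonaws.com/acs2015_1yr/data.csv"     # hypothetical

def load_index(url):
    # The 3-column index CSV: GEOID, byte offset, and byte length of each record.
    text = requests.get(url).text
    return {geoid: (int(offset), int(length))
            for geoid, offset, length in csv.reader(io.StringIO(text))}

def fetch_record(geo_id, index):
    # Pull just one geography's record out of the big data file with a Range request.
    offset, length = index[geo_id]
    headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
    return requests.get(DATA_URL, headers=headers).content

index = load_index(INDEX_URL)
print(fetch_record("04000US06", index)[:100])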

@migurski

migurski commented Sep 24, 2016

I wrote a small Flask app to serve this index, and posted it to Heroku with data living in two places. Responses include remote data access times in headers.

With the indexes and data living on Benzene (openstreetmap.us), retrieving two geographies takes 3.75 total seconds:

> curl -si 'https://census-remote-index.herokuapp.com/?geo_ids=05000US06001,04000US06' | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.769186 0.005368
X-Data-Times: 1.981628 0.996756
Content-Length: 333663
Content-Type: text/plain

ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89

With the indexes and data living on S3, retrieving two geographies takes 0.45 total seconds:

> curl -si 'https://census-remote-index.herokuapp.com/?geo_ids=05000US06001,04000US06' --compress | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.073364 0.004485
X-Data-Times: 0.260381 0.115801
Content-Length: 333663
Content-Type: text/plain

ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89

With the indexes and data living at http://localhost, retrieving two geographies takes 0.12 total seconds, as a baseline:

> curl -si 'http://127.0.0.1:5000/?geo_ids=05000US06001,04000US06' | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.028033 0.004092
X-Data-Times: 0.060889 0.025159
Content-Length: 333663
Content-Type: text/plain

ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89

Stuff to think about: gzipping records on S3? Storing them in JSON or GeoJSON with geometries? Is it fast enough?
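
For reference, an endpoint in the shape described here might look roughly like this (a sketch under assumed URLs, not the actual Heroku app's code):

import csv, io, time

import requests
from flask import Flask, Response, request

INDEX_URL = "https://example-bucket.s3.amazonaws.com/acs2015_1yr/index.csv"   # hypothetical
DATA_URL = "https://example-bucket.s3.amazonaws.com/acs2015_1yr/data.csv"     # hypothetical

app = Flask(__name__)

@app.route("/")
def records():
    geo_ids = request.args["geo_ids"].split(",")

    # Load the GEOID -> (offset, length) index. The real app seeks into the remote
    # index per geography, which is why its X-Index-Times header lists several values.
    t0 = time.time()
    rows = csv.reader(io.StringIO(requests.get(INDEX_URL).text))
    index = {geoid: (int(offset), int(length)) for geoid, offset, length in rows}
    index_times = [time.time() - t0]

    chunks, data_times = [], []
    for geo_id in geo_ids:
        offset, length = index[geo_id]
        t0 = time.time()
        rng = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
        chunks.append(requests.get(DATA_URL, headers=rng).text)
        data_times.append(time.time() - t0)

    return Response("".join(chunks), mimetype="text/plain", headers={
        "X-Index-Times": " ".join("%.6f" % t for t in index_times),
        "X-Data-Times": " ".join("%.6f" % t for t in data_times),
    })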

@iandees
Member Author

iandees commented Sep 25, 2016

Thanks for doing this, @migurski. This is really great. I will take a closer look tomorrow, but this seems like a great start.

Some first thoughts:

  • To match our existing API (which requires that you specify column(s) as well as geo_ids), we'd want to make one more lookup to figure out which column in the CSV your tool returns corresponds to each requested Census column. This lookup would only need to happen once per API call.
  • It probably makes sense to stack all the Census columns into one gigantic CSV so we don't have to figure out which sequence file they're in. At this point, it might make sense to split each geography into its own file like you were suggesting earlier. That would mean tens of thousands of S3 PUTs, but it would save an index lookup.
  • If we store each geography as its own file, we could also store its GeoJSON in the same file.

More tomorrow.
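
One way the per-geography file from those bullets could be laid out, as a sketch: every table's columns stacked under a tables key plus the geography's GeoJSON, gzipped before the PUT. The field names follow the .tables.<table>.estimate structure that shows up later in this thread; apart from the 19940 estimate, the values and metadata fields here are placeholders.

import gzip
import json

def build_blob(geo_id, name, geometry, tables):
    # tables: {table_id: {"estimate": {column_id: value}, "error": {column_id: value}}}
    return {
        "geography": {"geoid": geo_id, "name": name},   # assumed metadata fields
        "geometry": geometry,                           # GeoJSON geometry for the geography
        "tables": tables,                               # all Census columns for this geography
    }

blob = build_blob(
    "04000US02",
    "Alaska",
    {"type": "Point", "coordinates": [-150.0, 64.0]},   # placeholder geometry
    {"B00001": {"estimate": {"B00001001": 19940}, "error": {"B00001001": None}}},
)

# Gzip before uploading to S3; clients decompress it themselves, as the
# curl | gzip -d example later in the thread does.
body = gzip.compress(json.dumps(blob).encode("utf-8"))
print(len(body), "bytes compressed")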

@migurski

Interestingly (to me), the index lookup times are only about 20% of the total, so it might not be a huge savings to pre-slice the data by geography? The data this demo is hosting has about 150KB per geography, but its columns are not marked in any sensible way. I suppose that a JSON blob per record, perhaps gzipped, might make more sense.

@iandees
Member Author

iandees commented Sep 26, 2016

I tried the gzipped JSON blob idea. Example is posted on S3 here. It's 249KB gzipped on S3 and 1.1MB expanded in the browser.

Timing the fetch, decompress, and parse steps shows each taking about 0.25s, for a total of just under 0.75s on my laptop over wireless:

$ time curl -s https://embed.censusreporter.org/test/04000US02.json | gzip -d | jq .tables.B00001.estimate.B00001001 && echo
19940
curl -s https://embed.censusreporter.org/test/04000US02.json  0.03s user 0.02s system 16% cpu 0.254 total
gzip -d  0.01s user 0.00s system 3% cpu 0.254 total
jq .tables.B00001.estimate.B00001001  0.05s user 0.01s system 21% cpu 0.262 total

(Note that embed.censusreporter.org is a CloudFront distribution that points to the S3 bucket)

So this is quite a bit slower, but it's easier to understand and leaves room in the JSON for other metadata about the geography.
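
Reading that blob from Python mirrors the jq pipeline above (the URL is the test object mentioned in this comment and may no longer exist):

import gzip
import json
import requests

resp = requests.get("https://embed.censusreporter.org/test/04000US02.json")
data = resp.content
if data[:2] == b"\x1f\x8b":     # raw gzip bytes, as the gzip -d in the example implies
    data = gzip.decompress(data)

blob = json.loads(data)
print(blob["tables"]["B00001"]["estimate"]["B00001001"])   # 19940 in the example above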

@JoeGermuska
Member

@iandees What's the plan for the full text search feature in this model?

@iandees
Member Author

iandees commented Sep 26, 2016

We can still have a database, but it doesn't need to be 300+ GB if we can put the bulk data on S3.
