Optimize the code #104

Open

waldoj opened this issue Mar 11, 2015 · 7 comments

Comments

waldoj commented Mar 11, 2015

This just takes too long to run. It's not a problem (except when debugging), but it just ain't right. Figure out how to speed this up. It shouldn't be hard.

waldoj commented Nov 22, 2015

This has become a more pressing need. For the geocoder, too.

waldoj commented Nov 22, 2015

Maybe we could use the date fields? If a record's date shows it hasn't changed since date X, skip that record? That would speed up Crump a lot.
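
A minimal sketch of that filter, not Crump's actual code: the field name "last_updated" and the 8-day cutoff are hypothetical stand-ins for whatever the real data provides.

```python
from datetime import date, timedelta

# Hypothetical cutoff: anything not updated in the last 8 days is skipped.
CUTOFF = date.today() - timedelta(days=8)

def records_to_process(records):
    """Yield only records whose date field is newer than the cutoff."""
    for record in records:
        # "last_updated" is assumed to already be parsed as a datetime.date.
        if record["last_updated"] >= CUTOFF:
            yield record
```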

waldoj commented Nov 22, 2015

I just realized how to make the geocoder a lot faster: actually use the --since flag. That will just require a simple modification to the update script, so that it passes the current date minus 8 days to the geocoder.
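
A sketch of what that change to the update script could look like. The geocoder entry point name here is a guess; only the --since flag comes from the comment above.

```python
import subprocess
from datetime import date, timedelta

# Compute the cutoff: the current date minus 8 days, as an ISO-format string.
since = (date.today() - timedelta(days=8)).isoformat()

# "geocode.py" is a hypothetical stand-in for the real geocoder script;
# only records modified on or after the cutoff get re-geocoded.
subprocess.run(["python", "geocode.py", "--since", since], check=True)
```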

waldoj commented Nov 22, 2015

OK, the geocoder has been sped up markedly with that change.

waldoj commented Nov 22, 2015

Maybe compare the file to the prior week's file, and diff them? Then only operate on the diff?
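
One way to operate only on the diff without sorting anything, sketched with hypothetical filenames: hold the prior week's lines in a set and stream the new file, keeping only lines that are new or changed. Memory use is roughly the size of the old file, which may or may not be acceptable at a couple of million lines.

```python
def changed_or_new_lines(old_path, new_path):
    """Yield lines from new_path that do not appear verbatim in old_path."""
    with open(old_path) as old_file:
        old_lines = set(old_file)  # memory cost is roughly the old file's size
    with open(new_path) as new_file:
        for line in new_file:
            if line not in old_lines:
                yield line

for line in changed_or_new_lines("last_week.txt", "this_week.txt"):
    ...  # parse and geocode only these records
```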

waldoj commented Nov 22, 2015

comm might be the tool for that, but it requires that the files be sorted. And sorting a multi-million-line file isn't nothing.

waldoj commented Nov 22, 2015

I took the raw data from March and the data from this week, sorted each file, and used comm to compare them. (This was quite fast; the whole operation took maybe 3 minutes.) It found 493,790 lines that were unique to one file or the other, out of 1,809,192 lines. A record that changed is listed twice, once from the old file and once from the new file. I think it's plausible that 13.5% of all records changed in an 8-month span. That's far more than I would have guessed, but it's within the realm of possibility. FWIW, there are 33,358 new records, which leaves 230,216 changed records, or 12.7% of the total. If only 19% of the businesses within these records are still active, it's believable that 12.7% of records changed over a given 8-month span, even if the change is just each record's date stamp.

So if this data is OK—and it may well be—the next challenge is to figure out what to do with it.
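
For reference, a sketch of that sort-and-compare step driven from Python. The filenames are placeholders; sort and comm are the standard coreutils tools.

```python
import os
import subprocess

# Byte-stable sort order so the two files compare cleanly.
env = {**os.environ, "LC_ALL": "C"}

# comm requires sorted input, so sort each weekly export first.
subprocess.run(["sort", "-o", "old.sorted", "old_week.txt"], check=True, env=env)
subprocess.run(["sort", "-o", "new.sorted", "this_week.txt"], check=True, env=env)

# comm -3 drops lines common to both files, leaving lines unique to either
# file (added, removed, or changed records).
unique = subprocess.run(
    ["comm", "-3", "old.sorted", "new.sorted"],
    check=True, capture_output=True, text=True, env=env,
).stdout.splitlines()

# comm -13 keeps only lines unique to the new file: the new and changed
# records that a weekly run would actually need to reprocess.
to_reprocess = subprocess.run(
    ["comm", "-13", "old.sorted", "new.sorted"],
    check=True, capture_output=True, text=True, env=env,
).stdout.splitlines()

print(f"{len(unique)} lines unique to one file; {len(to_reprocess)} lines to reprocess")
```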
