Optimize the code #104

Open

waldoj opened this issue Mar 11, 2015 · 7 comments

Comments

waldoj commented Mar 11, 2015

This just takes too long to run. It's not a problem (except when debugging), but it just ain't right. Figure out how to speed this up. It shouldn't be hard.

waldoj commented Nov 22, 2015

This has become a more pressing need. For the geocoder, too.

waldoj commented Nov 22, 2015

Maybe we could use the date fields? If a record's date shows it hasn't changed since date X, skip that record? That would speed up Crump a lot.
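
A minimal sketch of that filter, not Crump's actual code: the field name "last_updated" and the 8-day cutoff are hypothetical stand-ins for whatever the real data provides.

```python
from datetime import date, timedelta

# Hypothetical cutoff: anything not updated in the last 8 days is skipped.
CUTOFF = date.today() - timedelta(days=8)

def records_to_process(records):
    """Yield only records whose date field is newer than the cutoff."""
    for record in records:
        # "last_updated" is assumed to already be parsed as a datetime.date.
        if record["last_updated"] >= CUTOFF:
            yield record
```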

waldoj commented Nov 22, 2015

I just realized how to make the geocoder a lot faster: actually use the --since flag. That will just require a simple modification to the update script, so that it passes the current date minus 8 days to the geocoder.
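
A sketch of what that change to the update script could look like. The geocoder entry point name here is a guess; only the --since flag comes from the comment above.

```python
import subprocess
from datetime import date, timedelta

# Compute the cutoff: the current date minus 8 days, as an ISO-format string.
since = (date.today() - timedelta(days=8)).isoformat()

# "geocode.py" is a hypothetical stand-in for the real geocoder script;
# only records modified on or after the cutoff get re-geocoded.
subprocess.run(["python", "geocode.py", "--since", since], check=True)
```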

waldoj commented Nov 22, 2015

OK, the geocoder has been sped up markedly with that change.

waldoj commented Nov 22, 2015

Maybe compare the file to the prior week's file, and diff them? Then only operate on the diff?
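
One way to operate only on the diff without sorting anything, sketched with hypothetical filenames: hold the prior week's lines in a set and stream the new file, keeping only lines that are new or changed. Memory use is roughly the size of the old file, which may or may not be acceptable at a couple of million lines.

```python
def changed_or_new_lines(old_path, new_path):
    """Yield lines from new_path that do not appear verbatim in old_path."""
    with open(old_path) as old_file:
        old_lines = set(old_file)  # memory cost is roughly the old file's size
    with open(new_path) as new_file:
        for line in new_file:
            if line not in old_lines:
                yield line

for line in changed_or_new_lines("last_week.txt", "this_week.txt"):
    ...  # parse and geocode only these records
```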

waldoj commented Nov 22, 2015

comm might be the tool for that, but it requires that the files be sorted. And sorting a multi-million-line file isn't nothing.

waldoj commented Nov 22, 2015

I took the raw data from March and the data from this week, sorted each file, and used comm to compare them. (This was quite fast; the whole operation took maybe 3 minutes.) It found 493,790 lines that were unique to one file or the other, out of 1,809,192 lines. A record that changed is listed twice, once from the old file and once from the new file. I think it's plausible that 13.5% of all records changed in an 8-month span. That's far more than I would have guessed, but it's within the realm of possibility. FWIW, there are 33,358 new records, which leaves 230,216 changed records, or 12.7% of the total. If only 19% of the businesses within these records are still active, it's believable that 12.7% of records changed over a given 8-month span, even if the change is just each record's date stamp.

So if this data is OK—and it may well be—the next challenge is to figure out what to do with it.
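
For reference, a sketch of that sort-and-compare step driven from Python. The filenames are placeholders; sort and comm are the standard coreutils tools.

```python
import os
import subprocess

# Byte-stable sort order so the two files compare cleanly.
env = {**os.environ, "LC_ALL": "C"}

# comm requires sorted input, so sort each weekly export first.
subprocess.run(["sort", "-o", "old.sorted", "old_week.txt"], check=True, env=env)
subprocess.run(["sort", "-o", "new.sorted", "this_week.txt"], check=True, env=env)

# comm -3 drops lines common to both files, leaving lines unique to either
# file (added, removed, or changed records).
unique = subprocess.run(
    ["comm", "-3", "old.sorted", "new.sorted"],
    check=True, capture_output=True, text=True, env=env,
).stdout.splitlines()

# comm -13 keeps only lines unique to the new file: the new and changed
# records that a weekly run would actually need to reprocess.
to_reprocess = subprocess.run(
    ["comm", "-13", "old.sorted", "new.sorted"],
    check=True, capture_output=True, text=True, env=env,
).stdout.splitlines()

print(f"{len(unique)} lines unique to one file; {len(to_reprocess)} lines to reprocess")
```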
