Write a harvester to retrieve every ethics disclosure #1

Open
3 of 5 tasks
waldoj opened this issue Mar 8, 2017 · 5 comments

Comments

waldoj commented Mar 8, 2017

  • increment through URLs
  • save the output as files
  • see if the server provides etags, to avoid reharvesting material
  • store hashes of each file in a hash table
  • extract the name and the filing date from each, store as metadata
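A minimal sketch of the loop the first two tasks describe, assuming a hypothetical URL pattern, identifier range, and output directory (the real harvester's values may differ):

```python
# Sketch only: BASE_URL, the ID range, and the output layout are placeholders,
# not the project's actual values.
import os
import time

import requests

BASE_URL = "https://example.virginia.gov/disclosures/{id}.pdf"  # hypothetical
OUTPUT_DIR = "disclosures"

def harvest(start_id, end_id, delay=1.0):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for doc_id in range(start_id, end_id):
        response = requests.get(BASE_URL.format(id=doc_id))
        if response.status_code == 200:
            with open(os.path.join(OUTPUT_DIR, f"{doc_id}.pdf"), "wb") as f:
                f.write(response.content)
        # a 404 just means no filing exists at that identifier; keep going
        time.sleep(delay)  # don't hammer the server
```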
waldoj added a commit that referenced this issue Mar 9, 2017

waldoj commented Mar 9, 2017

No ETags, so we have no way of knowing whether content has changed without retrieving it.
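For reference, checking whether the server sends ETags takes only a HEAD request against any document URL (the URL below is hypothetical):

```python
# Check the response headers for an ETag (and Last-Modified, which would also
# allow conditional requests). Per the comment above, neither helps here.
import requests

r = requests.head("https://example.virginia.gov/disclosures/2000.pdf")  # hypothetical URL
print(r.headers.get("ETag"), r.headers.get("Last-Modified"))
```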


waldoj commented Mar 9, 2017

I think the hashes are best (and most quickly) calculated at the command line, via md5, rather than with, e.g., a Python file iterator.
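A sketch of that approach, shelling out to whichever md5 tool is on the PATH; whether this actually beats hashing in Python for these file sizes is an assumption worth measuring:

```python
# Shell out to md5sum (GNU coreutils) or md5 (BSD/macOS) rather than hashing
# the file in Python.
import shutil
import subprocess

def file_md5(path):
    if shutil.which("md5sum"):
        return subprocess.check_output(["md5sum", path], text=True).split()[0]
    if shutil.which("md5"):
        return subprocess.check_output(["md5", "-q", path], text=True).strip()
    raise RuntimeError("no md5 tool found on PATH")

# e.g. hashes[doc_id] = file_md5("disclosures/2000.pdf")
```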

waldoj added a commit that referenced this issue Mar 10, 2017
There is no reason to think that existing documents are ever updated.
It looks like only new ones are added. So make this resumable, using
the most recent filename. Always start at 2000, because there are none
with lower identifiers. Toward #1.
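A minimal sketch of that resumable start, assuming documents are saved under their numeric identifiers as in the earlier sketch:

```python
# Resume from the highest identifier already saved; fall back to 2000,
# since no filings exist with lower identifiers.
import os

START_ID = 2000

def next_id(output_dir="disclosures"):
    if not os.path.isdir(output_dir):
        return START_ID
    ids = [int(name.split(".")[0]) for name in os.listdir(output_dir)
           if name.split(".")[0].isdigit()]
    return max(ids) + 1 if ids else START_ID
```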

waldoj commented Mar 16, 2017

So far, this is going badly. I keep getting cut off, and the server stops responding. I don't think it's a firewall rule, because using a VPN doesn't yield any different results.

waldoj added a commit that referenced this issue Mar 16, 2017
Optionally take input from a .resume file. Skip more aggressively through 404s. Slow down the scraping time. Toward #1.
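A sketch of how those three changes might fit together; the .resume format, delay, and stop condition here are guesses rather than the script's actual behavior:

```python
# Read the starting identifier from a .resume file when one exists, move on
# immediately through 404s, and pause between successful fetches.
import os
import time

import requests

def read_resume(path=".resume", default=2000):
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read().strip())
    return default

def harvest_from(base_url, start_id, max_misses=500, delay=2.0):
    doc_id, misses = start_id, 0
    while misses < max_misses:            # give up after a long run of 404s
        r = requests.get(base_url.format(id=doc_id))
        if r.status_code == 404:
            misses += 1                   # no delay: skip through gaps quickly
        else:
            misses = 0
            # ... save r.content and write doc_id to .resume here ...
            time.sleep(delay)             # slower pace for real documents
        doc_id += 1
```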

waldoj commented Mar 16, 2017

Running this again, I observed an interesting thing: 99 queries worked fine, but the 100th failed. Suspicious.


waldoj commented Mar 17, 2017

This time it failed after 90 queries, so the cutoff doesn't seem to be a consistent number.
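If the server really is dropping connections after a batch of requests (only a guess from the two runs above), one mitigation would be to retry each fetch with an increasing pause:

```python
# Retry a request a few times, waiting longer after each failure. This is a
# sketch of one possible mitigation, not what the harvester currently does.
import time

import requests

def fetch_with_backoff(url, attempts=5, base_delay=30):
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=30)
        except requests.exceptions.RequestException:
            time.sleep(base_delay * (attempt + 1))
    return None
```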
