Write a harvester to retrieve every ethics disclosure #1
Comments
No ETags, so we have no way of knowing whether content has changed without retrieving it.
I think the hashes are best (and most quickly) calculated at the command line.
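The thread doesn't name a specific tool, so as a hedged sketch: assuming the harvested files live in a local `disclosures/` directory (that layout is an assumption), digests could be produced with a standard utility such as `sha256sum`, or equivalently in a few lines of Python, and compared between runs in place of the missing ETags.

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 hex digest of one saved disclosure file."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# One digest per saved file, roughly what `sha256sum disclosures/*` prints,
# so a later run can detect changes by comparing digests.
for path in sorted(Path("disclosures").iterdir()):
    if path.is_file():
        print(content_hash(path), path.name)
```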
There is no reason to think that existing documents are ever updated. It looks like only new ones are added. So make this resumable, using the most recent filename. Always start at 2000, because there are none with lower identifiers. Toward #1.
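A minimal sketch of that resume logic, assuming each saved file is named after its numeric identifier (e.g. `2471.pdf`) in a `disclosures/` directory; both the naming scheme and the directory are assumptions, not stated in the thread.

```python
import re
from pathlib import Path

START_ID = 2000  # per the comment above, no disclosures exist below this identifier

def next_identifier(save_dir: Path = Path("disclosures")) -> int:
    """Resume from the most recently harvested identifier, or start at 2000."""
    seen = [
        int(match.group(1))
        for path in save_dir.glob("*")
        if (match := re.match(r"(\d+)", path.stem)) is not None
    ]
    return max(seen, default=START_ID - 1) + 1
```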
So far, this is going badly. I keep getting cut off, and the server stops responding. I don't think it's a firewall rule, because using a VPN doesn't yield any different results.
Optionally take input from a .resume file. Skip more aggressively through 404s. Slow down the scraping time. Toward #1.
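Putting those pieces together, a rough sketch of the loop might look like the following; the endpoint URL is a placeholder, and the `.pdf` extension, delay, and 404 threshold are guesses rather than values taken from the repository.

```python
import time
from pathlib import Path

import requests  # third-party: pip install requests

BASE_URL = "https://example.org/disclosure/{}"  # placeholder, not the real endpoint
SAVE_DIR = Path("disclosures")
RESUME_FILE = Path(".resume")
DELAY_SECONDS = 5            # slow the scrape down between requests
MAX_CONSECUTIVE_404S = 200   # skip through gaps in the identifier space, then stop

def harvest(start_id: int = 2000) -> None:
    """Walk identifiers upward, saving hits and skipping past runs of 404s."""
    SAVE_DIR.mkdir(exist_ok=True)
    if RESUME_FILE.exists():
        start_id = int(RESUME_FILE.read_text().strip())
    doc_id, misses = start_id, 0
    while misses < MAX_CONSECUTIVE_404S:
        response = requests.get(BASE_URL.format(doc_id), timeout=30)
        if response.status_code == 404:
            misses += 1
        elif response.ok:
            misses = 0
            (SAVE_DIR / f"{doc_id}.pdf").write_bytes(response.content)
            RESUME_FILE.write_text(str(doc_id))
        doc_id += 1
        time.sleep(DELAY_SECONDS)
```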
Running this again, I observed an interesting thing: 99 queries worked fine. The 100th failed. Suspicious.
This time it failed after 90 queries, so I dunno.
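The thread never pins down the cause, but the pattern is consistent with a per-connection or per-window cap somewhere around 90-100 requests. One defensive workaround (an assumption, not something the repository confirms) is to work in batches well under that ceiling, closing and reopening the HTTP session and resting between batches:

```python
import time

import requests  # third-party: pip install requests

BASE_URL = "https://example.org/disclosure/{}"  # placeholder, not the real endpoint
BATCH_SIZE = 50       # stay well under the observed ~90-100 request ceiling
BATCH_PAUSE = 300     # seconds to rest between batches; an untested guess

def fetch_in_batches(identifiers):
    """Yield (identifier, response) pairs, reopening the connection and
    pausing between small batches in case the server caps requests per
    connection or per time window."""
    session = requests.Session()
    for count, doc_id in enumerate(identifiers, start=1):
        yield doc_id, session.get(BASE_URL.format(doc_id), timeout=30)
        if count % BATCH_SIZE == 0:
            session.close()
            time.sleep(BATCH_PAUSE)
            session = requests.Session()
```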