Write a harvester to retrieve every ethics disclosure #1

Open
3 of 5 tasks
waldoj opened this issue Mar 8, 2017 · 5 comments

Comments

waldoj commented Mar 8, 2017

  • increment through URLs
  • save the output as files
  • see if the server provides etags, to avoid reharvesting material
  • store hashes of each file in a hash table
  • extract the name and the filing date from each, store as metadata
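A minimal sketch of the loop the first two tasks describe, assuming a hypothetical URL pattern, identifier range, and output directory (the real harvester's values may differ):

```python
# Sketch only: BASE_URL, the ID range, and the output layout are placeholders,
# not the project's actual values.
import os
import time

import requests

BASE_URL = "https://example.virginia.gov/disclosures/{id}.pdf"  # hypothetical
OUTPUT_DIR = "disclosures"

def harvest(start_id, end_id, delay=1.0):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for doc_id in range(start_id, end_id):
        response = requests.get(BASE_URL.format(id=doc_id))
        if response.status_code == 200:
            with open(os.path.join(OUTPUT_DIR, f"{doc_id}.pdf"), "wb") as f:
                f.write(response.content)
        # a 404 just means no filing exists at that identifier; keep going
        time.sleep(delay)  # don't hammer the server
```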
waldoj added a commit that referenced this issue Mar 9, 2017

waldoj commented Mar 9, 2017

No ETags, so we have no way of knowing whether content has changed without retrieving it.
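For reference, checking whether the server sends ETags takes only a HEAD request against any document URL (the URL below is hypothetical):

```python
# Check the response headers for an ETag (and Last-Modified, which would also
# allow conditional requests). Per the comment above, neither helps here.
import requests

r = requests.head("https://example.virginia.gov/disclosures/2000.pdf")  # hypothetical URL
print(r.headers.get("ETag"), r.headers.get("Last-Modified"))
```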


waldoj commented Mar 9, 2017

I think the hashes are best (and most quickly) calculated at the command line, via md5, rather than with, e.g., a Python file iterator.
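A sketch of that approach, shelling out to whichever md5 tool is on the PATH; whether this actually beats hashing in Python for these file sizes is an assumption worth measuring:

```python
# Shell out to md5sum (GNU coreutils) or md5 (BSD/macOS) rather than hashing
# the file in Python.
import shutil
import subprocess

def file_md5(path):
    if shutil.which("md5sum"):
        return subprocess.check_output(["md5sum", path], text=True).split()[0]
    if shutil.which("md5"):
        return subprocess.check_output(["md5", "-q", path], text=True).strip()
    raise RuntimeError("no md5 tool found on PATH")

# e.g. hashes[doc_id] = file_md5("disclosures/2000.pdf")
```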

waldoj added a commit that referenced this issue Mar 10, 2017
There is no reason to think that existing documents are ever updated.
It looks like only new ones are added. So make this resumable, using
the most recent filename. Always start at 2000, because there are none
with lower identifiers. Toward #1.
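A minimal sketch of that resumable start, assuming documents are saved under their numeric identifiers as in the earlier sketch:

```python
# Resume from the highest identifier already saved; fall back to 2000,
# since no filings exist with lower identifiers.
import os

START_ID = 2000

def next_id(output_dir="disclosures"):
    if not os.path.isdir(output_dir):
        return START_ID
    ids = [int(name.split(".")[0]) for name in os.listdir(output_dir)
           if name.split(".")[0].isdigit()]
    return max(ids) + 1 if ids else START_ID
```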

waldoj commented Mar 16, 2017

So far, this is going badly. I keep getting cut off, and the server stops responding. I don't think it's a firewall rule, because using a VPN doesn't yield any different results.

waldoj added a commit that referenced this issue Mar 16, 2017
Optionally take input from a .resume file. Skip more aggressively through 404s. Slow down the scraping time. Toward #1.
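A sketch of how those three changes might fit together; the .resume format, delay, and stop condition here are guesses rather than the script's actual behavior:

```python
# Read the starting identifier from a .resume file when one exists, move on
# immediately through 404s, and pause between successful fetches.
import os
import time

import requests

def read_resume(path=".resume", default=2000):
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read().strip())
    return default

def harvest_from(base_url, start_id, max_misses=500, delay=2.0):
    doc_id, misses = start_id, 0
    while misses < max_misses:            # give up after a long run of 404s
        r = requests.get(base_url.format(id=doc_id))
        if r.status_code == 404:
            misses += 1                   # no delay: skip through gaps quickly
        else:
            misses = 0
            # ... save r.content and write doc_id to .resume here ...
            time.sleep(delay)             # slower pace for real documents
        doc_id += 1
```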

waldoj commented Mar 16, 2017

Running this again, I observed an interesting thing: 99 queries worked fine, but the 100th failed. Suspicious.


waldoj commented Mar 17, 2017

This time it failed after 90 queries, so the cutoff doesn't seem to be a consistent number.
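If the server really is dropping connections after a batch of requests (only a guess from the two runs above), one mitigation would be to retry each fetch with an increasing pause:

```python
# Retry a request a few times, waiting longer after each failure. This is a
# sketch of one possible mitigation, not what the harvester currently does.
import time

import requests

def fetch_with_backoff(url, attempts=5, base_delay=30):
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=30)
        except requests.exceptions.RequestException:
            time.sleep(base_delay * (attempt + 1))
    return None
```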
