Please see the Add-On documentation
This simple DocumentCloud scraper Add-On will monitor a given site for documents and upload them to your DocumentCloud account, alerting you to any documents that meet given keyword criteria.
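To make the "monitor a given site for documents" step concrete, here is a minimal sketch of finding document links on a page using only the Python standard library. It is not the Add-On's actual implementation: the names `LinkCollector`, `find_document_links`, and `DOCUMENT_EXTENSIONS`, and the list of file extensions, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of file types worth collecting; adjust for your site.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx")


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_document_links(html, base_url):
    """Return links whose path ends in one of DOCUMENT_EXTENSIONS."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [
        url
        for url in collector.links
        if urlparse(url).path.lower().endswith(DOCUMENT_EXTENSIONS)
    ]
```

Checking the URL path rather than the whole URL means a link such as `report.pdf?download=1` still counts as a document.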
Documents that are scraped are tracked in a data.json file which is checked in to the repository. If you copy this template or fork this repository, you may want to delete that file before pointing the scraper to a new site.
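The idea of tracking scraped documents in a file can be sketched as follows. This is a simplified illustration that assumes the file holds a flat JSON list of URLs; the real layout of data.json in this repository may differ, and the helper names `load_seen`, `save_seen`, and `find_new` are hypothetical.

```python
import json
from pathlib import Path


def load_seen(path):
    """Return the set of URLs already recorded, or an empty set if no file exists."""
    file = Path(path)
    if not file.exists():
        return set()
    return set(json.loads(file.read_text(encoding="utf-8")))


def save_seen(path, seen):
    """Write the seen URLs back as a sorted JSON list (assumed format)."""
    Path(path).write_text(json.dumps(sorted(seen), indent=2), encoding="utf-8")


def find_new(found_urls, seen):
    """Return URLs not seen before, keeping order and dropping duplicates."""
    new = []
    for url in found_urls:
        if url not in seen and url not in new:
            new.append(url)
    return new
```

Under this scheme, deleting the file resets the scraper's memory, which is why a stale data.json from a previous site is worth removing before you point the scraper somewhere new.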
Important Note: Because of the way GitHub works, you might stumble upon
these directions in a variety of directories. The canonical version lives at
https://github.com/MuckRock/documentcloud-scraper-addon, so if you're on a
different page and you're new to GitHub and DocumentCloud Add-Ons, we recommend
going to that page for the latest instructions and the most straightforward flow.
Down the road, you might want to build off other versions, but always check to
make sure you trust and can verify the creators of the code.
First, you'll need to have a verified MuckRock account. If you've ever uploaded documents to DocumentCloud before, you're already set. If not, register a free account here and then request verification here.
Next, log in to DocumentCloud and create a new project to store the documents that your scraper grabs.
Click on your newly created project on the left-hand side of the screen, and
note the numbers to the right of its name. This is the project ID; in this
example, 207354.
Open the Add-Ons dropdown menu, choose "Browse All Add-Ons", and find "Scraper". Click the inactive button to mark the Add-On as active, then click Done. Open the Add-Ons dropdown menu once more and click Scraper, which will now be active.
If successful, the Add-On will grab all the documents it can pull from the site, load them into DocumentCloud, and then send you an email. It will now run hourly and will only alert you if it pulls new documents, with a second alert highlighting any documents that meet your key terms.
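The keyword alert described above can be pictured with a short sketch. It assumes plain case-insensitive substring matching, which may not be how the Add-On actually evaluates key terms; `matching_keywords` and `documents_to_flag` are hypothetical names, and the documents are represented as a simple title-to-text dictionary for illustration.

```python
def matching_keywords(text, keywords):
    """Return the keywords found in text, ignoring case and blank entries."""
    lowered = text.lower()
    hits = []
    for keyword in keywords:
        term = keyword.strip()
        if term and term.lower() in lowered:
            hits.append(term)
    return hits


def documents_to_flag(documents, keywords):
    """Map each document title to its matched keywords, skipping documents with none."""
    flagged = {}
    for title, text in documents.items():
        hits = matching_keywords(text, keywords)
        if hits:
            flagged[title] = hits
    return flagged
```

With this approach, only documents that match at least one term would appear in the second alert, and each would be listed alongside the terms that triggered it.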
This is a relatively simple Add-On, but one of the powerful things about this approach is that it can be mixed and matched with other tools. Once you're comfortable with the basics, you can explore other example Add-Ons that let you automatically extract data, use machine learning to classify documents into categories, and more. Subscribe to the DocumentCloud newsletter to get more code examples and opportunities to get help building tools that meet your newsroom's needs.