Some HTML-format reports are split into multiple web pages #112

Open
divergentdave opened this issue Aug 4, 2014 · 1 comment

Comments

@divergentdave
Contributor

Some reports are split across multiple web pages, and so far we fetch only the table of contents. For example, http://oig.federalreserve.gov/reports/board-full-report-20140312a.htm points to an executive summary and five sub-pages. Handling this will require multiple URLs and files per report, or perhaps crawling the pages we need and storing them in a WARC archive. Scrapers may have to provide a list of URLs for such reports, rather than a single URL.
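
To make the "list of URLs" idea concrete, here is a rough sketch of how a scraper might turn a table-of-contents page into such a list. Everything in it is an assumption for illustration: the function name `collect_report_urls`, the same-host filter, and the choice of the standard library's `html.parser` are not taken from the existing scrapers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_report_urls(toc_url, toc_html):
    """Return the table-of-contents URL followed by the same-host pages it links to.

    Hypothetical helper: a real scraper would likely need a narrower filter
    (for example, a URL prefix per report) to skip site navigation links.
    """
    parser = _LinkCollector()
    parser.feed(toc_html)

    host = urlparse(toc_url).netloc
    urls = [toc_url]
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts so that
        # in-page anchors do not show up as separate pages.
        absolute, _fragment = urldefrag(urljoin(toc_url, href))
        if urlparse(absolute).netloc == host and absolute not in urls:
            urls.append(absolute)
    return urls
```

A scraper could then store the result under a key such as `urls` in the report record, alongside or instead of the single `url`, and the downloader would fetch each entry in order. The same-host filter is deliberately crude and would also pick up unrelated navigation links on a real page.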

@divergentdave
Contributor Author

Here's a pathological corner case: a PDF file that links to more PDF files.

http://www.epa.gov/oig/reports/2002/Models.pdf
