Investigate Crawling Issues Found In Legacy Spider #97

Open
selfdanielj opened this issue Jan 24, 2025 · 0 comments

Summary

This issue lists some issues known at the time of release that need to be considered when developing Jemison. They come from this doc: Deploying Spider to Production

Issues

These are specific issues but represent general concerns for any site we need to crawl.

Cannot crawl drive.hhs.gov

Issue: Crawling the drive.hhs.gov domain starting at https://drive.hhs.gov/ produces 0 URLs.
Cause: Their robots.txt excludes unknown user agents from crawling the entire site:

User-agent: *
Disallow: /

Resolution: Unknown
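
The block can be confirmed outside of Scrapy with the standard library's robots.txt parser. This is a minimal sketch; the "jemison-spider" user-agent string is just an illustrative placeholder. Scrapy's own RobotsTxtMiddleware (with ROBOTSTXT_OBEY enabled) makes the same decision, which is why the crawl yields 0 URLs.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt.
rp = RobotFileParser("https://drive.hhs.gov/robots.txt")
rp.read()

# With `User-agent: *` / `Disallow: /`, this prints False for any crawler
# that is not explicitly allowed ("jemison-spider" is a placeholder name).
print(rp.can_fetch("jemison-spider", "https://drive.hhs.gov/"))
```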

Cannot crawl allhands.navy.mil

Issue: Crawling the allhands.navy.mil domain starting at https://allhands.navy.mil produces 0 URLs. A 403 status is returned from the start URL, and a 403 is similarly seen when using curl to interact with the site. A browser works, as does curl with the user agent set to mimic a browser.
Cause: Unsure; perhaps WAF or CloudFront rules are blocking these requests?
Resolution: Unknown
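
A minimal sketch of the reproduction described above, comparing the status code returned for a default client user agent versus a browser-like one. The requests library and the specific UA string are assumptions for illustration, not what the legacy spider uses.

```python
import requests

URL = "https://allhands.navy.mil/"
# An illustrative browser-like User-Agent string.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

for label, headers in [("default UA", {}), ("browser UA", {"User-Agent": BROWSER_UA})]:
    resp = requests.get(URL, headers=headers, timeout=30)
    # If the cause is UA-based filtering (WAF/CloudFront), expect a 403 for
    # the default UA and a 200 for the browser-like UA.
    print(f"{label}: HTTP {resp.status_code}")
```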

Long Running & Ineffective Domains

Issue: A few domains, such as toolkit.climate.gov and hsrd.research.va.gov, and especially coastwatch.noaa.gov, run for an extremely long time, often going long periods without returning any URLs.
Cause: Ultimately unknown due to a lack of visibility into Scrapy internals, but likely these sites have very deep directories full of data files that we crawl but ultimately ignore.
Resolution: A few options exist, including adding code to stop the scrape if we don’t get any new URLs for a long period of time (e.g. 24 hours; see the sketch below), investigating the individual causes and updating the code to exclude specific paths, or excluding these domains from the spider crawl list.
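
The first option could be implemented as a small Scrapy extension. This is a sketch only, not existing Jemison code; the IDLE_CLOSE_SECONDS setting name and the class name are made up for illustration.

```python
import time

from scrapy import signals
from scrapy.exceptions import NotConfigured
from twisted.internet import task


class IdleCloseExtension:
    """Close the spider if no response has been received for `idle_seconds`."""

    def __init__(self, crawler, idle_seconds, check_interval=60):
        self.crawler = crawler
        self.idle_seconds = idle_seconds
        self.check_interval = check_interval
        self.last_activity = time.time()
        self.task = None

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting; e.g. IDLE_CLOSE_SECONDS = 86400 for 24 hours.
        idle_seconds = crawler.settings.getint("IDLE_CLOSE_SECONDS", 0)
        if not idle_seconds:
            raise NotConfigured
        ext = cls(crawler, idle_seconds)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # Periodically check how long it has been since the last response.
        self.task = task.LoopingCall(self._check_idle, spider)
        self.task.start(self.check_interval, now=False)

    def spider_closed(self, spider):
        if self.task and self.task.running:
            self.task.stop()

    def response_received(self, response, request, spider):
        self.last_activity = time.time()

    def _check_idle(self, spider):
        if time.time() - self.last_activity > self.idle_seconds:
            spider.logger.info("No responses for %s seconds, closing spider", self.idle_seconds)
            self.crawler.engine.close_spider(spider, reason="idle_timeout")
```

The extension would be enabled through the project's EXTENSIONS setting along with the idle threshold. The path-exclusion option maps naturally onto the deny and deny_extensions arguments that Scrapy's LinkExtractor already accepts.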

selfdanielj converted this from a draft issue Jan 24, 2025