Investigate Crawling Issues Found In Legacy Spider #97

Open
selfdanielj opened this issue Jan 24, 2025 · 0 comments

Summary

This issue lists some issues known at the time of release that need to be considered when developing Jemison. They come from this doc: Deploying Spider to Production

Issues

These are specific issues but represent general concerns for any site we need to crawl.

Cannot crawl drive.hhs.gov

Issue: Crawling the drive.hhs.gov domain starting at https://drive.hhs.gov/ produces 0 URLs.
Cause: Their robots.txt excludes unknown user agents from crawling the entire site:

User-agent: *
Disallow: /

Resolution: Unknown
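
The block can be confirmed outside of Scrapy with the standard library's robots.txt parser. This is a minimal sketch; the "jemison-spider" user-agent string is just an illustrative placeholder. Scrapy's own RobotsTxtMiddleware (with ROBOTSTXT_OBEY enabled) makes the same decision, which is why the crawl yields 0 URLs.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt.
rp = RobotFileParser("https://drive.hhs.gov/robots.txt")
rp.read()

# With `User-agent: *` / `Disallow: /`, this prints False for any crawler
# that is not explicitly allowed ("jemison-spider" is a placeholder name).
print(rp.can_fetch("jemison-spider", "https://drive.hhs.gov/"))
```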

Cannot crawl allhands.navy.mil

Issue: Crawling the allhands.navy.mil domain starting at https://allhands.navy.mil produces 0 URLs. A 403 status is returned from the start URL, and a 403 is similarly seen when using curl to interact with the site. A browser works, as does curl with the user agent set to mimic a browser.
Cause: Unsure; perhaps WAF or CloudFront rules are blocking these requests?
Resolution: Unknown
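
A minimal sketch of the reproduction described above, comparing the status code returned for a default client user agent versus a browser-like one. The requests library and the specific UA string are assumptions for illustration, not what the legacy spider uses.

```python
import requests

URL = "https://allhands.navy.mil/"
# An illustrative browser-like User-Agent string.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

for label, headers in [("default UA", {}), ("browser UA", {"User-Agent": BROWSER_UA})]:
    resp = requests.get(URL, headers=headers, timeout=30)
    # If the cause is UA-based filtering (WAF/CloudFront), expect a 403 for
    # the default UA and a 200 for the browser-like UA.
    print(f"{label}: HTTP {resp.status_code}")
```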

Long Running & Ineffective Domains

Issue: A few domains, such as toolkit.climate.gov and hsrd.research.va.gov, and especially coastwatch.noaa.gov, run for an extremely long time, often going long periods without returning any URLs.
Cause: Ultimately unknown due to a lack of visibility into Scrapy internals, but likely these sites have very deep directories full of data files that we crawl but ultimately ignore.
Resolution: A few options exist, including adding code to stop the scrape if we don’t get any new URLs for a long period of time (e.g. 24 hours; see the sketch below), investigating the individual causes and updating the code to exclude specific paths, or excluding these domains from the spider crawl list.
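
The first option could be implemented as a small Scrapy extension. This is a sketch only, not existing Jemison code; the IDLE_CLOSE_SECONDS setting name and the class name are made up for illustration.

```python
import time

from scrapy import signals
from scrapy.exceptions import NotConfigured
from twisted.internet import task


class IdleCloseExtension:
    """Close the spider if no response has been received for `idle_seconds`."""

    def __init__(self, crawler, idle_seconds, check_interval=60):
        self.crawler = crawler
        self.idle_seconds = idle_seconds
        self.check_interval = check_interval
        self.last_activity = time.time()
        self.task = None

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting; e.g. IDLE_CLOSE_SECONDS = 86400 for 24 hours.
        idle_seconds = crawler.settings.getint("IDLE_CLOSE_SECONDS", 0)
        if not idle_seconds:
            raise NotConfigured
        ext = cls(crawler, idle_seconds)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # Periodically check how long it has been since the last response.
        self.task = task.LoopingCall(self._check_idle, spider)
        self.task.start(self.check_interval, now=False)

    def spider_closed(self, spider):
        if self.task and self.task.running:
            self.task.stop()

    def response_received(self, response, request, spider):
        self.last_activity = time.time()

    def _check_idle(self, spider):
        if time.time() - self.last_activity > self.idle_seconds:
            spider.logger.info("No responses for %s seconds, closing spider", self.idle_seconds)
            self.crawler.engine.close_spider(spider, reason="idle_timeout")
```

The extension would be enabled through the project's EXTENSIONS setting along with the idle threshold. The path-exclusion option maps naturally onto the deny and deny_extensions arguments that Scrapy's LinkExtractor already accepts.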

selfdanielj converted this from a draft issue Jan 24, 2025