This issue lists some known issues at the time of release that need to be considered when developing Jemison. They come from this doc: Deploying Spider to Production
Issues
These are specific issues, but they represent general concerns for any site we need to crawl.
Cannot crawl drive.hhs.gov
Issue: Crawling the drive.hhs.gov domain starting at https://drive.hhs.gov/ produces 0 URLs.
Cause: Their robots.txt excludes unknown user agents from crawling the entire site:
User-agent: *
Disallow: /
Resolution: Unknown
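A quick way to confirm the block from outside the crawler is to run the site's robots.txt through Python's standard-library robot parser. This is a minimal diagnostic sketch; the "jemison-spider" user-agent string is only a stand-in for whatever agent the spider actually sends.

```python
# Minimal sketch: check whether drive.hhs.gov's robots.txt allows a given
# user agent to fetch the start URL. Uses only the Python standard library.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://drive.hhs.gov/robots.txt")
parser.read()

# With "User-agent: *" / "Disallow: /", this prints False for any agent
# that is not explicitly allowed. "jemison-spider" is a placeholder name.
print(parser.can_fetch("jemison-spider", "https://drive.hhs.gov/"))
```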
Cannot crawl allhands.navy.mil
Issue: Crawling the allhands.navy.mil domain starting at https://allhands.navy.mil produces 0 URLs. The start URL returns a 403 status. A 403 is also returned when using curl to interact with the site. A browser works, as does curl with the user agent set to mimic a browser.
Cause: Unclear; possibly WAF or CloudFront rules blocking these requests.
Resolution: Unknown
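The behaviour described above can be reproduced outside Scrapy with a short script. This is only a diagnostic sketch: the browser-like User-Agent string is illustrative, and the exact status codes depend on whatever WAF/CDN rules are in place at the time.

```python
# Sketch reproducing the observed behaviour: a default (non-browser) client
# gets a 403, while a browser-like User-Agent may succeed if the block is
# user-agent based. The header value below is illustrative only.
import requests

url = "https://allhands.navy.mil"

# Default python-requests User-Agent -- per the issue, non-browser clients see a 403.
print(requests.get(url).status_code)

# Browser-like User-Agent -- expected to succeed if the block is UA-based.
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
}
print(requests.get(url, headers=browser_headers).status_code)
```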
Long Running & Ineffective Domains
Issue: A few domains, such as toolkit.climate.gov and hsrd.research.va.gov, but especially coastwatch.noaa.gov, run for an extremely long time, often going long periods without returning any URLs.
Cause: Ultimately unknown due to lack of visibility into Scrapy internals, but there are likely very deep directories full of data files that we crawl and then ultimately ignore.
Resolution: A few options exist: add code to stop the scrape if we don't get any new URLs for a long period of time (e.g. 24 hours), investigate the individual causes and update the code to exclude specific paths, or exclude these domains from the spider crawl list. A sketch of the first two options follows.
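Assuming the crawl is a standard Scrapy project, the first two options could look roughly like the sketch below. Note that CLOSESPIDER_TIMEOUT caps total run time rather than detecting "no new URLs for 24 hours" (the latter would need a custom extension), and the deny patterns, domain, and spider name here are illustrative, not confirmed paths.

```python
# settings.py -- hard ceilings via Scrapy's built-in CloseSpider extension.
# These cap total run time / pages crawled, a coarser stand-in for the
# "no new URLs for 24 hours" idea described in the Resolution above.
CLOSESPIDER_TIMEOUT = 60 * 60 * 24   # stop the spider after 24 hours of crawling
CLOSESPIDER_PAGECOUNT = 500_000      # or after an absolute page budget

# spider -- skip deep data-file directories with LinkExtractor deny rules.
# The deny patterns are illustrative placeholders, not verified paths.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["coastwatch.noaa.gov"]
    start_urls = ["https://coastwatch.noaa.gov/"]
    rules = (
        # Follow links, but never descend into paths matching the deny patterns.
        Rule(LinkExtractor(deny=(r"/data/", r"\.nc$", r"\.hdf$")), follow=True),
    )
```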