
Update robots.txt #642

Open · wants to merge 1 commit into base: develop

Conversation
Conversation

@benjimin (Contributor) commented Jan 23, 2025

Attempt to exclude web crawlers by making robots.txt fully standard-compliant (no Allow field, the same capitalisation as the standard, no wildcards) and by covering both the singular and plural product paths.

More explanation: https://github.com/GeoscienceAustralia/dea-tech-support/discussions/36
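For reference, a robots.txt along the lines described above might look like the following. This is a sketch, not the exact file from this PR; the specific product paths are assumptions based on the description of singular/plural coverage:

```
User-agent: *
Disallow: /product/
Disallow: /products/
```

Sticking to only `User-agent` and `Disallow`, with the standard capitalisation and no wildcards, rules out crawlers that choke on non-standard extensions of the format.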


📚 Documentation preview 📚: https://datacube-explorer--642.org.readthedocs.build/en/642/

@omad (Member) left a comment

Thanks Ben. I hope this helps.

I suspect, however, that it'll still be inadequate. It sounds like AI bots are still ramping up and getting more aggressive. :(

@benjimin (Contributor, Author) commented
@omad sure, just wanting to completely rule out the possibility of innocent web crawlers choking on the format before proceeding to steps that will no doubt be more complicated (like asking AWS to explain why the traffic isn't categorised as bot traffic by the WAF).

I think Caddy is already temporarily serving the updated version for NCI Explorer (with the bonus that it does so responsively even when gunicorn is flooded), and we're intending to use config (in our Helm release) to update this setting in DEA Explorer ahead of making any new datacube-explorer release.
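As a sketch of how Caddy can answer robots.txt itself, so it stays responsive even when the upstream app is saturated, a minimal Caddyfile might look like this. The site address, upstream port, and robots.txt body are all illustrative assumptions, not the actual NCI configuration, and the heredoc body syntax needs a reasonably recent Caddy release:

```
explorer.example.org {
	# Answer robots.txt directly from Caddy, without touching the
	# Explorer app, so it still responds when gunicorn is flooded.
	respond /robots.txt <<ROBOTS
	User-agent: *
	Disallow: /product/
	Disallow: /products/
	ROBOTS
	200

	# Everything else is proxied through to the Explorer app.
	reverse_proxy localhost:8080
}
```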

During the last bout of NCI Explorer outage, the logs suggested traffic concentrated on the "product" path (which had been neglected here). They also indicated a diversity of browser/OS versions, a very modest number of hourly requests per IP (spread over many IPs), and a distinctive temporal pattern (24 spikes per day, at 60-minute intervals). It gave the impression of non-EO-specific bot traffic attempting to imitate human browser traffic, and seemed like it would be nontrivial to filter from legitimate use.
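The hourly-spike observation can be checked mechanically: bucketing request timestamps by minute-of-hour makes a 24-spikes-per-day pattern stand out. This is a hypothetical sketch, not the analysis actually run on the NCI logs; the timestamps below are synthetic:

```python
from collections import Counter
from datetime import datetime

def minute_of_hour_histogram(timestamps):
    """Count requests per minute-of-hour (0-59).

    Traffic spiking once per hour shows up as one dominant bucket;
    human traffic spreads roughly evenly across all sixty.
    """
    return Counter(ts.minute for ts in timestamps)

# Synthetic log: three requests on the hour plus one at half past,
# for every hour of one day.
ts = [datetime(2025, 1, 20, h, m) for h in range(24) for m in (0, 0, 0, 30)]
hist = minute_of_hour_histogram(ts)
assert hist[0] == 72    # on-the-hour spike: 3 requests x 24 hours
assert hist[30] == 24   # background: 1 request x 24 hours
```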
