
Link checker excludes getting out of hand #35

Open · gwct opened this issue Jan 9, 2025 · 1 comment
Labels: discussion (Discussing various things about the website)

Comments

gwct (Collaborator) commented Jan 9, 2025

We are getting continuous problems with link checker errors, and the solution each time seems to be adding more and more domains to the list of domains that are excluded from being checked. As the number of excluded domains grows, the utility of the link checker diminishes.

As of Jan. 9, 2025, these are the domains excluded from our link checker:

| Domain | Date added to exclude list | Reason for adding |
| --- | --- | --- |
| doi.org | 01.09.2025 | 403: Network error: Forbidden |
| academic.oup.com/nar | 01.09.2025 | 403: Network error: Forbidden |
| gnu.org | 10.25.2024 | 429: Network error: Too Many Requests |
| anaconda.org | 04.05.2024 | unknown |
| fonts.gstatic.com | 12.06.2023 | unknown |
| www.microsoft.com/en-us/microsoft-365/onedrive/online-cloud-storage | 12.08.2023 | timeout |

I think some of these are justifiably excluded, like fonts.gstatic.com and the Microsoft OneDrive one, but others, like doi.org, are very wide-ranging.
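
For concreteness, the excludes end up in the link checker invocation roughly like the sketch below. This assumes we call lycheeverse/lychee-action with repeated `--exclude` flags and check built HTML pages; the actual workflow file, input paths, and flag layout in our repo may differ.

```yaml
# Hypothetical link-check step showing the current exclusion list
- name: Check links
  uses: lycheeverse/lychee-action@v2
  with:
    args: >-
      --exclude 'doi\.org'
      --exclude 'academic\.oup\.com/nar'
      --exclude 'gnu\.org'
      --exclude 'anaconda\.org'
      --exclude 'fonts\.gstatic\.com'
      --exclude 'www\.microsoft\.com/en-us/microsoft-365/onedrive/online-cloud-storage'
      './**/*.html'
```

Each `--exclude` is a regex, so broad patterns like `doi\.org` skip every DOI link on the site, which is exactly the problem described above.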

We also run into errors with the cache, which has to be manually deleted here if some links threw errors in previous runs (at least I think that is the reason). These show up as cache errors.

This issue is for discussing possible solutions to these problems. Ultimately I think the link checker is a good thing to have, but it becomes tedious to deal with these same issues each time we build the page, and the growing list of excludes defeats the purpose.

gwct added the `discussion` label Jan 9, 2025
gwct (Collaborator, Author) commented Jan 21, 2025

From our Slack discussion, @nathanweeks says:

> It may be that the "too many requests" responses are resulting from other GitHub Actions runners performing link checking on those URLs (many GitHub Actions runners behind few public IPs...). If this is happening regularly, we might add 429 to the default `--accept '200..=204, 429, 500'` HTTP status codes to effectively ignore it. Possibly a similar cause for the intermittent 403s? (Maybe doi.org added GitHub Actions runner public IPs to a denylist, at least temporarily?)
>
> Regarding the cached errors: we could add any errors that shouldn't be cached with the `--cache-exclude-status` option.
> Reference: https://github.com/lycheeverse/lychee?tab=readme-ov-file#commandline-parameters
>
> Or, if the broken links are few and not impacting the site, we could change the GitHub Actions workflow to run the link checker on demand instead of on every commit push & pull request.
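
To make those suggestions concrete, here is a minimal sketch of a workflow that runs the checker on demand only and applies the suggested status-code handling. It assumes lycheeverse/lychee-action; the trigger, cache wiring, input paths, and the exact `--cache-exclude-status` values are illustrative, not our current config.

```yaml
# Sketch: manual-only trigger plus the suggested --accept / --cache-exclude-status flags
on:
  workflow_dispatch:   # run the link checker on demand instead of on every push/PR

jobs:
  linkcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check links
        uses: lycheeverse/lychee-action@v2
        with:
          # --cache-exclude-status only has an effect if caching (--cache / .lycheecache) is enabled in this workflow
          args: >-
            --accept '200..=204, 429, 500'
            --cache-exclude-status '403, 429'
            './**/*.html'
```

Treating 429 (and possibly 403) as accepted keeps rate-limited domains like gnu.org and doi.org in the check without excluding them outright, while excluding those codes from the cache avoids having to delete the cache by hand after a flaky run.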
