
Link checker excludes getting out of hand #35

Open · gwct opened this issue Jan 9, 2025 · 1 comment
Labels: discussion (Discussing various things about the website)

Comments

gwct (Collaborator) commented Jan 9, 2025

We are getting continuous problems with link checker errors, and the solution each time seems to be adding more and more domains to the list of domains that are excluded from being checked. As the number of excluded domains grows, the utility of the link checker diminishes.

As of Jan. 9, 2025, these are the domains excluded from our link checker:

| Domain | Date added to exclude list | Reason for adding |
| --- | --- | --- |
| doi.org | 01.09.2025 | 403: Network error: Forbidden |
| academic.oup.com/nar | 01.09.2025 | 403: Network error: Forbidden |
| gnu.org | 10.25.2024 | 429: Network error: Too Many Requests |
| anaconda.org | 04.05.2024 | unknown |
| fonts.gstatic.com | 12.06.2023 | unknown |
| www.microsoft.com/en-us/microsoft-365/onedrive/online-cloud-storage | 12.08.2023 | timeout |

I think some of these are justifiably excluded, like fonts.gstatic.com and the Microsoft OneDrive one, but others, like doi.org, are very wide-ranging.
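
For concreteness, the excludes end up in the link checker invocation roughly like the sketch below. This assumes we call lycheeverse/lychee-action with repeated `--exclude` flags and check built HTML pages; the actual workflow file, input paths, and flag layout in our repo may differ.

```yaml
# Hypothetical link-check step showing the current exclusion list
- name: Check links
  uses: lycheeverse/lychee-action@v2
  with:
    args: >-
      --exclude 'doi\.org'
      --exclude 'academic\.oup\.com/nar'
      --exclude 'gnu\.org'
      --exclude 'anaconda\.org'
      --exclude 'fonts\.gstatic\.com'
      --exclude 'www\.microsoft\.com/en-us/microsoft-365/onedrive/online-cloud-storage'
      './**/*.html'
```

Each `--exclude` is a regex, so broad patterns like `doi\.org` skip every DOI link on the site, which is exactly the problem described above.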

We also run into errors with the cache, which has to be manually deleted here if some links threw errors in previous runs (at least I think that is the reason). These show up as cache errors.

This issue is for discussing possible solutions to these problems. Ultimately I think the link checker is a good thing to have, but it becomes tedious to deal with these same issues each time we build the page, and the growing list of excludes defeats the purpose.

gwct added the `discussion` label Jan 9, 2025
gwct (Collaborator, Author) commented Jan 21, 2025

From our Slack discussion, @nathanweeks says:

> It may be that the "too many requests" responses are resulting from other GitHub Actions runners performing link checking on those URLs (many GitHub Actions runners behind few public IPs...). If this is happening regularly, we might add 429 to the default `--accept '200..=204, 429, 500'` HTTP status codes to effectively ignore it. Possibly a similar cause for the intermittent 403s? (Maybe doi.org added GitHub Actions runner public IPs to a denylist, at least temporarily?)
>
> Regarding the cached errors: we could add any errors that shouldn't be cached with the `--cache-exclude-status` option.
> Reference: https://github.com/lycheeverse/lychee?tab=readme-ov-file#commandline-parameters
>
> Or, if the broken links are few and not impacting the site, we could change the GitHub Actions workflow to run the link checker on demand instead of on every commit push & pull request.
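
To make those suggestions concrete, here is a minimal sketch of a workflow that runs the checker on demand only and applies the suggested status-code handling. It assumes lycheeverse/lychee-action; the trigger, cache wiring, input paths, and the exact `--cache-exclude-status` values are illustrative, not our current config.

```yaml
# Sketch: manual-only trigger plus the suggested --accept / --cache-exclude-status flags
on:
  workflow_dispatch:   # run the link checker on demand instead of on every push/PR

jobs:
  linkcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check links
        uses: lycheeverse/lychee-action@v2
        with:
          # --cache-exclude-status only has an effect if caching (--cache / .lycheecache) is enabled in this workflow
          args: >-
            --accept '200..=204, 429, 500'
            --cache-exclude-status '403, 429'
            './**/*.html'
```

Treating 429 (and possibly 403) as accepted keeps rate-limited domains like gnu.org and doi.org in the check without excluding them outright, while excluding those codes from the cache avoids having to delete the cache by hand after a flaky run.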
