
Caching linkcheck passing results and age #13568

@LecrisUT

Description

Is your feature request related to a problem? Please describe.
The usual rate-limiting issues, which are becoming harder to work around without triggering anti-AI scraping defenses.

Describe the solution you'd like
The idea is to keep a cached table of links that were checked in previous runs, together with a timestamp of when each check was done. A configuration option could then be exposed for how old a cache entry may get before the linkcheck is re-run, plus some random fluctuation.
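For illustration, such options could look something like this in conf.py (the names below are hypothetical; nothing like them exists in Sphinx today):

linkcheck_cache_max_age = 30 * 60 * 60  # seconds before a cached result expires
linkcheck_cache_jitter = 0.1            # up to ±10% random fluctuation on the expiry

The jitter would spread re-checks out, so that a whole batch of links cached in the same run does not all expire at once.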

This file can then be stored in the GH Actions cache and reused across other PRs. My understanding is that only one PR needs to update this table on a "successful" run, and it will propagate to the other PRs even if the original one is not merged yet.
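For example, a workflow could persist the file with actions/cache. A rough sketch, assuming the linkcheck builder writes its output to _build/linkcheck:

- uses: actions/cache@v4
  with:
    path: _build/linkcheck/linkcheck_cache.json
    key: linkcheck-cache-${{ github.run_id }}
    restore-keys: |
      linkcheck-cache-

Using the run id in the key together with a restore-keys prefix is the usual pattern for a cache that should be re-saved on every run.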


Here is a silly POC that you can run locally:

import datetime
import json
from json import JSONDecodeError

from sphinx.application import Sphinx

linkcheck_ignore = [
    # Sentinel location: URIs rewritten to this path by `linkcheck_check_cache`
    # below are skipped by the linkcheck builder
    r'/dev/null',
]

def linkcheck_check_cache(app: Sphinx, uri: str) -> str | None:
    # Load the cache of previously checked URIs (uri -> POSIX timestamp)
    cache_file = app.outdir / "linkcheck_cache.json"
    now = datetime.datetime.now(datetime.UTC)
    cache_file.touch()
    with cache_file.open("rt") as f:
        try:
            cache_data = json.load(f)
        except JSONDecodeError:
            cache_data = {}
    # Check whether we have cached this uri yet
    if uri in cache_data:
        # Check whether the cached result is recent enough
        cached_time = datetime.datetime.fromtimestamp(cache_data[uri], datetime.UTC)
        age = (now - cached_time).total_seconds()
        if age < 108000.0:  # 30 hours
            # The cache is recent enough, so skip this uri by rewriting it
            # to the sentinel location matched by the hard-coded regex above
            return "/dev/null"
    # The uri is either new or its cache entry has expired: let the check
    # run and record the new timestamp
    cache_data[uri] = now.timestamp()
    with cache_file.open("wt") as f:
        json.dump(cache_data, f)
    return uri


def setup(app: Sphinx) -> None:
    # Rewrite cached URIs before the linkcheck builder processes them
    app.connect("linkcheck-process-uri", linkcheck_check_cache)
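To try it, drop the snippet into conf.py and run `sphinx-build -b linkcheck . _build/linkcheck` twice: the second run should rewrite every URI checked in the first run to /dev/null and skip it.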

It is not ideal: it does not hook into when the link checks are actually performed, nor does it account for their success or failure. It can also get clogged quite easily, because old results are never deleted (a crude pruning pass is sketched below). The only way to fix the former properly would be to either expose a new hook for the linkcheck builder or implement this upstream. WDYT?
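As a partial mitigation for the clogging, the same handler could prune stale entries whenever it rewrites the cache file. A minimal sketch, reusing `cache_data` and `now` from the POC above with an arbitrary one-week cutoff:

# Drop entries that have not been re-confirmed within the last week
week = 7 * 24 * 3600
cache_data = {
    cached_uri: ts
    for cached_uri, ts in cache_data.items()
    if (now - datetime.datetime.fromtimestamp(ts, datetime.UTC)).total_seconds() < week
}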
