
Caching linkcheck passing results and age #13568

@LecrisUT

Description

Is your feature request related to a problem? Please describe.
The usual rate-limiting issues, which are becoming harder to work around without triggering anti-AI scraping defenses.

Describe the solution you'd like
The idea is to keep a cached table of links that were checked in previous runs, together with a timestamp of when each check was done. A configuration option could then be exposed for how old a cache entry may get before the linkcheck is re-run, plus some random fluctuation.
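For illustration, such options could look something like this in conf.py (the names below are hypothetical; nothing like them exists in Sphinx today):

linkcheck_cache_max_age = 30 * 60 * 60  # seconds before a cached result expires
linkcheck_cache_jitter = 0.1            # up to ±10% random fluctuation on the expiry

The jitter would spread re-checks out, so that a whole batch of links cached in the same run does not all expire at once.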

This file can then be stored in the GH Actions cache and reused across other PRs. My understanding is that only one PR needs to update this table on a "successful" run, and it will propagate to the other PRs even if the original one is not merged yet.
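For example, a workflow could persist the file with actions/cache. A rough sketch, assuming the linkcheck builder writes its output to _build/linkcheck:

- uses: actions/cache@v4
  with:
    path: _build/linkcheck/linkcheck_cache.json
    key: linkcheck-cache-${{ github.run_id }}
    restore-keys: |
      linkcheck-cache-

Using the run id in the key together with a restore-keys prefix is the usual pattern for a cache that should be re-saved on every run.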


Here is a silly POC that you can run locally:

import datetime
import json
from json import JSONDecodeError

from sphinx.application import Sphinx

linkcheck_ignore = [
    # Sentinel location: URIs rewritten to this path by `linkcheck_check_cache`
    # below are skipped by the linkcheck builder
    r'/dev/null',
]

def linkcheck_check_cache(app: Sphinx, uri: str) -> str | None:
    # Load the cache of previously checked URIs (uri -> POSIX timestamp)
    cache_file = app.outdir / "linkcheck_cache.json"
    now = datetime.datetime.now(datetime.UTC)
    cache_file.touch()
    with cache_file.open("rt") as f:
        try:
            cache_data = json.load(f)
        except JSONDecodeError:
            cache_data = {}
    # Check whether we have cached this uri yet
    if uri in cache_data:
        # Check whether the cached result is recent enough
        cached_time = datetime.datetime.fromtimestamp(cache_data[uri], datetime.UTC)
        age = (now - cached_time).total_seconds()
        if age < 108000.0:  # 30 hours
            # The cache is recent enough, so skip this uri by rewriting it
            # to the sentinel location matched by the hard-coded regex above
            return "/dev/null"
    # The uri is either new or its cache entry has expired: let the check
    # run and record the new timestamp
    cache_data[uri] = now.timestamp()
    with cache_file.open("wt") as f:
        json.dump(cache_data, f)
    return uri


def setup(app: Sphinx) -> None:
    # Rewrite cached URIs before the linkcheck builder processes them
    app.connect("linkcheck-process-uri", linkcheck_check_cache)
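To try it, drop the snippet into conf.py and run `sphinx-build -b linkcheck . _build/linkcheck` twice: the second run should rewrite every URI checked in the first run to /dev/null and skip it.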

It is not ideal: it does not hook into when the link checks are actually performed, nor does it account for their success or failure. It can also get clogged quite easily, because old results are never deleted (a crude pruning pass is sketched below). The only way to fix the former properly would be to either expose a new hook for the linkcheck builder or implement this upstream. WDYT?
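As a partial mitigation for the clogging, the same handler could prune stale entries whenever it rewrites the cache file. A minimal sketch, reusing `cache_data` and `now` from the POC above with an arbitrary one-week cutoff:

# Drop entries that have not been re-confirmed within the last week
week = 7 * 24 * 3600
cache_data = {
    cached_uri: ts
    for cached_uri, ts in cache_data.items()
    if (now - datetime.datetime.fromtimestamp(ts, datetime.UTC)).total_seconds() < week
}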
