linkcheck performance: downloading page multiple times when checking anchors #4303

@rleigh-codelibre

Description

Problem

  • If my Sphinx documentation contains multiple links with anchors pointing at the same web page, linkcheck downloads that page once per anchor it checks
  • This scales very badly. With many hundreds or thousands of anchors (e.g. in automatically generated documentation), a page several megabytes in size is downloaded once per link, which can add up to multiple gigabytes of traffic

Procedure to reproduce the problem

  • Create a document with multiple links to anchors on the same web page
  • Run the link checker; it fetches the page once per link rather than once per page

Expected results

  • I would suggest that the link checker cache the anchors found on each web page, so that every page is downloaded at most once and every link is checked exactly once. It could build a dictionary keyed by page URL, storing the anchors for each page as a list or dict. Since we know up front which of our links have anchors, we can skip storing anchors for pages where none are needed.
  • There may be other, better ways of doing this; I'm not familiar with the internals of the link checker.
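The caching idea above could be sketched roughly as follows. This is not how linkcheck is currently implemented; `get_anchors`, `AnchorCollector`, and the `fetch` callable are all hypothetical names used only for illustration:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect all fragment targets (id attributes and <a name=...>) in one parse."""
    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for attr, value in attrs:
            # Both id="..." on any element and <a name="..."> define anchors.
            if value and (attr == "id" or (tag == "a" and attr == "name")):
                self.anchors.add(value)

# Hypothetical cache: maps a page URL to the set of anchors it defines,
# so each page is downloaded and parsed at most once.
_anchor_cache = {}

def get_anchors(url, fetch):
    """Return the anchor set for `url`, fetching and parsing only on first use.

    `fetch` is a callable taking the URL and returning the page body as a string.
    """
    if url not in _anchor_cache:
        parser = AnchorCollector()
        parser.feed(fetch(url))
        _anchor_cache[url] = parser.anchors
    return _anchor_cache[url]
```

With this in place, checking `page.html#intro` and `page.html#details` would hit the network once: the first call populates the cache, and every subsequent anchor check against the same URL is a set lookup.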

Environment info

  • OS: Any
  • Python version: Any
  • Sphinx version: Any
