Skip to content

feat: add persistent worker for sphinxdocs #2938

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Jun 5, 2025

Conversation

kaycebasques
Copy link
Collaborator

@kaycebasques kaycebasques commented May 28, 2025

This implements a simple, serialized persistent worker for Sphinxdocs with several optimizations. It is enabled by default.

  • The worker computes what inputs have changed, allowing Sphinx to only rebuild what
    is necessary.
  • Doctrees are written to a separate directory so they are retained between builds.
  • The worker tells Sphinx to write output to an internal directory, then copies it
    to the expected Bazel output directory afterwards. This allows Sphinx to only
    write output files that need to be updated.

This works by having the worker compute what files have changed and having a Sphinx
extension use the get-env-outdated event to tell Sphinx which files have changed.
The extension is based on https://pwrev.dev/294057, but re-implemented to be
in-memory as part of the worker instead of a separate extension projects must configure.

For rules_python's doc building, this reduces incremental building from about 8 seconds
to about 0.8 seconds. From what I can tell, about half the time is spent generating
doctrees, and the other half generating the output files.

Worker mode is enabled by default and can be disabled on the target or by adjusting
the Bazel flags controlling execution strategy. Docs added to explain how.

Because --doctree-dir is now always specified and outside the output dir,
non-worker invocations can benefit, too, if run without sandboxing. Docs added to
explain how to do this.

Along the way:

  • Remove --write-all and --fresh-env from run args. This lets direct
    invocations benefit from the normal caching Sphinx does.
  • Change the args formatting to --foo=bar so they are a single element; just
    a bit nicer to see when debugging.

Work towards #2878, #2879

Copy link
Collaborator

@rickeylev rickeylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed some changes that I think make this work as a basic incremental builder that relies on the builtin timestamp based detection. Not perfect, but progress!

The key change was to move the doctree directory out of the builder output directory. The builder output directory gets deleted every time the action has to re-run, which would delete the doctree files, too. Now that the doctree files are preserved between worker requests, it's able to reuse them.

@rickeylev
Copy link
Collaborator

I had an idea for registering an extension without modifying conf.py

sphinx.application.builtin_extensions is a list of extensions it always loads. We can implement an extension and add it to that list.

Communicating from the worker loop to the extension is the tricky part. I see in the prototype, digest.json is being used by sticking it in the doctree directory. This is an appealing idea! A regular global would work, too, which would avoid any serialization overhead. Regardless, however it works, the content has to be specific to the particular src/out directory because a worker could service different requests for different doc roots.

Alternatively, the extension could just hash everything itself. This would at least work in a more general sense, better than timestamps.

Copy link
Collaborator

@rickeylev rickeylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned up the implementation heavily. Two things of note.

The worker installs an in-memory extension that implements the same basic get-env-outdated logic as in the pigweed extension. I'm not 100% sure that simple logic is 100% correct because the timestamp implementation does a lot more stuff. But, my testing didn't show errors.

There was a certain appeal to the digest.json method the original impl / pigweed extension were using, so I had the worker write a similar looking file. It passes the path to the file by using --define to set a value in the config file. Extensions can find it via app.config and looking up that config value.

@rickeylev
Copy link
Collaborator

Another finding: adding no-sandbox will help incremental building. This comes back to the lack of a persistent doctree dir. The env object is pickled into the doctree dir; the env object is what persists information (such as what files already exist, their last modify times, etc).

Even if --doctree-dir is specified, because it's not a declared input, when a sandbox'd execution occurs, the prior run's doctree directory isn't copied into the sandbox. It can't be because it's not a File input. And thus, Sphinx starts with an empty env, sees all files as new, and has to start from scratch.

Setting no-sandbox allows the files from a previous run to become visible. This saved about half the execution time (8s -> 4s) when building rule_python's docs.

Additionally, with the worker-based impl, which is able to fully persist the doctree dir and output dir between runs, incremental building is way faster: 8s -> 0.8 seconds

@rickeylev rickeylev removed the request for review from aignas June 3, 2025 23:04
@kaycebasques
Copy link
Collaborator Author

FYI I'm on staycation this week. Will review if I have some free time otherwise can review next week

@rickeylev
Copy link
Collaborator

Oh, to clarify, I'm not blocked on needing review. I can just merge it.

@rickeylev rickeylev added this pull request to the merge queue Jun 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 5, 2025
@rickeylev rickeylev added this pull request to the merge queue Jun 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 5, 2025
@rickeylev rickeylev added this pull request to the merge queue Jun 5, 2025
Merged via the queue into bazel-contrib:main with commit 0498664 Jun 5, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants