
Parallelize tool source parsing to reduce startup wall time?#21633

Draft
guerler wants to merge 1 commit into galaxyproject:dev from guerler:parallel_tool_loading

Conversation

Contributor

@guerler guerler commented Jan 21, 2026

Explores a targeted mitigation for long startup times on cold, high-latency filesystems by parallelizing tool XML source parsing. This affects only the IO-bound XML parsing phase and provides no benefit on warm caches or fast local storage. The same pre-parsed tool source objects could later be reused by a longer-lived service that owns tool metadata outside of process startup.
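The concurrency pattern the PR describes can be sketched as follows. This is a minimal illustration, not the PR's code: `parse_tool_xml` and the on-disk layout are hypothetical stand-ins for Galaxy's `get_tool_source`; only the thread-pool-over-IO-bound-parsing shape mirrors the change.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from xml.etree import ElementTree


def parse_tool_xml(path):
    # On a cold, high-latency filesystem this blocking read dominates
    # wall time, so overlapping many reads with threads hides the IO wait.
    return ElementTree.parse(path).getroot()


def parse_all(paths, workers=16):
    # Threads rather than processes: the phase is IO bound, and the parsed
    # objects must stay in-process for the (still serial) Tool construction.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_tool_xml, paths))


def demo():
    # Build a few toy tool XML files and parse them in parallel.
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(4):
            p = Path(d) / f"tool{i}.xml"
            p.write_text(f'<tool id="t{i}" name="Tool {i}"/>')
            paths.append(p)
        return [root.get("id") for root in parse_all(paths)]


print(demo())  # -> ['t0', 't1', 't2', 't3']
```

Note that `pool.map` preserves input order, so downstream serial processing sees tools in the same order as a sequential loop would.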

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@guerler guerler added this to the 26.0 milestone Jan 21, 2026
@guerler guerler force-pushed the parallel_tool_loading branch from d86a051 to f86388b on January 21, 2026 10:31
MutableMapping,
Sequence,
)
from concurrent.futures import ThreadPoolExecutor
Member
Highly doubtful that this will do anything for performance; this is CPU bound. And I would urge you not to switch to ProcessPoolExecutor. If you want to do anything here, I would pick one of the options in #21247 and/or do some profiling on main's toolbox as available through CVMFS.

Contributor Author

It does not seem to be CPU bound on cold CVMFS. Profiling shows ~99 percent of time spent in blocking file reads during XML source parsing. Threading is used to overlap IO wait in this phase. No ProcessPoolExecutor is used, and Tool object creation remains serial. The executor is local and optional.

8,869 tools from cold cache:

| Mode       | Workers | Wall clock | Speedup |
|------------|---------|------------|---------|
| Sequential | 1       | 30:54      | 1x      |
| Parallel   | 16      | 3:00       | 10x     |

https://github.com/guerler/galaxy/blob/tool_profiler/scripts/tool_loading_profile.md

This is orthogonal to #21247 and aligned with it. Parallel source parsing reduces cold-start time now and directly applies to a future toolbox microservice that preloads and serves tool sources independently, imho.
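The near-linear speedup claimed for the IO-bound phase can be illustrated with a toy benchmark. This is a hedged sketch, not a measurement of Galaxy itself: `time.sleep` stands in for blocking reads on a cold, high-latency filesystem, and the 16-worker count mirrors the table above while the other numbers are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read(_):
    # Simulate a blocking, high-latency filesystem read (e.g. cold CVMFS).
    time.sleep(0.05)


def run(workers, n=32):
    # Time n simulated reads with the given number of worker threads.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_read, range(n)))
    return time.perf_counter() - start


sequential = run(1)
threaded = run(16)
print(f"speedup: ~{sequential / threaded:.0f}x")
```

Because the workers spend essentially all their time waiting on IO, the GIL does not serialize them, which is why threads (rather than processes) suffice for this phase.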

Member

@mvdbeek mvdbeek Jan 26, 2026


https://github.com/guerler/galaxy/blob/tool_profiler/scripts/tool_loading_profile.md looks comprehensive but does not reflect what I see, which is

Galaxy app startup finished (1160363.393 ms)

or 19 minutes for a cold startup with 10 workers on an M2 Mac. What exactly did you time? Your markdown document says

get_tool_source (I/O) | 1826.48s (98.6%)

but... that's not really doing much of the work, and dumping with py-spy shows most activity in building pydantic models. Note also that:

Test methodology clears system-wide CVMFS cache, which may not represent real cold start scenarios in shared environments

desc: |
  If true, tool XML files will be parsed in parallel during Galaxy startup.
  This can reduce startup time for instances with many tools by parallelizing
  the XML reading and macro expansion phase. Set to false if you experience
Member

The bottleneck is the pydantic model construction; that's why job and workflow handlers boot as normal.

Contributor Author

On cold CVMFS the bottleneck is not pydantic model construction. For the full tool set it accounts for ~0.2 percent of total load time. The 30+ minute startup is dominated by get_tool_source IO during XML source parsing.

@guerler guerler force-pushed the parallel_tool_loading branch 4 times, most recently from e3542de to d7ab6de on January 22, 2026 14:30
@guerler guerler force-pushed the parallel_tool_loading branch from d7ab6de to 3c2844c on January 22, 2026 14:38
@guerler guerler modified the milestones: 26.0, 26.1 Jan 22, 2026