Add Polaris Dataset Lineage Visualization by guerler · Pull Request #21607 · galaxyproject/galaxy

guerler · 2026-01-17T11:32:40Z

Adds Polaris, a Galaxy visualization for exploring dataset lineage.

Polaris renders a focused, read only view of a selected dataset and recursively traverses its inputs and outputs up to a limited depth. It helps users understand how a dataset was produced, how it is connected within a history, and which jobs and intermediate datasets contribute to it.

Screen.Recording.2026-01-16.at.9.10.09.PM.mov

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mvdbeek · 2026-01-17T16:56:31Z

Thanks for this contribution! I've reviewed the Polaris codebase in the galaxy-visualizations repo and have some observations and questions.

Architecture Overview

Polaris uses a client-side Python agent (compiled to WebAssembly via Pyodide) that:

Recursively traverses dataset lineage by making individual Galaxy API calls
Has configurable depth limits (max_depth=20, max_per_level=20 by default)
Includes UUID-based deduplication to avoid re-fetching entities
Optionally calls an LLM for workflow analysis/summary generation

Concerns

1. API Request Volume

The traversal logic makes one API call per entity fetched (datasets and jobs). In the worst case with the default limits, this could be ~800 sequential Galaxy API calls per visualization render. While there are bounds in place, there is no explicit rate limiting on Galaxy API calls (only LLM calls have a token bucket limiter).
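
For readers unfamiliar with the term, here is a minimal sketch of a token bucket limiter of the kind mentioned above. It is not taken from the Polaris code; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for illustration. The same idea could in principle gate Galaxy API calls as well as LLM calls.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical limiter: `rate` tokens are refilled per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the caller must wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `try_acquire()` before each request and back off when it returns `False`, which bounds both the burst size and the sustained request rate.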

2. Architectural Fit with ChatGXY Framework

Given the new AI Agent Framework (PR #21434), I wonder if the lineage traversal logic would be better implemented as a server-side DatasetLineageAgent:

| Aspect | Current (Polaris) | Proposed (ChatGXY Agent) |
|---|---|---|
| Execution | Client-side (Pyodide/WASM) | Server-side Python |
| API calls | N+1 HTTP requests | Direct DB/service access |
| Rate limiting | Client-controlled | Server-controlled |
| LLM config | Separate | Shared `inference_services` |
| User context | API key in request | Full `ProvidesUserContext` |

A DatasetLineageAgent could return the full lineage graph in a single API call (POST /api/ai/agents/dataset-lineage), with Polaris becoming a thin visualization layer that just renders the response.

3. Long-term Sustainability

The current architecture adds:

Pyodide dependency (Python→WASM compilation)
Custom agent framework separate from ChatGXY
Declarative YAML pipeline system
Separate LLM configuration path

This creates maintenance burden parallel to the core agent infrastructure.

Questions

How do you envision this interacting with the ChatGXY agent framework? Should lineage be a capability the orchestrator agent can invoke?
Have you considered a hybrid approach where the heavy lifting (traversal, LLM calls) happens server-side via ChatGXY, and Polaris just handles visualization?
For public Galaxy instances, what safeguards exist against a user rendering lineage on a deeply nested history and generating significant API load?

Suggestions

Document the expected API call volume and load characteristics
Consider exposing depth/max_per_level limits in the UI with guidance
Open a follow-up issue to explore migrating the backend logic to a ChatGXY DatasetLineageAgent

I don't think this should block the PR - the visualization provides value and the bounds are reasonable. But I'd like to see a path toward consolidating this with the agent framework for long-term maintainability.

mvdbeek · 2026-01-17T16:57:00Z

that's claude btw, i absolutely think we can't hammer servers with 800 requests

mvdbeek · 2026-01-17T17:02:50Z

I also don't know how I feel about complex API interactions hosted in a repo that doesn't follow the same standards as galaxy. https://github.com/galaxyproject/galaxy-visualizations/blob/cb45e9e2ac1c6f72fcbc316936ada19fa597dc30/packages/polaris/polaris/polaris/modules/api/galaxy.py#L71 exposes the key in most server logs, that's not a good pattern, use the header as a better practice.

guerler · 2026-01-17T18:10:37Z

@mvdbeek thanks for the detailed review. I appreciate the careful focus on operational impact and long-term alignment.

API request volume

You are right that the traversal is N+1 by design. The intent is to keep this read-only, client-side, and bounded so it can ship as a visualization without introducing new server-side execution paths or privileged access. Based on your feedback, I tightened the limits further:

traversal depth capped at 10
each level capped at 10 connected entities
fixed 500 ms delay between Galaxy API calls
UUID-based deduplication to avoid re-fetching entities

These limits are fixed rather than user-configurable to keep behavior predictable on public instances. I will document the expected call volume and am happy to reduce bounds further if needed.
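
To make the combined effect of these limits concrete, here is a rough sketch of a bounded breadth-first traversal. It is not the Polaris implementation: the function `traverse`, the `fetch_neighbors` callback (standing in for one Galaxy API call per entity), and the constant names are hypothetical. Only the numbers 10, 10, and 500 ms come from the list above.

```python
import time
from typing import Callable, Iterable, List

MAX_DEPTH = 10       # traversal depth cap
MAX_PER_LEVEL = 10   # cap on connected entities kept per level
DELAY_SECONDS = 0.5  # fixed pause between Galaxy API calls


def traverse(root_id: str,
             fetch_neighbors: Callable[[str], Iterable[str]],
             sleep: Callable[[float], None] = time.sleep,
             max_depth: int = MAX_DEPTH,
             max_per_level: int = MAX_PER_LEVEL,
             delay: float = DELAY_SECONDS) -> List[str]:
    """Return the ids that were fetched, in fetch order (hypothetical helper)."""
    seen = {root_id}           # deduplication: never fetch the same id twice
    fetched: List[str] = []
    level = [root_id]
    for _ in range(max_depth):
        next_level: List[str] = []
        for entity_id in level:
            if fetched:
                sleep(delay)   # throttle every call after the first
            fetched.append(entity_id)
            for neighbor in fetch_neighbors(entity_id):  # one API call per entity
                if neighbor not in seen and len(next_level) < max_per_level:
                    seen.add(neighbor)
                    next_level.append(neighbor)
        if not next_level:
            break
        level = next_level
    return fetched
```

Under these assumptions the number of calls is bounded by roughly `max_depth * max_per_level`, so about 100 with the values above, and the fixed delay spreads those calls over tens of seconds rather than reducing their count.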

This is also a transient visualization. Visualizations cannot be saved by design, so the traversal leaves no persistent artifacts, jobs, or server-side state.

Authentication and API standards

To be completely explicit: Galaxy visualizations never have access to Galaxy API keys.
They authenticate exclusively via the user session and operate within normal visualization permission boundaries. There is no mechanism for a visualization to receive, store, or transmit an API key.

The query-parameter auth path was not used by the visualization at all. It was only reachable from the standalone CLI when GALAXY_KEY is explicitly provided via environment variables, and is never invoked by the Vue plugin or Galaxy-served execution.

That said, I agree this pattern is confusing to keep in the same repo. I opened a follow-up PR that removes query-parameter auth entirely and switches the CLI to header-based authentication:

galaxyproject/galaxy-visualizations#145
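
As an illustration of the difference (a sketch only, not the code from that follow-up PR): the key travels in the `x-api-key` request header, which Galaxy's API accepts, instead of a `?key=...` query parameter, so it does not end up in URL-based access logs. The helper name `build_galaxy_request` and the example host are hypothetical.

```python
from urllib.parse import urljoin
from urllib.request import Request


def build_galaxy_request(base_url: str, path: str, api_key: str) -> Request:
    """Hypothetical helper: build a request that authenticates via header."""
    url = urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))
    # The key is sent as a header, so the URL itself carries no credentials.
    return Request(url, headers={"x-api-key": api_key})


request = build_galaxy_request("https://galaxy.example.org", "/api/datasets/abc", "secret")
```

The resulting object can be passed to `urllib.request.urlopen`; the URL that servers and proxies typically log is just the dataset path.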

Architectural fit with ChatGXY

It is worth clarifying up front that the API traversal itself is not AI-driven. The traversal is a deterministic, read-only walk over Galaxy APIs. The use of LLMs is limited to optional summarization of already-collected data and is not required for the visualization to function.

Polaris was designed as an exploratory, self-contained visualization focused on observation and comprehension rather than execution. The core logic is written in Python, is UI-less, and produces structured Markdown output to keep traversal and reasoning explicit and inspectable.

While the concrete API traversal here is intentionally client-side, the declarative structure and approach are meant to be reusable conceptually if similar capabilities are later exposed server-side.

A reasonable framing is:

near term: Polaris remains a bounded client-side visualization with conservative limits
longer term: if lineage traversal becomes a shared server-side capability, Polaris can evolve into a consumer or renderer of that output

Long-term sustainability

The Pyodide-based approach is a deliberate tradeoff to keep logic transparent, versioned with the visualization, and non-privileged. The maintenance cost is local to the plugin and does not add burden to Galaxy core. If lineage logic later moves server-side, Polaris naturally simplifies.

Overall, I believe the visualization provides value today with clear safeguards in place, and the follow-up changes address the concerns you raised while keeping the PR scoped and reviewable.

mvdbeek · 2026-01-18T10:52:20Z

> traversal depth capped at 10
> each level capped at 10 connected entities

that is still way too much. 5 requests are probably the maximum we can do without affecting the rest of the interface, 100 is not workable.

> It is worth clarifying up front that the API traversal itself is not AI-driven. The traversal is a deterministic, read-only walk over Galaxy APIs. The use of LLMs is limited to optional summarization of already-collected data and is not required for the visualization to function.

that only strengthens the case for making the API a core galaxy feature

> near term: Polaris remains a bounded client-side visualization with conservative limits

that i want to see 😆

guerler · 2026-01-18T13:01:48Z

@mvdbeek Sounds good to me. Thanks again for the review. I added a hard cap of 25 total traversal fetch requests in galaxyproject/galaxy-visualizations#146. I agree that lineage-style traversal ultimately belongs behind a core API endpoint, at which point the client-side traversal can be dropped entirely.
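
A hard cap like this can be sketched as a small wrapper around the fetch function. This is an illustration, not the code in galaxy-visualizations#146; the class `FetchBudget` and its attributes are hypothetical, and only the limit of 25 comes from the comment above.

```python
from typing import Any, Callable, Optional


class FetchBudget:
    """Hypothetical wrapper that refuses further fetches once a total limit is hit."""

    def __init__(self, fetch: Callable[[Any], Any], limit: int = 25):
        self.fetch = fetch
        self.limit = limit
        self.used = 0
        self.truncated = False  # set when at least one fetch was refused

    def __call__(self, entity_id: Any) -> Optional[Any]:
        if self.used >= self.limit:
            # Budget exhausted: skip the request and remember that results are partial.
            self.truncated = True
            return None
        self.used += 1
        return self.fetch(entity_id)
```

The `truncated` flag is what a UI could use to show a warning that the lineage view is incomplete.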

jmchilton · 2026-01-19T16:01:45Z

> I agree that lineage-style traversal ultimately belongs behind a core API endpoint

I would go further and say this functionality is very central to what we should be doing in Galaxy - both in terms of what we state as our core mission and in terms of what Anton and Jenn state at every team meeting as very pragmatic core concerns. I think it should not be a visualization - I think it should be a Vue component that we can integrate natively into the UI. It should appear as a link or view anywhere we are displaying dataset metadata - not buried in a list of visualization frameworks. That work is harder, requires a tougher review process, and will likely be bike-shedded, which makes the viz framework a much easier and more streamlined way to get such a change in - so I get why you're going this route for sure. I just want to advocate for a long-term, more deeply integrated approach.

I'm happy to use the viz framework as a test bed for these things but I think it belongs in native Galaxy longer term (echoing my comments on #20882 but even more so here I think).

guerler · 2026-01-26T09:22:02Z

I think we all agree that lineage traversal is core Galaxy functionality and should ultimately live behind a native API and UI, not as a visualization. With that in mind, I am shifting focus to implementing a core History Graph API endpoint. Please let me know if there is already work or concrete plans in this area, otherwise I will open a focused issue to track it.

guerler added this to the 26.0 milestone Jan 17, 2026

guerler added kind/feature area/visualizations labels Jan 17, 2026

guerler marked this pull request as ready for review January 17, 2026 11:32

guerler requested a review from bgruening January 17, 2026 11:34

guerler force-pushed the polaris.000 branch from 96fa03b to 8ab8abd Compare January 17, 2026 14:05

Add Polaris Dataset Lineage Visualization

4d3948e

guerler force-pushed the polaris.000 branch from 8ab8abd to 4d3948e Compare January 17, 2026 17:09

Reduce traverse depth to 10, add delay of 500ms

4d9821f

guerler requested review from mvdbeek and removed request for bgruening January 17, 2026 18:19

Cap traversal fetch requests at 25

2fa73f7

Add truncation warning message for user

f67166d

guerler marked this pull request as draft January 23, 2026 11:43

guerler modified the milestones: 26.0, 26.1 Jan 29, 2026

guerler mentioned this pull request Feb 12, 2026

Make progress towards History Graph View #21659

Open

guerler wants to merge 4 commits into galaxyproject:dev from guerler:polaris.000


3 participants
