Add Polaris Dataset Lineage Visualization #21607

Draft
guerler wants to merge 4 commits into galaxyproject:dev from guerler:polaris.000

@guerler
Contributor

@guerler guerler commented Jan 17, 2026

Adds Polaris, a Galaxy visualization for exploring dataset lineage.

Polaris renders a focused, read-only view of a selected dataset and recursively traverses its inputs and outputs up to a limited depth. It helps users understand how a dataset was produced, how it is connected within a history, and which jobs and intermediate datasets contribute to it.

Screen.Recording.2026-01-16.at.9.10.09.PM.mov

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@guerler guerler added this to the 26.0 milestone Jan 17, 2026
@guerler guerler marked this pull request as ready for review January 17, 2026 11:32
@guerler guerler requested a review from bgruening January 17, 2026 11:34
@mvdbeek
Member

mvdbeek commented Jan 17, 2026

Thanks for this contribution! I've reviewed the Polaris codebase in the galaxy-visualizations repo and have some observations and questions.

Architecture Overview

Polaris uses a client-side Python agent (run in the browser via Pyodide, a CPython build compiled to WebAssembly) that:

  • Recursively traverses dataset lineage by making individual Galaxy API calls
  • Has configurable depth limits (max_depth=20, max_per_level=20 by default)
  • Includes UUID-based deduplication to avoid re-fetching entities
  • Optionally calls an LLM for workflow analysis/summary generation
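
For illustration, a minimal sketch of a bounded, deduplicating traversal of the kind listed above. This is not Polaris's actual code: `traverse_lineage` and the injected `neighbors` callback (standing in for one Galaxy API call per entity) are hypothetical names, and the in-memory `GRAPH` is made-up data.

```python
from typing import Callable, Dict, Iterable, List, Set


def traverse_lineage(
    root: str,
    neighbors: Callable[[str], Iterable[str]],
    max_depth: int = 20,
    max_per_level: int = 20,
) -> Dict[str, int]:
    """Breadth-first walk returning {entity_id: depth}.

    `neighbors` is an assumed callback that returns the ids connected to
    an entity (in a real client this would be one API request).
    """
    seen: Set[str] = {root}
    depths: Dict[str, int] = {root: 0}
    frontier: List[str] = [root]
    for depth in range(1, max_depth + 1):
        next_frontier: List[str] = []
        for entity_id in frontier:
            if len(next_frontier) >= max_per_level:
                break  # level is full: stop issuing further lookups
            for other in neighbors(entity_id):
                if other in seen:
                    continue  # deduplication: never revisit a known id
                if len(next_frontier) >= max_per_level:
                    break
                seen.add(other)
                depths[other] = depth
                next_frontier.append(other)
        if not next_frontier:
            break  # nothing new was discovered at this depth
        frontier = next_frontier
    return depths


# Made-up lineage: dataset d1 -> job j1 -> datasets d2, d3 -> job j2
GRAPH = {"d1": ["j1"], "j1": ["d1", "d2", "d3"], "d2": ["j2"], "d3": [], "j2": ["d2"]}
lineage = traverse_lineage("d1", lambda e: GRAPH.get(e, []))
```

Injecting `neighbors` keeps the walk testable without a Galaxy server; the limits bound how many ids are accepted per level, which in turn bounds the number of lookups on the next level.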

Concerns

1. API Request Volume

The traversal logic makes one API call per entity fetched (datasets and jobs). In the worst case with the default limits, this could mean ~800 sequential Galaxy API calls per visualization render. While there are bounds in place, there is no explicit rate limiting on Galaxy API calls (only LLM calls have a token bucket limiter).
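
As a point of reference, a token bucket limiter can be sketched in a few lines. This is a generic illustration, not the limiter Polaris uses for LLM calls; the `TokenBucket` class, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A client would call `try_acquire()` before each Galaxy API request and wait or abort when it returns `False`.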

2. Architectural Fit with ChatGXY Framework

Given the new AI Agent Framework (PR #21434), I wonder if the lineage traversal
logic would be better implemented as a server-side DatasetLineageAgent:

| Aspect | Current (Polaris) | Proposed (ChatGXY Agent) |
| --- | --- | --- |
| Execution | Client-side (Pyodide/WASM) | Server-side Python |
| API calls | N+1 HTTP requests | Direct DB/service access |
| Rate limiting | Client-controlled | Server-controlled |
| LLM config | Separate | Shared inference_services |
| User context | API key in request | Full ProvidesUserContext |

A DatasetLineageAgent could return the full lineage graph in a single API call (POST /api/ai/agents/dataset-lineage), with Polaris
becoming a thin visualization layer that just renders the response.

3. Long-term Sustainability

The current architecture adds:

  • Pyodide dependency (CPython interpreter compiled to WASM)
  • Custom agent framework separate from ChatGXY
  • Declarative YAML pipeline system
  • Separate LLM configuration path

This creates maintenance burden parallel to the core agent infrastructure.

Questions

  1. How do you envision this interacting with the ChatGXY agent framework? Should lineage be a capability the orchestrator agent can
    invoke?

  2. Have you considered a hybrid approach where the heavy lifting (traversal, LLM calls) happens server-side via ChatGXY, and Polaris
    just handles visualization?

  3. For public Galaxy instances, what safeguards exist against a user rendering lineage on a deeply nested history and generating
    significant API load?

Suggestions

  • Document the expected API call volume and load characteristics
  • Consider exposing depth/max_per_level limits in the UI with guidance
  • Open a follow-up issue to explore migrating the backend logic to a ChatGXY DatasetLineageAgent

I don't think this should block the PR - the visualization provides value and the bounds are reasonable. But I'd like to see a path
toward consolidating this with the agent framework for long-term maintainability.

@mvdbeek
Member

mvdbeek commented Jan 17, 2026

that's claude btw, i absolutely think we can't hammer servers with 800 requests

@mvdbeek
Member

mvdbeek commented Jan 17, 2026

I also don't know how I feel about complex API interactions hosted in a repo that doesn't follow the same standards as Galaxy. https://github.com/galaxyproject/galaxy-visualizations/blob/cb45e9e2ac1c6f72fcbc316936ada19fa597dc30/packages/polaris/polaris/polaris/modules/api/galaxy.py#L71 puts the API key in the query string, which exposes it in most server logs. That's not a good pattern; pass the key in a header instead.
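
A minimal sketch of header-based authentication using only the standard library. It assumes Galaxy accepts the key in an `x-api-key` request header (worth confirming against the Galaxy API documentation); `build_galaxy_request` is a hypothetical helper, not a function from the Polaris code.

```python
import urllib.request


def build_galaxy_request(base_url: str, path: str, api_key: str) -> urllib.request.Request:
    """Build a GET request that carries the API key in a header, not in the URL.

    Assumption: the server reads the key from the `x-api-key` header.
    """
    url = f"{base_url.rstrip('/')}/{path.lstrip('/')}"
    # The key never appears in the URL, so it is not written to access logs
    # that record request lines.
    return urllib.request.Request(url, headers={"x-api-key": api_key})


request = build_galaxy_request("https://galaxy.example.org/", "/api/histories", "secret-key")
```

The request can then be sent with `urllib.request.urlopen(request)`; the example host is a placeholder.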

@guerler
Contributor Author

guerler commented Jan 17, 2026

@mvdbeek thanks for the detailed review. I appreciate the careful focus on operational impact and long-term alignment.

API request volume

You are right that the traversal is N+1 by design. The intent is to keep this read-only, client-side, and bounded so it can ship as a visualization without introducing new server-side execution paths or privileged access. Based on your feedback, I tightened the limits further:

  • traversal depth capped at 10
  • each level capped at 10 connected entities
  • fixed 500 ms delay between Galaxy API calls
  • UUID-based deduplication to avoid re-fetching entities

These limits are fixed rather than user-configurable to keep behavior predictable on public instances. I will document the expected call volume and am happy to reduce bounds further if needed.
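
A fixed delay between calls can be sketched as follows. This is an illustration under assumptions, not the code in the plugin: `FixedDelayThrottle` is a hypothetical name, and the `clock` and `sleep` parameters are injected only so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class FixedDelayThrottle:
    """Enforce a minimum gap of `delay_s` seconds between consecutive calls."""

    def __init__(
        self,
        delay_s: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.delay_s = delay_s
        self.clock = clock
        self.sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Sleep until the gap has elapsed; return the number of seconds slept."""
        now = self.clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.delay_s - (now - self._last))
            if pause > 0:
                self.sleep(pause)
        self._last = now + pause
        return pause
```

Calling `throttle.wait()` immediately before each request spaces requests at least 500 ms apart with the default setting; the first call never sleeps.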

This is also a transient visualization. Visualizations cannot be saved by design, so the traversal leaves no persistent artifacts, jobs, or server-side state.

Authentication and API standards

To be completely explicit: Galaxy visualizations never have access to Galaxy API keys.
They authenticate exclusively via the user session and operate within normal visualization permission boundaries. There is no mechanism for a visualization to receive, store, or transmit an API key.

The query-parameter auth path was not used by the visualization at all. It was only reachable from the standalone CLI when GALAXY_KEY is explicitly provided via environment variables, and is never invoked by the Vue plugin or Galaxy-served execution.

That said, I agree this pattern is confusing to keep in the same repo. I opened a follow-up PR that removes query-parameter auth entirely and switches the CLI to header-based authentication:

galaxyproject/galaxy-visualizations#145

Architectural fit with ChatGXY

It is worth clarifying up front that the API traversal itself is not AI-driven. The traversal is a deterministic, read-only walk over Galaxy APIs. The use of LLMs is limited to optional summarization of already-collected data and is not required for the visualization to function.

Polaris was designed as an exploratory, self-contained visualization focused on observation and comprehension rather than execution. The core logic is written in Python, is UI-less, and produces structured Markdown output to keep traversal and reasoning explicit and inspectable.

While the concrete API traversal here is intentionally client-side, the declarative structure and approach are meant to be reusable conceptually if similar capabilities are later exposed server-side.

A reasonable framing is:

  • near term: Polaris remains a bounded client-side visualization with conservative limits
  • longer term: if lineage traversal becomes a shared server-side capability, Polaris can evolve into a consumer or renderer of that output

Long-term sustainability

The Pyodide-based approach is a deliberate tradeoff to keep logic transparent, versioned with the visualization, and non-privileged. The maintenance cost is local to the plugin and does not add burden to Galaxy core. If lineage logic later moves server-side, Polaris naturally simplifies.

Overall, I believe the visualization provides value today with clear safeguards in place, and the follow-up changes address the concerns you raised while keeping the PR scoped and reviewable.

@guerler guerler requested review from mvdbeek and removed request for bgruening January 17, 2026 18:19
@mvdbeek
Member

mvdbeek commented Jan 18, 2026

> • traversal depth capped at 10
> • each level capped at 10 connected entities

that is still way too much. 5 requests are probably the maximum we can do without affecting the rest of the interface, 100 is not workable.

> It is worth clarifying up front that the API traversal itself is not AI-driven. The traversal is a deterministic, read-only walk over Galaxy APIs. The use of LLMs is limited to optional summarization of already-collected data and is not required for the visualization to function.

that only strengthens the case for making the API a core galaxy feature

> near term: Polaris remains a bounded client-side visualization with conservative limits

that i want to see 😆

@guerler
Contributor Author

guerler commented Jan 18, 2026

@mvdbeek Sounds good to me. Thanks again for the review. I added a hard cap of 25 total traversal fetch requests in galaxyproject/galaxy-visualizations#146. I agree that lineage-style traversal ultimately belongs behind a core API endpoint, at which point the client-side traversal can be dropped entirely.
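
A hard cap on total fetches can be expressed as a small request budget. The sketch below is illustrative only and does not reproduce the change in galaxy-visualizations#146; `RequestBudget`, `RequestBudgetExceeded`, and `fetch_all` are hypothetical names, and `fetch` stands in for a single API request.

```python
from typing import Callable, Dict, Iterable


class RequestBudgetExceeded(RuntimeError):
    """Raised when a traversal tries to exceed its total request allowance."""


class RequestBudget:
    def __init__(self, max_requests: int = 25) -> None:
        self.max_requests = max_requests
        self.used = 0

    def spend(self) -> None:
        """Account for one request, or raise if the cap is already reached."""
        if self.used >= self.max_requests:
            raise RequestBudgetExceeded(f"cap of {self.max_requests} requests reached")
        self.used += 1


def fetch_all(
    ids: Iterable[str],
    fetch: Callable[[str], dict],
    budget: RequestBudget,
) -> Dict[str, dict]:
    """Fetch entities in order until the ids run out or the budget is spent."""
    results: Dict[str, dict] = {}
    for entity_id in ids:
        try:
            budget.spend()
        except RequestBudgetExceeded:
            break  # stop cleanly and return the partial lineage
        results[entity_id] = fetch(entity_id)
    return results
```

Because the budget counts requests globally rather than per level, the total stays bounded regardless of how depth and fan-out limits combine.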

@jmchilton
Member

> I agree that lineage-style traversal ultimately belongs behind a core API endpoint

I would go further and say this functionality is central to what we should be doing in Galaxy, both in terms of what we state as our core mission and in terms of what Anton and Jenn state at every team meeting as very pragmatic core concerns. I think it should not be a visualization; it should be a Vue component that we can integrate natively into the UI. It should appear as a link or view anywhere we display dataset metadata, not buried in a list of visualization frameworks. That work is harder, requires a tougher review process, and will likely be bikeshedded, so getting such a change in through the viz framework is much easier and more streamlined, and I get why you're going this route. I just want to advocate for a long-term, more deeply integrated approach.

I'm happy to use the viz framework as a test bed for these things but I think it belongs in native Galaxy longer term (echoing my comments on #20882 but even more so here I think).

@guerler guerler marked this pull request as draft January 23, 2026 11:43
@guerler
Contributor Author

guerler commented Jan 26, 2026

I think we all agree that lineage traversal is core Galaxy functionality and should ultimately live behind a native API and UI, not as a visualization. With that in mind, I am shifting focus to implementing a core History Graph API endpoint. Please let me know if there is already work or concrete plans in this area, otherwise I will open a focused issue to track it.


3 participants