Add Polaris Dataset Lineage Visualization #21607
guerler wants to merge 4 commits into galaxyproject:dev
|
Thanks for this contribution! I've reviewed the Polaris codebase in the galaxy-visualizations
Architecture Overview
Polaris uses a client-side Python agent (compiled to WebAssembly via Pyodide) that:
Concerns
1. API Request Volume
The traversal logic makes one API call per entity fetched (datasets and jobs). In the worst case with default limits, this could be ~800
2. Architectural Fit with ChatGXY Framework
Given the new AI Agent Framework (PR #21434), I wonder if the lineage traversal
A
3. Long-term Sustainability
The current architecture adds:
This creates a maintenance burden parallel to the core agent infrastructure.
Questions
Suggestions
I don't think this should block the PR - the visualization provides value and the bounds are reasonable. But I'd like to see a path |
|
that's claude btw, i absolutely think we can't hammer servers with 800 requests |
|
I also don't know how I feel about complex API interactions hosted in a repo that doesn't follow the same standards as galaxy. https://github.com/galaxyproject/galaxy-visualizations/blob/cb45e9e2ac1c6f72fcbc316936ada19fa597dc30/packages/polaris/polaris/polaris/modules/api/galaxy.py#L71 exposes the key in most server logs, that's not a good pattern, use the header as a better practice. |
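To illustrate the header-based pattern suggested here, the following is a minimal sketch rather than the Polaris implementation. It assumes the Galaxy API accepts the key in an `x-api-key` request header; the function name `build_galaxy_request`, the host, and the dataset id are hypothetical.

```python
from urllib.request import Request


def build_galaxy_request(base_url: str, path: str, api_key: str) -> Request:
    """Build a GET request that carries the API key in a header, not the URL.

    Hypothetical helper for illustration only.
    """
    url = f"{base_url.rstrip('/')}/{path.lstrip('/')}"
    # The key travels in the x-api-key header (assumed header name), so it is
    # not part of the request line that servers and proxies commonly log.
    return Request(url, headers={"x-api-key": api_key}, method="GET")


# Placeholder host, dataset id, and key; nothing is sent over the network here.
req = build_galaxy_request("https://galaxy.example.org", "/api/datasets/abc123", "SECRET")
print(req.full_url)
```

With a `?key=...` query parameter the secret would be part of `req.full_url`; here the URL stays free of credentials.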
|
@mvdbeek thanks for the detailed review. I appreciate the careful focus on operational impact and long-term alignment.
API request volume
You are right that the traversal is N+1 by design. The intent is to keep this read-only, client-side, and bounded so it can ship as a visualization without introducing new server-side execution paths or privileged access. Based on your feedback, I tightened the limits further:
These limits are fixed rather than user-configurable to keep behavior predictable on public instances. I will document the expected call volume and am happy to reduce bounds further if needed. This is also a transient visualization. Visualizations cannot be saved by design, so the traversal leaves no persistent artifacts, jobs, or server-side state.
Authentication and API standards
To be completely explicit: Galaxy visualizations never have access to Galaxy API keys. The query-parameter auth path was not used by the visualization at all. It was only reachable from the standalone CLI when
That said, I agree this pattern is confusing to keep in the same repo. I opened a follow-up PR that removes query-parameter auth entirely and switches the CLI to header-based authentication: galaxyproject/galaxy-visualizations#145
Architectural fit with ChatGXY
It is worth clarifying up front that the API traversal itself is not AI-driven. The traversal is a deterministic, read-only walk over Galaxy APIs. The use of LLMs is limited to optional summarization of already-collected data and is not required for the visualization to function. Polaris was designed as an exploratory, self-contained visualization focused on observation and comprehension rather than execution. The core logic is written in Python, is UI-less, and produces structured Markdown output to keep traversal and reasoning explicit and inspectable. While the concrete API traversal here is intentionally client-side, the declarative structure and approach are meant to be reusable conceptually if similar capabilities are later exposed server-side. A reasonable framing is:
Long-term sustainability
The Pyodide-based approach is a deliberate tradeoff to keep logic transparent, versioned with the visualization, and non-privileged. The maintenance cost is local to the plugin and does not add burden to Galaxy core. If lineage logic later moves server-side, Polaris naturally simplifies. Overall, I believe the visualization provides value today with clear safeguards in place, and the follow-up changes address the concerns you raised while keeping the PR scoped and reviewable. |
that is still way too much. 5 requests are probably the maximum we can do without affecting the rest of the interface, 100 is not workable.
that only strengthens the case for making the API a core galaxy feature
that i want to see 😆 |
|
@mvdbeek Sounds good to me. Thanks again for the review. I added a hard cap of 25 total traversal fetch requests in galaxyproject/galaxy-visualizations#146. I agree that lineage-style traversal ultimately belongs behind a core API endpoint, at which point the client-side traversal can be dropped entirely. |
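A hard cap on total fetches can be sketched as a small wrapper around whatever function performs the request. This is an illustrative assumption, not the code in galaxyproject/galaxy-visualizations#146: the names `BudgetedFetcher` and `FetchBudgetExceeded` are hypothetical, and the response cache is an extra design choice so that repeated lookups of the same entity do not consume budget.

```python
from typing import Any, Callable, Dict


class FetchBudgetExceeded(RuntimeError):
    """Raised when a traversal would exceed its total request cap."""


class BudgetedFetcher:
    """Wrap a fetch function with a hard cap on the number of real calls."""

    def __init__(self, fetch: Callable[[str], Any], max_requests: int = 25) -> None:
        self._fetch = fetch
        self.max_requests = max_requests
        self.used = 0
        self._cache: Dict[str, Any] = {}

    def get(self, url: str) -> Any:
        # Cached responses are returned without spending budget.
        if url in self._cache:
            return self._cache[url]
        # Refuse before issuing the request, so the cap is never exceeded.
        if self.used >= self.max_requests:
            raise FetchBudgetExceeded(f"request cap of {self.max_requests} reached")
        self.used += 1
        result = self._fetch(url)
        self._cache[url] = result
        return result


# Demo with a fake fetch function and a small cap; no network access involved.
fetcher = BudgetedFetcher(lambda url: {"url": url}, max_requests=3)
for i in range(3):
    fetcher.get(f"/api/datasets/{i}")
fetcher.get("/api/datasets/0")  # served from cache, budget unchanged

stopped = False
try:
    fetcher.get("/api/datasets/3")  # a fourth distinct fetch exceeds the cap
except FetchBudgetExceeded:
    stopped = True
print(fetcher.used, stopped)
```

Raising an exception is one option; a traversal could equally catch it and render the partial graph collected so far.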
I would go farther and say this functionality is very central to what we should be doing in Galaxy - both in terms of what we state as our core mission and in terms of what Anton and Jenn state at every team meeting as very pragmatic core concerns. I think it should not be a visualization - I think it should be a Vue component that we can integrate natively into the UI. It should appear as a link or view anywhere we are displaying dataset metadata - not buried in a list of visualization frameworks. That work is harder and requires a tougher review process and will likely be bike-shedded in ways that make the viz framework way more easy and streamlined to get such a change in - so I get why you're going this route for sure. I just want to advocate for a long-term, more deeply integrated approach. I'm happy to use the viz framework as a test bed for these things but I think it belongs in native Galaxy longer term (echoing my comments on #20882 but even more so here I think). |
|
I think we all agree that lineage traversal is core Galaxy functionality and should ultimately live behind a native API and UI, not as a visualization. With that in mind, I am shifting focus to implementing a core History Graph API endpoint. Please let me know if there is already work or concrete plans in this area, otherwise I will open a focused issue to track it. |
Adds Polaris, a Galaxy visualization for exploring dataset lineage.
Polaris renders a focused, read-only view of a selected dataset and recursively traverses its inputs and outputs up to a limited depth. It helps users understand how a dataset was produced, how it is connected within a history, and which jobs and intermediate datasets contribute to it.
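The depth-limited walk over inputs and outputs can be sketched on a small in-memory graph. This is a simplified illustration under stated assumptions, not Polaris's code: the `LINEAGE` dictionary and the `walk_lineage` function are hypothetical, jobs are omitted so that datasets link directly to datasets, and the walk is written iteratively (breadth-first) rather than recursively.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage data: each dataset lists the datasets it was derived
# from ("inputs") and the datasets derived from it ("outputs").
LINEAGE: Dict[str, Dict[str, List[str]]] = {
    "raw": {"inputs": [], "outputs": ["trimmed"]},
    "trimmed": {"inputs": ["raw"], "outputs": ["aligned"]},
    "aligned": {"inputs": ["trimmed"], "outputs": ["variants"]},
    "variants": {"inputs": ["aligned"], "outputs": []},
}


def walk_lineage(
    graph: Dict[str, Dict[str, List[str]]], start: str, max_depth: int
) -> Dict[str, int]:
    """Return {dataset_id: hops from start} for everything within max_depth."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # do not expand past the depth limit
        # Follow both directions: upstream inputs and downstream outputs.
        for neighbor in graph[node]["inputs"] + graph[node]["outputs"]:
            if neighbor not in seen:
                seen[neighbor] = depth + 1
                queue.append(neighbor)
    return seen


print(walk_lineage(LINEAGE, "trimmed", max_depth=1))
# → {'trimmed': 0, 'raw': 1, 'aligned': 1}
```

Raising `max_depth` to 2 would also reach `variants`, two hops downstream of `trimmed`.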