Skip to content

[26.0] Fix get_structure when it is passed a DCE with nested elements#22453

Open
mvdbeek wants to merge 2 commits intogalaxyproject:release_26.0from
mvdbeek:diagnose-keyerror
Open

[26.0] Fix get_structure when it is passed a DCE with nested elements#22453
mvdbeek wants to merge 2 commits intogalaxyproject:release_26.0from
mvdbeek:diagnose-keyerror

Conversation

@mvdbeek
Copy link
Copy Markdown
Member

@mvdbeek mvdbeek commented Apr 10, 2026

DatasetCollectionElement has two collection references:

  • collection: the parent collection this element belongs to
  • child_collection: the nested collection this element contains

get_structure() and walk_collections() unconditionally used .collection,
which for a DCE returns the parent (e.g. the full list:paired). Meanwhile
the job expansion correctly used child_collection (e.g. the paired
sub-collection). This mismatch caused a KeyError when the parent had more
elements than the child.

Bug scenario: given a list:paired with 3 pairs, mapping a single DCE
(one pair with forward+reverse) over a simple tool input:

get_structure(dce, "paired" desc, leaf_subcollection_type=None)

The old code did dce.collection → parent list:paired DC (3 elements),
building Tree("paired", 3 children). But the expansion used
dce.child_collection → paired DC (2 elements), creating 2 jobs.
precreate_dataset_collection iterated the 3 structure children and
accessed completed_collection[2], raising KeyError since only indices
0 and 1 existed.

Extract a _get_collection() helper that uses isinstance checks to return
child_collection for DCE and collection for HDCA/adapters, and use it in
both get_structure() and walk_collections().

Fixes #22450 and probably #22423 and #19554

Note how test_can_map_over_dce_on_non_multiple_data_param actually tested the broken behavior, this has been verifiably broken since #19900, though the broken behavior predates the PR. A pair mapped over a single input should create 2 jobs.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

… input

Add test_can_map_over_dce_from_larger_list_paired that reproduces the
KeyError when mapping a DCE (from a list:paired with 3+ pairs) over a
tool with a simple data input. The parent list collection structure was
incorrectly used instead of the child paired collection, causing an
index mismatch.

Also update test_can_map_over_dce_on_non_multiple_data_param to expect
2 jobs (one per dataset in the pair) instead of 1.

Fixes GALAXY-MAIN-4KSCZZZ0014G1
@github-actions github-actions bot added area/testing area/database Galaxy's database or data access layer area/testing/api labels Apr 10, 2026
@github-actions github-actions bot added this to the 26.1 milestone Apr 10, 2026
@mvdbeek mvdbeek force-pushed the diagnose-keyerror branch from b698478 to ba8ab68 Compare April 10, 2026 08:18
DatasetCollectionElement has two collection references:
- collection: the parent collection this element belongs to
- child_collection: the nested collection this element contains

get_structure() builds a Tree describing a collection's shape.  It
needs two things that must agree: a collection_type_description and
a DatasetCollection to enumerate elements from.

matching.for_collections() derived the type description from the
child collection (via get_child_collection), but get_structure()
unconditionally used instance.collection — which for a DCE is the
*parent*.  This mismatch went unnoticed when both happened to have
compatible element counts, but crashed when they differed.

Bug scenario: given a list:paired with 3 pairs, mapping a single DCE
(one pair with forward+reverse) over a simple tool input:

  get_structure(dce, "paired" desc, leaf_subcollection_type=None)

The old code did dce.collection → parent list:paired DC (3 elements),
building Tree("paired", 3 children). But the expansion used
dce.child_collection → paired DC (2 elements), creating 2 jobs.
precreate_dataset_collection iterated the 3 structure children and
accessed completed_collection[2], raising KeyError since only indices
0 and 1 existed.

Fix: add an explicit ``collection`` parameter to get_structure() so
callers that already resolved the contained collection (via
get_child_collection) can pass it directly.  matching.for_collections
now passes the resolved collection, keeping type description and
element source in sync.

Also add get_collection() helper (like get_child_collection in
matching.py but with isinstance checks) and use it in
walk_collections() so the walked collection matches the structure.
@mvdbeek mvdbeek marked this pull request as draft April 10, 2026 13:52
@ahmedhamidawan ahmedhamidawan moved this from Needs Review to In Progress in Galaxy Dev - weeklies Apr 14, 2026
@mvdbeek mvdbeek force-pushed the diagnose-keyerror branch from ba8ab68 to 27ca320 Compare April 14, 2026 16:28
@mvdbeek mvdbeek marked this pull request as ready for review April 14, 2026 17:21
@mvdbeek
Copy link
Copy Markdown
Member Author

mvdbeek commented Apr 14, 2026

This seems to make sense to me now, @jmchilton could you take a look ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/database Galaxy's database or data access layer area/testing/api area/testing

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants