Create a new merge() function, and use that for concatenate #3183

hyanwong · 2025-05-29T17:14:10Z

Description

It appears as if union wasn't really the right underlying function to use for concatenate, e.g. it doesn't deal as expected with sites above samples, so I had to roll my own merge utility.

Fixes #3181

PR Checklist:

Tests that fully cover new/changed functionality.
Documentation including tutorial content if appropriate.
Changelogs, if there are API changes.

codecov · 2025-05-29T17:20:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.76%. Comparing base (9d2ff8e) to head (294dd89).
Report is 2 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (9d2ff8e) and HEAD (294dd89). Click for more details.

HEAD has 6 uploads less than BASE

Flag BASE (9d2ff8e) HEAD (294dd89)

python-tests 6 0

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #3183       +/-   ##
===========================================
- Coverage   98.91%   86.76%   -12.16%     
===========================================
  Files          17       11        -6     
  Lines        7542    24427    +16885     
  Branches     1258     4615     +3357     
===========================================
+ Hits         7460    21193    +13733     
- Misses         33     1853     +1820     
- Partials       49     1381     +1332

Flag	Coverage Δ
c-tests	`86.66% <ø> (?)`
lwt-tests	`80.38% <ø> (?)`
python-c-tests	`88.18% <ø> (?)`
python-tests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

see 28 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

petrelharp · 2025-05-29T17:28:32Z

I strongly vote for a more descriptive name than merge, since (as we saw with union) a generic name can lead to wrong expectations. Why not union_edges or merge_edges? (this also merges eg mutations but we could think of mutations as being carried along with the edge they're on)

I also wonder if we actually want the TreeSequence method, as these kinds of operations are best done on tables?

Might you be able to write down an explicit list of what exactly gets added, in the style of union?

Other things needed for the docs:

a warning that the result might not be a legal tree sequence
more explicit description of the arguments in the TreeSequence method

petrelharp · 2025-05-29T17:29:44Z

We also need some tests that use the WF-with-overlapping-gens sims (and so thus might have some samples that are parents of others, etc) - I think we have this built in to tsutil.

hyanwong · 2025-05-29T17:37:55Z

Thanks @petrelharp

I strongly vote for a more descriptive name than merge

Yep, I can see that argument. I don't think of this as purely edge-based, however: it also merges mutations, sites, migrations, etc.

I also wonder if we actually want the TreeSequence method

The tree sequence method comes along for free, but we could easily 'nix it. We tend to expose these table methods to tree-sequence space too, like union, so it seemed remiss not to?

a warning that the result might not be a legal tree sequence

I think it always will be, because of the sort() call? Otherwise that will fail, I think.

more explicit description of the arguments in the TreeSequence method

Good catch: I forgot that and have now added it.

hyanwong · 2025-05-29T17:40:31Z

p.s. I set add_populations=False in the concatenate method, overriding the default, as I expect concatenate to be used e.g. on multiple chromosomes with the same population definitions, and it would be confusing to create new populations for each chromosome, I feel.

Defaulting to add_populations=True for the merge method seems OK though, as this is more of a low-level thing, and matches union.

hyanwong · 2025-05-29T17:47:59Z

Might you be able to write down an explicit list of what exactly gets added, in the style of union?

Good idea. I'll do that, and add some WF-with-overlapping-gens type tests (I have hacked a few with internal samples in there already, but I can add one or two more)

hyanwong · 2025-05-29T19:34:01Z

Added the following list, and a couple of WF tests for concatenate. We still need to agree on a different name than "merge" though...

Items that will be ported from ``other`` into ``self`` (and given new IDs) are:

1. All edges in ``other``
2. All migrations in ``other``
3. All mutations in ``other``
4. Sites whose positions are new to ``self``
5. Individuals whose nodes are new to ``self``.
6. If ``add_populations=True``, populations whose nodes are new to ``self``

hyanwong · 2025-05-29T21:08:00Z

We still need to agree on a different name than "merge" though

Here's a suggestion for reducing confusion. I implement a 'union' like part to the function in this PR which skips merging edges if they are identical. Then we call this one union_edges and rename the existing union to union_nodes. We keep the name union as a deprecated (but permanently maintained) alias to union_nodes (we could or could not document it, I suppose)

It appears as if `union` wasn't really the right underlying function to use for `concatenate`, e.g. it doesn't deal as expected with sites above samples, so I had to roll my own `merge` utility.

petrelharp · 2025-05-30T13:36:17Z

a warning that the result might not be a legal tree sequence
I think it always will be, because of the sort() call? Otherwise that will fail, I think.

Yes, that's the point - I'm imagining a sentence after the list of things that will be added which says "If this results in an invalid tree sequence, this will produce an error."

python/tskit/tables.py

petrelharp · 2025-05-30T13:48:42Z

python/tskit/tables.py

+                    raise ValueError("Non-matching populations")
+        individual_map = {}
+        population_map = {}
+        for new_node in np.where(node_map == tskit.NULL)[0]:


Perhaps we should check that the nodes in the node map are identical? So something like

for oj, sj in enumerate(node_map): if sj != tskit.NULL: if self.node(sj) != other.node(oj): raise ValueError("Nodes do not match...") else: # continue as below

Yes, I guess this could be an optional check though. I can imagine that there might be different node metadata sometimes. Perhaps the check_populations should be check_objects instead? Then we could also check that overlapping sites have identical metadata, etc.

petrelharp · 2025-05-30T13:56:15Z

TODO: put this in https://tskit.dev/tskit/docs/latest/python-api.html#modification

petrelharp · 2025-05-30T13:59:51Z

python/tskit/tables.py

+            self.edges.append(
+                e.replace(child=node_map[e.child], parent=node_map[e.parent])
+            )
+        site_map = {}


You could save some trouble here by doing

ns = self.sites.num_rows self.sites.append_columns(**other.sites.asdict()) for mut in other.mutations: self.mutations.append( mut.replace( node=node_map[mut.node], site=mut.site + ns, parent=tskit.NULL ) # then after the self.sort() self.deduplicate_sites()

Aha, I forgot about deduplicate_sites, thanks.

While I'm at it, is there a way of figuring out quickly if edges already exist in the tree sequence? There's no deduplicate_edges functionality, or similar, right? And sorting edges with duplicates will simply error out.

hm; no, not unsorted. See my proposal in the main thread - I think we should not worry about the cases where some edges are shared (unless we have a use case?).

petrelharp · 2025-05-30T14:01:00Z

Items that will be ported from other into self (and given new IDs) are:

All edges in other

All migrations in other

All mutations in other

Sites whose positions are new to self

Individuals whose nodes are new to self.

If add_populations=True, populations whose nodes are new to ``self`

This list is missing the new nodes thenselves? (those for which node_mapping is NULL); otherwise, looks good!

petrelharp · 2025-05-30T14:03:33Z

A general question is how much checking we do: do we check for equality of nodes that are the same? Of sites? (they could have metadata) If we allow overlapping stuff, what about mutations?

I think by default we do check it - the checking with union is very annoying but in nearly all cases where it fails because of the check_shared_overlap, the thing the person was trying to do was not right.

petrelharp · 2025-05-30T14:05:55Z

About the name - I do think this could be called "edgewise" (even though it does mutations and migrations) because those could be thought of as living on edges?

I guess I'm up for the renaming of union to union_node (should it be union_nodes?), although maybe it's confusing because it adds on more than just nodes, just as this one adds on more than just edges. I need to think about this.

Co-authored-by: Peter Ralph <[email protected]>

petrelharp · 2025-05-31T20:18:49Z

Thinking out loud here: what's the difference between this and union? (I'll call "this" merge here.)
Comparing union to the list of what's added here:

merge will add edges between existing nodes; union won't.
merge will add mutations above existing nodes; union won't.
merge and union add the same individuals; neither adds new individuals whose nodes are all shared between the two.

Note in particular that merge does strictly more than union.

So: merge just basically goes ahead and adds everything (except not all individuals); whereas union only adds stuff related to new nodes. The reason for this is the different use cases:

union is designed for the case where they share history entirely past some point but have diverged more recently;
merge is designed for the case when they describe dijoint sections of genome.

What would union do, abstractly? Well, if a set A is the disjoint union of X and Y, and B is the disjoint union of Y and Z; then unioning A and B adds Z to A; so a fully general tree sequence union would need to figure out what Y is (ie what's the intersection). This is in general a hard (at least annoying) problem and we don't have use cases for the general case, so we're not trying to do this. Instead, we're specializing to these two special cases of union, where (union) Y is "attributes of and relationships between a given set of existing nodes"; and (merge) Y is "a given set of existing nodes".

So - given that merge does strictly more than union, and otherwise they are quite similar, it seems like a good idea to just add a new argument to union which toggles the behavior of merge. Probably something qutie simple like ts.union(other, all_edges=True, all_mutations=True)? These would turn off check_shared_equality. (I thought about calling it disjoint=True; or disjoint_edges=True or some such, since we're basically assuming that the edges to be disjoint, but this is more straightforward.) I think this would be straightforward: just changing this condition and this condition.

How does this sound?

petrelharp · 2025-06-01T00:26:32Z

Note: I found another current use of union: https://tskit.dev/pyslim/docs/latest/vignette_continuing.html#continuing-the-simulation; this one could be done with either union or merge.

hyanwong · 2025-06-01T07:59:04Z

How does this sound?

I like the idea of not creating more confusion by similarly named functions, so something like that SGTM.

I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current union too?

petrelharp · 2025-06-01T14:59:45Z

I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current union too?

Good question. I believe the current union does this by check_shared_overlap (but we can't use that for the new option).

Want me to have a go at the C version?

hyanwong · 2025-06-01T15:28:05Z

Want me to have a go at the C version?

I would love it if you have time, thanks so much.

I'm happy to modify my python tests to check it works as intended.

hyanwong · 2025-06-11T12:18:29Z

I converted this to draft, as @petrelharp 's suggestion is different, but we should use the tests from this one.

hyanwong mentioned this pull request May 29, 2025

concatenate is buggy with internal samples #3182

Open

hyanwong force-pushed the concat-fix branch from c8dad24 to 62e0b4c Compare May 29, 2025 17:28

hyanwong force-pushed the concat-fix branch 2 times, most recently from 9bded44 to 8296e36 Compare May 29, 2025 19:32

hyanwong force-pushed the concat-fix branch 2 times, most recently from 3dce552 to 4b7b5b2 Compare May 29, 2025 20:15

Create a new merge() function, and use that for concatenate

6944250

It appears as if `union` wasn't really the right underlying function to use for `concatenate`, e.g. it doesn't deal as expected with sites above samples, so I had to roll my own `merge` utility.

hyanwong force-pushed the concat-fix branch from 4b7b5b2 to 6944250 Compare May 30, 2025 07:33

petrelharp reviewed May 30, 2025

View reviewed changes

python/tskit/tables.py Outdated Show resolved Hide resolved

petrelharp reviewed May 30, 2025

View reviewed changes

Update python/tskit/tables.py

294dd89

Co-authored-by: Peter Ralph <[email protected]>

hyanwong mentioned this pull request May 31, 2025

Unexpected union behaviour: edges dropped if tables have all identical nodes #3168

Closed

This was referenced Jun 10, 2025

Document multiple chromosomes #176

Open

span-wise union #3181

Open

hyanwong marked this pull request as draft June 11, 2025 12:17

Create a new merge() function, and use that for concatenate #3183

Are you sure you want to change the base?

Create a new merge() function, and use that for concatenate #3183

Conversation

hyanwong commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

PR Checklist:

Uh oh!

codecov bot commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

petrelharp commented May 29, 2025

Uh oh!

petrelharp commented May 29, 2025

Uh oh!

hyanwong commented May 29, 2025

Uh oh!

hyanwong commented May 29, 2025

Uh oh!

hyanwong commented May 29, 2025

Uh oh!

hyanwong commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hyanwong commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petrelharp commented May 30, 2025

Uh oh!

Uh oh!

petrelharp May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hyanwong May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

petrelharp commented May 30, 2025

Uh oh!

petrelharp May 30, 2025

Choose a reason for hiding this comment

Uh oh!

hyanwong May 30, 2025

Choose a reason for hiding this comment

Uh oh!

petrelharp May 31, 2025

Choose a reason for hiding this comment

Uh oh!

petrelharp commented May 30, 2025

Uh oh!

petrelharp commented May 30, 2025

Uh oh!

petrelharp commented May 30, 2025

Uh oh!

petrelharp commented May 31, 2025

Uh oh!

petrelharp commented Jun 1, 2025

Uh oh!

hyanwong commented Jun 1, 2025

Uh oh!

petrelharp commented Jun 1, 2025

Uh oh!

hyanwong commented Jun 1, 2025

Uh oh!

hyanwong commented Jun 11, 2025

Uh oh!

Uh oh!

hyanwong commented May 29, 2025 •

edited

Loading

codecov bot commented May 29, 2025 •

edited

Loading

hyanwong commented May 29, 2025 •

edited

Loading

hyanwong commented May 29, 2025 •

edited

Loading

petrelharp May 30, 2025 •

edited

Loading

hyanwong May 30, 2025 •

edited

Loading