Skip to content

Create a new merge() function, and use that for concatenate #3183

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

hyanwong
Copy link
Member

@hyanwong hyanwong commented May 29, 2025

Description

It appears as if union wasn't really the right underlying function to use for concatenate, e.g. it doesn't deal as expected with sites above samples, so I had to roll my own merge utility.

Fixes #3181

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

Copy link

codecov bot commented May 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.76%. Comparing base (9d2ff8e) to head (294dd89).
Report is 2 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (9d2ff8e) and HEAD (294dd89). Click for more details.

HEAD has 6 uploads less than BASE
Flag BASE (9d2ff8e) HEAD (294dd89)
python-tests 6 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3183       +/-   ##
===========================================
- Coverage   98.91%   86.76%   -12.16%     
===========================================
  Files          17       11        -6     
  Lines        7542    24427    +16885     
  Branches     1258     4615     +3357     
===========================================
+ Hits         7460    21193    +13733     
- Misses         33     1853     +1820     
- Partials       49     1381     +1332     
Flag Coverage Δ
c-tests 86.66% <ø> (?)
lwt-tests 80.38% <ø> (?)
python-c-tests 88.18% <ø> (?)
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 28 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@petrelharp
Copy link
Contributor

I strongly vote for a more descriptive name than merge, since (as we saw with union) a generic name can lead to wrong expectations. Why not union_edges or merge_edges? (this also merges eg mutations but we could think of mutations as being carried along with the edge they're on)

I also wonder if we actually want the TreeSequence method, as these kinds of operations are best done on tables?

Might you be able to write down an explicit list of what exactly gets added, in the style of union?

Other things needed for the docs:

  • a warning that the result might not be a legal tree sequence
  • more explicit description of the arguments in the TreeSequence method

@petrelharp
Copy link
Contributor

We also need some tests that use the WF-with-overlapping-gens sims (and so thus might have some samples that are parents of others, etc) - I think we have this built in to tsutil.

@hyanwong
Copy link
Member Author

Thanks @petrelharp

I strongly vote for a more descriptive name than merge

Yep, I can see that argument. I don't think of this as purely edge-based, however: it also merges mutations, sites, migrations, etc.

I also wonder if we actually want the TreeSequence method

The tree sequence method comes along for free, but we could easily 'nix it. We tend to expose these table methods to tree-sequence space too, like union, so it seemed remiss not to?

a warning that the result might not be a legal tree sequence

I think it always will be, because of the sort() call? Otherwise that will fail, I think.

more explicit description of the arguments in the TreeSequence method

Good catch: I forgot that and have now added it.

@hyanwong
Copy link
Member Author

p.s. I set add_populations=False in the concatenate method, overriding the default, as I expect concatenate to be used e.g. on multiple chromosomes with the same population definitions, and it would be confusing to create new populations for each chromosome, I feel.

Defaulting to add_populations=True for the merge method seems OK though, as this is more of a low-level thing, and matches union.

@hyanwong
Copy link
Member Author

Might you be able to write down an explicit list of what exactly gets added, in the style of union?

Good idea. I'll do that, and add some WF-with-overlapping-gens type tests (I have hacked a few with internal samples in there already, but I can add one or two more)

@hyanwong hyanwong force-pushed the concat-fix branch 2 times, most recently from 9bded44 to 8296e36 Compare May 29, 2025 19:32
@hyanwong
Copy link
Member Author

hyanwong commented May 29, 2025

Added the following list, and a couple of WF tests for concatenate. We still need to agree on a different name than "merge" though...

Items that will be ported from ``other`` into ``self`` (and given new IDs) are:

1. All edges in ``other``
2. All migrations in ``other``
3. All mutations in ``other``
4. Sites whose positions are new to ``self``
5. Individuals whose nodes are new to ``self``.
6. If ``add_populations=True``, populations whose nodes are new to ``self``

@hyanwong hyanwong force-pushed the concat-fix branch 2 times, most recently from 3dce552 to 4b7b5b2 Compare May 29, 2025 20:15
@hyanwong
Copy link
Member Author

hyanwong commented May 29, 2025

We still need to agree on a different name than "merge" though

Here's a suggestion for reducing confusion. I implement a 'union' like part to the function in this PR which skips merging edges if they are identical. Then we call this one union_edges and rename the existing union to union_nodes. We keep the name union as a deprecated (but permanently maintained) alias to union_nodes (we could or could not document it, I suppose)

It appears as if `union` wasn't really the right underlying function to use for `concatenate`, e.g. it doesn't deal as expected with sites above samples, so I had to roll my own `merge` utility.
@petrelharp
Copy link
Contributor

a warning that the result might not be a legal tree sequence
I think it always will be, because of the sort() call? Otherwise that will fail, I think.

Yes, that's the point - I'm imagining a sentence after the list of things that will be added which says "If this results in an invalid tree sequence, this will produce an error."

raise ValueError("Non-matching populations")
individual_map = {}
population_map = {}
for new_node in np.where(node_map == tskit.NULL)[0]:
Copy link
Contributor

@petrelharp petrelharp May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should check that the nodes in the node map are identical? So something like

for oj, sj in enumerate(node_map):
    if sj != tskit.NULL:
        if self.node(sj) != other.node(oj):
            raise ValueError("Nodes do not match...")
    else:
          # continue as below

Copy link
Member Author

@hyanwong hyanwong May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I guess this could be an optional check though. I can imagine that there might be different node metadata sometimes. Perhaps the check_populations should be check_objects instead? Then we could also check that overlapping sites have identical metadata, etc.

@petrelharp
Copy link
Contributor

self.edges.append(
e.replace(child=node_map[e.child], parent=node_map[e.parent])
)
site_map = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could save some trouble here by doing

ns = self.sites.num_rows
self.sites.append_columns(**other.sites.asdict())
for mut in other.mutations:
    self.mutations.append(
        mut.replace(
            node=node_map[mut.node],
            site=mut.site + ns,
            parent=tskit.NULL
        )
# then after the self.sort()
self.deduplicate_sites()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I forgot about deduplicate_sites, thanks.

While I'm at it, is there a way of figuring out quickly if edges already exist in the tree sequence? There's no deduplicate_edges functionality, or similar, right? And sorting edges with duplicates will simply error out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm; no, not unsorted. See my proposal in the main thread - I think we should not worry about the cases where some edges are shared (unless we have a use case?).

@petrelharp
Copy link
Contributor

Items that will be ported from other into self (and given new IDs) are:

  1. All edges in other
  2. All migrations in other
  3. All mutations in other
  4. Sites whose positions are new to self
  5. Individuals whose nodes are new to self.
  6. If add_populations=True, populations whose nodes are new to ``self`

This list is missing the new nodes thenselves? (those for which node_mapping is NULL); otherwise, looks good!

@petrelharp
Copy link
Contributor

A general question is how much checking we do: do we check for equality of nodes that are the same? Of sites? (they could have metadata) If we allow overlapping stuff, what about mutations?

I think by default we do check it - the checking with union is very annoying but in nearly all cases where it fails because of the check_shared_overlap, the thing the person was trying to do was not right.

@petrelharp
Copy link
Contributor

About the name - I do think this could be called "edgewise" (even though it does mutations and migrations) because those could be thought of as living on edges?

I guess I'm up for the renaming of union to union_node (should it be union_nodes?), although maybe it's confusing because it adds on more than just nodes, just as this one adds on more than just edges. I need to think about this.

@petrelharp
Copy link
Contributor

Thinking out loud here: what's the difference between this and union? (I'll call "this" merge here.)
Comparing union to the list of what's added here:

  • merge will add edges between existing nodes; union won't.
  • merge will add mutations above existing nodes; union won't.
  • merge and union add the same individuals; neither adds new individuals whose nodes are all shared between the two.

Note in particular that merge does strictly more than union.

So: merge just basically goes ahead and adds everything (except not all individuals); whereas union only adds stuff related to new nodes. The reason for this is the different use cases:

  • union is designed for the case where they share history entirely past some point but have diverged more recently;
  • merge is designed for the case when they describe dijoint sections of genome.

What would union do, abstractly? Well, if a set A is the disjoint union of X and Y, and B is the disjoint union of Y and Z; then unioning A and B adds Z to A; so a fully general tree sequence union would need to figure out what Y is (ie what's the intersection). This is in general a hard (at least annoying) problem and we don't have use cases for the general case, so we're not trying to do this. Instead, we're specializing to these two special cases of union, where (union) Y is "attributes of and relationships between a given set of existing nodes"; and (merge) Y is "a given set of existing nodes".

So - given that merge does strictly more than union, and otherwise they are quite similar, it seems like a good idea to just add a new argument to union which toggles the behavior of merge. Probably something qutie simple like ts.union(other, all_edges=True, all_mutations=True)? These would turn off check_shared_equality. (I thought about calling it disjoint=True; or disjoint_edges=True or some such, since we're basically assuming that the edges to be disjoint, but this is more straightforward.) I think this would be straightforward: just changing this condition and this condition.

How does this sound?

@petrelharp
Copy link
Contributor

Note: I found another current use of union: https://tskit.dev/pyslim/docs/latest/vignette_continuing.html#continuing-the-simulation; this one could be done with either union or merge.

@hyanwong
Copy link
Member Author

hyanwong commented Jun 1, 2025

How does this sound?

I like the idea of not creating more confusion by similarly named functions, so something like that SGTM.

I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current union too?

@petrelharp
Copy link
Contributor

I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current union too?

Good question. I believe the current union does this by check_shared_overlap (but we can't use that for the new option).

Want me to have a go at the C version?

@hyanwong
Copy link
Member Author

hyanwong commented Jun 1, 2025

Want me to have a go at the C version?

I would love it if you have time, thanks so much.

I'm happy to modify my python tests to check it works as intended.

This was referenced Jun 10, 2025
@hyanwong hyanwong marked this pull request as draft June 11, 2025 12:17
@hyanwong
Copy link
Member Author

I converted this to draft, as @petrelharp 's suggestion is different, but we should use the tests from this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

span-wise union
2 participants