Skip to content

Unexpected union behaviour: edges dropped if tables have all identical nodes #3168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hyanwong opened this issue May 20, 2025 · 9 comments
Closed

Comments

@hyanwong
Copy link
Member

hyanwong commented May 20, 2025

I think this should produce a table with edges spanning 0..2, but instead, the unioned TS is missing 0..1

import tskit
import numpy as np

ts = tskit.Tree.generate_comb(2, span=2).tree_sequence
# Cut it up
ts1 = ts.keep_intervals([[0, 1]])
ts2 = ts.delete_intervals([[0, 1]])
node_mapping = np.arange(ts.num_nodes)
# Paste back together
ts_both = ts2.union(ts1, node_mapping=node_mapping, check_shared_equality=False)
ts_edges_squashed = ts_both.simplify()
print(ts_both.draw_text())
assert ts.equals(ts_edges_squashed, ignore_provenance=True)  # fails

Here's what the result of union looks like

1.00┊     ┊  2  ┊
    ┊     ┊ ┏┻┓ ┊
0.00┊ 0 1 ┊ 0 1 ┊
    0     1     2
@petrelharp
Copy link
Contributor

aw crap, this does look like a bug

@hyanwong
Copy link
Member Author

hyanwong commented May 20, 2025

I would try to suggest a fix, but it's in the C code I guess, so 😱

@hyanwong
Copy link
Member Author

hyanwong commented May 21, 2025

By the way, passing a node mapping of np.arange(num_nodes) is presumably how you would make a single tree sequence out of a set with identical (shared) node tables e.g. stored from a multi-chromosome SLiM simulation, and presumably there is a trivial way to do this, so perhaps this is worth special-casing?

@hyanwong hyanwong added the bug Something isn't working label May 23, 2025
@hyanwong
Copy link
Member Author

It would be nice to fix this. Does e.g. @mufernando have any idea what the source of this bug could be?

@petrelharp
Copy link
Contributor

petrelharp commented May 29, 2025

Two notes: first, passing node_mapping=[0,1,-1] succeeds (but creates a new node). Second, this behavior is as expected from the documentation:

Returns an expanded tree sequence which contains the node-wise union of self and other,
obtained by adding the non-shared portions of other onto self. The “shared” portions are specified
using a map that specifies which nodes in other are equivalent to those in self: the node_mapping
argument should be an array of length equal to the number of nodes in other and whose entries 
are the ID of the matching node in self, or tskit.NULL if there is no matching node. Those nodes in 
other that map to tskit.NULL will be added to self, along with:
    Individuals whose nodes are new to self.
    Edges whose parent or child are new to self.
    Mutations whose nodes are new to self.
    Sites which were not present in self, if the site contains a newly added mutation.

Note that in this situation, the edges we'd like to have added have neither parent nor child new. The check happens here (that's the python test reference implementation).

I think if we were to extend union to cover this situation, we'd say like

Edges whose parent or child are new to self, or that cover a span on which the child node has no parent.

(?). However, I'm not seeing an easy way to adjust the logic there - note that in the current implementation we don't have to build trees. We certainly could figure this out, but I'm wondering if we just want a different function for this use case - as you say - which is "paste together tree sequences describing different chnks of the genome".

@petrelharp petrelharp removed the bug Something isn't working label May 29, 2025
@petrelharp
Copy link
Contributor

I'm wondering if we just want a different function for this use case - as you say - which is "paste together tree sequences describing different chnks of the genome".

Also note that doing ts1.union(ts2, ...) with node_mapping only mapping over the samples is not way to do this in general because some samples might be parents of other samples.

@hyanwong
Copy link
Member Author

Ah, good catch @petrelharp, thanks. Good that it's not a bug, but I do find this somewhat unexpected behaviour for something called union. Perhaps it's worth adding a note to the docs, if we do want to keep the current union behaviour?

In my case, I guess the way for me to work around this is to identify any edges between the indexes passed into the node_mapping that do not have a NULL entry, and simply add those by hand:

used_nodes = np.where(node_mapping != tskit.NULL)
omitted = np.logical_and(np.isin(other.edges.child, used_nodes), np.isin(other.edges.parent, used_nodes))
for e in np.where(omitted)[0]:
    new_tables.edges.append(other.edges[e])
new_tables.sort()

petrelharp added a commit to petrelharp/tskit that referenced this issue May 29, 2025
@petrelharp
Copy link
Contributor

Have a look at #3181, @hyanwong?

@hyanwong hyanwong changed the title Bug in union: edges dropped if tables have all identical nodes ~Bug in~ union: edges dropped if tables have all identical nodes May 29, 2025
@hyanwong hyanwong changed the title ~Bug in~ union: edges dropped if tables have all identical nodes Unexpected union behaviour: edges dropped if tables have all identical nodes May 29, 2025
@hyanwong
Copy link
Member Author

So this is not a bug, but a misunderstanding about what union does. See #3183 (comment) and #3183 (comment) for suggestions to clarify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants