-
Notifications
You must be signed in to change notification settings - Fork 77
Create a new merge() function, and use that for concatenate #3183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3183 +/- ##
===========================================
- Coverage 98.91% 86.76% -12.16%
===========================================
Files 17 11 -6
Lines 7542 24427 +16885
Branches 1258 4615 +3357
===========================================
+ Hits 7460 21193 +13733
- Misses 33 1853 +1820
- Partials 49 1381 +1332
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
I strongly vote for a more descriptive name than I also wonder if we actually want the TreeSequence method, as these kinds of operations are best done on tables? Might you be able to write down an explicit list of what exactly gets added, in the style of Other things needed for the docs:
|
We also need some tests that use the WF-with-overlapping-gens sims (and so thus might have some samples that are parents of others, etc) - I think we have this built in to |
Thanks @petrelharp
Yep, I can see that argument. I don't think of this as purely edge-based, however: it also merges mutations, sites, migrations, etc.
The tree sequence method comes along for free, but we could easily 'nix it. We tend to expose these table methods to tree-sequence space too, like union, so it seemed remiss not to?
I think it always will be, because of the
Good catch: I forgot that and have now added it. |
p.s. I set Defaulting to |
Good idea. I'll do that, and add some WF-with-overlapping-gens type tests (I have hacked a few with internal samples in there already, but I can add one or two more) |
9bded44
to
8296e36
Compare
Added the following list, and a couple of WF tests for
|
3dce552
to
4b7b5b2
Compare
Here's a suggestion for reducing confusion. I implement a 'union' like part to the function in this PR which skips merging edges if they are identical. Then we call this one |
It appears as if `union` wasn't really the right underlying function to use for `concatenate`, e.g. it doesn't deal as expected with sites above samples, so I had to roll my own `merge` utility.
Yes, that's the point - I'm imagining a sentence after the list of things that will be added which says "If this results in an invalid tree sequence, this will produce an error." |
raise ValueError("Non-matching populations") | ||
individual_map = {} | ||
population_map = {} | ||
for new_node in np.where(node_map == tskit.NULL)[0]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should check that the nodes in the node map are identical? So something like
for oj, sj in enumerate(node_map):
if sj != tskit.NULL:
if self.node(sj) != other.node(oj):
raise ValueError("Nodes do not match...")
else:
# continue as below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I guess this could be an optional check though. I can imagine that there might be different node metadata sometimes. Perhaps the check_populations
should be check_objects
instead? Then we could also check that overlapping sites have identical metadata, etc.
TODO: put this in https://tskit.dev/tskit/docs/latest/python-api.html#modification |
self.edges.append( | ||
e.replace(child=node_map[e.child], parent=node_map[e.parent]) | ||
) | ||
site_map = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could save some trouble here by doing
ns = self.sites.num_rows
self.sites.append_columns(**other.sites.asdict())
for mut in other.mutations:
self.mutations.append(
mut.replace(
node=node_map[mut.node],
site=mut.site + ns,
parent=tskit.NULL
)
# then after the self.sort()
self.deduplicate_sites()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I forgot about deduplicate_sites, thanks.
While I'm at it, is there a way of figuring out quickly if edges already exist in the tree sequence? There's no deduplicate_edges functionality, or similar, right? And sorting edges with duplicates will simply error out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm; no, not unsorted. See my proposal in the main thread - I think we should not worry about the cases where some edges are shared (unless we have a use case?).
This list is missing the new nodes thenselves? (those for which |
A general question is how much checking we do: do we check for equality of nodes that are the same? Of sites? (they could have metadata) If we allow overlapping stuff, what about mutations? I think by default we do check it - the checking with |
About the name - I do think this could be called "edgewise" (even though it does mutations and migrations) because those could be thought of as living on edges? I guess I'm up for the renaming of |
Co-authored-by: Peter Ralph <[email protected]>
Thinking out loud here: what's the difference between this and
Note in particular that So:
What would So - given that How does this sound? |
Note: I found another current use of |
I like the idea of not creating more confusion by similarly named functions, so something like that SGTM. I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current |
Good question. I believe the current union does this by Want me to have a go at the C version? |
I would love it if you have time, thanks so much. I'm happy to modify my python tests to check it works as intended. |
I converted this to draft, as @petrelharp 's suggestion is different, but we should use the tests from this one. |
Description
It appears as if
union
wasn't really the right underlying function to use forconcatenate
, e.g. it doesn't deal as expected with sites above samples, so I had to roll my ownmerge
utility.Fixes #3181
PR Checklist: