Two-way two-locus statistics (python proto) #3179

lkirk · 2025-05-28T20:54:34Z

This PR implements two-way LD statistics, specified between sample sets. During the development of this functionality, a number of issues with the designation of state_dims/result_dims were discovered. These have been resolved, testing clean for existing code and providing the proper behavior for this new code.

The mechanism by which users will specify a multi-population (or two-way) statistic is by providing the index argument. This helps us avoid creating another ld_matrix method for the TreeSequence object.

In other words, for a one-way statistic, a user would specify:

ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2])

Which would output a 3D ndarray containing one LD matrix per sample set.

ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1)])

Which would output a 2D ndarray containing one LD matrix for the index pair. This would use our D2_ij_summary_func, instead of the D2_summary_func. Finally, if a user provided

ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1), (1, 1)])

We would output a 3D ndarray containing one LD matrix per index pair provided.

Since these are two-way statistics, the indexes must be length 2. We plan on enabling users to implement k-way via a "general_stat" api. We did not implement anything more than two-way statistics here because of the combinatoric explosion of logic required for indexes > 2.

I added some basic tests to demonstrate that things were working properly. If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. Unfortunately, this does not apply to unbiased statistics, which I've validated manually.

I've also cleaned up the docstrings a bit and fixed a bug with the D_prime statistic, which should not be weighted by haplotype frequency.

Fixes #2832

lkirk · 2025-05-28T20:56:17Z

Since @apragsdale isn't in the tskit organization, pinging him here to add as reviewer.

codecov · 2025-05-28T21:08:51Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.63%. Comparing base (1c83828) to head (6274d10).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3179   +/-   ##
=======================================
  Coverage   89.63%   89.63%           
=======================================
  Files          28       28           
  Lines       32004    32004           
  Branches     5879     5879           
=======================================
  Hits        28688    28688           
  Misses       1886     1886           
  Partials     1430     1430

Flag	Coverage Δ
c-tests	`86.66% <ø> (ø)`
lwt-tests	`80.38% <ø> (ø)`
python-c-tests	`88.18% <ø> (ø)`
python-tests	`98.86% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
c/tskit/trees.c	`90.61% <ø> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

apragsdale

Thanks @lkirk.

If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. Unfortunately, this does not apply to unbiased statistics, which I've validated manually.

This is ok (even expected) for the unbiased statistics, since the formulation assumes sampling without replacement, but having overlapping sample sets would essentially copy those repeated sample nodes. I don't think we should worry about it.

python/tests/test_ld_matrix.py

apragsdale

Thanks for making that change. This looks good to me then.

jeromekelleher

I've had a look through and the code all LGTM. The changes seem sensible, but I'd need to spend a lot longer on it to really digest the details and understand the finer points.

I'm going to ping @petrelharp here for a review as he's much better at this sort of stuff

petrelharp · 2025-06-11T16:07:06Z

python/tests/test_ld_matrix.py

+    check_set_indexes(num_sample_sets, tuple_size * num_index_tuples, index_tuples)
+
+
+def ld_matrix(


this is the proposed API

petrelharp · 2025-06-11T16:14:45Z

Summary: this PR is

adding new summary functions
proposing how dimension-dropping in the result works (and fixing up previous stuff around this)

petrelharp · 2025-06-11T16:52:39Z

Input:

"one-way versus two-way" is determined by whether or not there are indexes.

sample_sets:

None: defaults to all samples
a list of samples (not allowed for two-way stats)
a list of lists of samples

indexes:

None: determines if the stat is one-way
a length-two tuple
a list of length-two tuples

sites or positions:

None (defaults to all sites)
a list of them (defaults to both lists the same)
a list of two lists of them

Output: full is a 3D array

first two dimensions are sites or positions; these are never dropped
third is indexes

The only question is whether the third dimension is dropped.

one-way: if sample_sets is None or a single list
two-way: if indexes is a single tuple

This PR implements two-way LD statistics, specified between sample sets. During the development of this functionality, a number of issues with the designation of state_dims/result_dims were discovered. These have been resolved, testing clean for existing code and providing the proper behavior for this new code. The mechanism by which users will specify a multi-population (or two-way) statistic is by providing the `indexes` argument. This helps us avoid creating another `ld_matrix` method for the TreeSequence object. In other words, for a one-way statistic, a user would specify: ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2]) ``` Which would output a 3D array containing one LD matrix per sample set. ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=(0, 1)) ``` Will output a 2D array containing one LD matrix for the index pair. This would use our `D2_ij_summary_func`, instead of the `D2_summary_func`. Finally, ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1), (1, 1)]) ``` will output a 3D array containing one LD matrix _per_ index pair provided. Tests have been added to validate the result dimension for all of the possible input combinations. Since these are two-way statistics, the indexes must be length 2. We plan on enabling users to implement k-way via a "general_stat" api. We did not implement anything more than two-way statistics here because of the combinatoric explosion of logic required for indexes > 2. I added some basic tests to demonstrate that things were working properly. If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. This test does not work on unbiased statistics unless the sample sets being tested are equal in index. I've added another test for two-way unbiased statistics. I've also cleaned up the docstrings a bit and fixed a bug with the D_prime statistic, which should not be weighted by haplotype frequency.

jeromekelleher

I'm happy to merge this as a step in the right direction - are you ok to merge @petrelharp ?

lkirk requested a review from jeromekelleher May 28, 2025 20:54

lkirk force-pushed the two-locus-multipopulation-py branch from b77bc1b to 9f8222e Compare May 29, 2025 05:24

apragsdale reviewed May 29, 2025

View reviewed changes

python/tests/test_ld_matrix.py Outdated Show resolved Hide resolved

lkirk force-pushed the two-locus-multipopulation-py branch 5 times, most recently from 834f0c2 to dc86340 Compare June 2, 2025 07:52

apragsdale reviewed Jun 2, 2025

View reviewed changes

jeromekelleher reviewed Jun 5, 2025

View reviewed changes

jeromekelleher requested a review from petrelharp June 5, 2025 11:01

petrelharp reviewed Jun 11, 2025

View reviewed changes

lkirk force-pushed the two-locus-multipopulation-py branch 2 times, most recently from 67e7f01 to 92418ba Compare June 13, 2025 22:05

lkirk force-pushed the two-locus-multipopulation-py branch from 92418ba to 6274d10 Compare June 13, 2025 22:08

jeromekelleher approved these changes Jun 14, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Two-way two-locus statistics (python proto) #3179

Two-way two-locus statistics (python proto) #3179

Uh oh!

lkirk commented May 28, 2025 •

edited

Loading

Uh oh!

lkirk commented May 28, 2025

Uh oh!

codecov bot commented May 28, 2025 •

edited

Loading

Uh oh!

apragsdale left a comment

Uh oh!

Uh oh!

apragsdale left a comment

Uh oh!

jeromekelleher left a comment

Uh oh!

petrelharp Jun 11, 2025

Uh oh!

petrelharp commented Jun 11, 2025

Uh oh!

petrelharp commented Jun 11, 2025 •

edited

Loading

Uh oh!

jeromekelleher left a comment

Uh oh!

Uh oh!

		check_set_indexes(num_sample_sets, tuple_size * num_index_tuples, index_tuples)


		def ld_matrix(

Two-way two-locus statistics (python proto) #3179

Are you sure you want to change the base?

Two-way two-locus statistics (python proto) #3179

Uh oh!

Conversation

lkirk commented May 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lkirk commented May 28, 2025

Uh oh!

codecov bot commented May 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

apragsdale left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

apragsdale left a comment

Choose a reason for hiding this comment

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

petrelharp Jun 11, 2025

Choose a reason for hiding this comment

Uh oh!

petrelharp commented Jun 11, 2025

Uh oh!

petrelharp commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lkirk commented May 28, 2025 •

edited

Loading

codecov bot commented May 28, 2025 •

edited

Loading

petrelharp commented Jun 11, 2025 •

edited

Loading