Two-way two-locus statistics (python proto)

lkirk · lkirk · commit 92418ba6ee8f · 2025-06-13T17:03:36.000-05:00
This PR implements two-way LD statistics, specified between sample sets.
During the development of this functionality, a number of issues with
the designation of state_dims/result_dims were discovered. These have
been resolved, testing clean for existing code and providing the proper
behavior for this new code.

The mechanism by which users will specify a multi-population (or
two-way) statistic is by providing the `indexes` argument. This helps us
avoid creating another `ld_matrix` method for the TreeSequence object.

In other words, for a one-way statistic, a user would specify:
```python
ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2])
```
Which would output a 3D array containing one LD matrix per sample set.

```python
ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=(0, 1))
```
Will output a 2D array containing one LD matrix for the index
pair. This would use our `D2_ij_summary_func`, instead of the
`D2_summary_func`. Finally,
```python
ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1), (1, 1)])
```
will output a 3D array containing one LD matrix _per_ index pair
provided.

Tests have been added to validate the result dimension for all of the
possible input combinations.

Since these are two-way statistics, the indexes must be length 2. We
plan on enabling users to implement k-way via a "general_stat" api. We did
not implement anything more than two-way statistics here because of the
combinatoric explosion of logic required for indexes &gt; 2.

I added some basic tests to demonstrate that things were working
properly. If we compute two-way statistics on identical sample sets,
they should be equal to the one-way statistics. This test does not work
on unbiased statistics unless the sample sets being tested are equal in
index. I've added another test for two-way unbiased statistics.

I've also cleaned up the docstrings a bit and fixed a bug with the
D_prime statistic, which should not be weighted by haplotype frequency.
diff --git a/c/tests/test_stats.c b/c/tests/test_stats.c
@@ -2870,8 +2870,7 @@ test_two_site_correlated_multiallelic(void)
     double truth_D2[4] = { 0.023844603634269844, 0.02384460363426984,
         0.02384460363426984, 0.02384460363426984 };
     double truth_r2[4] = { 1, 1, 1, 1 };
-    double truth_D_prime[4] = { 0.7777777777777777, 0.4444444444444444,
-        0.4444444444444444, 0.6666666666666666 };
+    double truth_D_prime[4] = { 0, -0.5, -0.5, 0 };
     double truth_r[4] = { 0.18377223398316206, -0.12212786219416509,
         -0.12212786219416509, 0.2609542781331212 };
     double truth_Dz[4] = { 0.0033870175616860566, 0.003387017561686057,
@@ -3003,8 +3002,8 @@ test_two_site_uncorrelated_multiallelic(void)
     double truth_D[4] = { 0.05555555555555555, 0.0, 0.0, 0.05555555555555555 };
     double truth_D2[4] = { 0.024691358024691357, 0.0, 0.0, 0.024691358024691357 };
     double truth_r2[4] = { 1, 0, 0, 1 };
-    double truth_D_prime[4] = { 0.6666666666666665, 0.0, 0.0, 0.6666666666666665 };
-    double truth_r[4] = { 0.24999999999999997, 0.0, 0.0, 0.24999999999999997 };
+    double truth_D_prime[4] = { 0.0, 0.0, 0.0, 0.0 };
+    double truth_r[4] = { 0.25, 0.0, 0.0, 0.25 };
     double truth_Dz[4] = { 0.0, 0.0, 0.0, 0.0 };
     double truth_pi2[4] = { 0.04938271604938272, 0.04938271604938272,
         0.04938271604938272, 0.04938271604938272 };
diff --git a/c/tskit/trees.c b/c/tskit/trees.c
@@ -4231,7 +4231,7 @@ tsk_treeseq_D_prime(const tsk_treeseq_t *self, tsk_size_t num_sample_sets,
 {
     options |= TSK_STAT_POLARISED; // TODO: allow user to pick?
     return tsk_treeseq_two_locus_count_stat(self, num_sample_sets, sample_set_sizes,
-        sample_sets, num_sample_sets, NULL, D_prime_summary_func, norm_hap_weighted,
+        sample_sets, num_sample_sets, NULL, D_prime_summary_func, norm_total_weighted,
         num_rows, row_sites, row_positions, num_cols, col_sites, col_positions, options,
         result);
 }
diff --git a/python/tests/test_ld_matrix.py b/python/tests/test_ld_matrix.py

Original file line number	Diff line number	Diff line change
`@@ -4231,7 +4231,7 @@ tsk_treeseq_D_prime(const tsk_treeseq_t *self, tsk_size_t num_sample_sets,`
`4231`	`4231`	`{`
`4232`	`4232`	`options \|= TSK_STAT_POLARISED; // TODO: allow user to pick?`
`4233`	`4233`	`return tsk_treeseq_two_locus_count_stat(self, num_sample_sets, sample_set_sizes,`
`4234`		`- sample_sets, num_sample_sets, NULL, D_prime_summary_func, norm_hap_weighted,`
	`4234`	`+ sample_sets, num_sample_sets, NULL, D_prime_summary_func, norm_total_weighted,`
`4235`	`4235`	`num_rows, row_sites, row_positions, num_cols, col_sites, col_positions, options,`
`4236`	`4236`	`result);`
`4237`	`4237`	`}`