Skip to content

Commit 92418ba

Browse files
committed
Two-way two-locus statistics (python proto)
This PR implements two-way LD statistics, specified between sample sets. During the development of this functionality, a number of issues with the designation of state_dims/result_dims were discovered. These have been resolved, testing clean for existing code and providing the proper behavior for this new code. The mechanism by which users will specify a multi-population (or two-way) statistic is by providing the `indexes` argument. This helps us avoid creating another `ld_matrix` method for the TreeSequence object. In other words, for a one-way statistic, a user would specify: ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2]) ``` Which would output a 3D array containing one LD matrix per sample set. ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=(0, 1)) ``` Will output a 2D array containing one LD matrix for the index pair. This would use our `D2_ij_summary_func`, instead of the `D2_summary_func`. Finally, ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1), (1, 1)]) ``` will output a 3D array containing one LD matrix _per_ index pair provided. Tests have been added to validate the result dimension for all of the possible input combinations. Since these are two-way statistics, the indexes must be length 2. We plan on enabling users to implement k-way via a "general_stat" api. We did not implement anything more than two-way statistics here because of the combinatoric explosion of logic required for indexes > 2. I added some basic tests to demonstrate that things were working properly. If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. This test does not work on unbiased statistics unless the sample sets being tested are equal in index. I've added another test for two-way unbiased statistics. I've also cleaned up the docstrings a bit and fixed a bug with the D_prime statistic, which should not be weighted by haplotype frequency.
1 parent 9d2ff8e commit 92418ba

File tree

3 files changed

+456
-82
lines changed

3 files changed

+456
-82
lines changed

c/tests/test_stats.c

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -2870,8 +2870,7 @@ test_two_site_correlated_multiallelic(void)
28702870
double truth_D2[4] = { 0.023844603634269844, 0.02384460363426984,
28712871
0.02384460363426984, 0.02384460363426984 };
28722872
double truth_r2[4] = { 1, 1, 1, 1 };
2873-
double truth_D_prime[4] = { 0.7777777777777777, 0.4444444444444444,
2874-
0.4444444444444444, 0.6666666666666666 };
2873+
double truth_D_prime[4] = { 0, -0.5, -0.5, 0 };
28752874
double truth_r[4] = { 0.18377223398316206, -0.12212786219416509,
28762875
-0.12212786219416509, 0.2609542781331212 };
28772876
double truth_Dz[4] = { 0.0033870175616860566, 0.003387017561686057,
@@ -3003,8 +3002,8 @@ test_two_site_uncorrelated_multiallelic(void)
30033002
double truth_D[4] = { 0.05555555555555555, 0.0, 0.0, 0.05555555555555555 };
30043003
double truth_D2[4] = { 0.024691358024691357, 0.0, 0.0, 0.024691358024691357 };
30053004
double truth_r2[4] = { 1, 0, 0, 1 };
3006-
double truth_D_prime[4] = { 0.6666666666666665, 0.0, 0.0, 0.6666666666666665 };
3007-
double truth_r[4] = { 0.24999999999999997, 0.0, 0.0, 0.24999999999999997 };
3005+
double truth_D_prime[4] = { 0.0, 0.0, 0.0, 0.0 };
3006+
double truth_r[4] = { 0.25, 0.0, 0.0, 0.25 };
30083007
double truth_Dz[4] = { 0.0, 0.0, 0.0, 0.0 };
30093008
double truth_pi2[4] = { 0.04938271604938272, 0.04938271604938272,
30103009
0.04938271604938272, 0.04938271604938272 };

c/tskit/trees.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4231,7 +4231,7 @@ tsk_treeseq_D_prime(const tsk_treeseq_t *self, tsk_size_t num_sample_sets,
42314231
{
42324232
options |= TSK_STAT_POLARISED; // TODO: allow user to pick?
42334233
return tsk_treeseq_two_locus_count_stat(self, num_sample_sets, sample_set_sizes,
4234-
sample_sets, num_sample_sets, NULL, D_prime_summary_func, norm_hap_weighted,
4234+
sample_sets, num_sample_sets, NULL, D_prime_summary_func, norm_total_weighted,
42354235
num_rows, row_sites, row_positions, num_cols, col_sites, col_positions, options,
42364236
result);
42374237
}

0 commit comments

Comments
 (0)