Determining Ml score threshold for quantitative comparison of methylation levels across two datasets at the same loci #372

Open
ngamarra opened this issue Feb 10, 2025 · 2 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@ngamarra

Hello!

I am attempting to compare DNA adenine methylation across two datasets at the same loci, using R10.4 data we have generated under two experimental conditions. We expect the methylation levels to differ substantially between the two datasets, but we want a reasonably accurate quantitative estimate of the difference. In our analysis we apply an automatic threshold to determine "true" methylation calls.

I have been told that determining the optimal threshold is not trivial and is highly sensitive to sequencing run quality, and I have been advised to use modkit's automatic threshold estimation. However, I am worried that this thresholding may be sensitive to the total signal in the dataset and could therefore distort comparisons between the datasets. Would it be more appropriate to threshold the data with a fixed threshold or with a data-informed threshold (specifically modkit's), especially if we expect big differences between the datasets?

@ArtRand
Contributor

ArtRand commented Feb 11, 2025

Hello @ngamarra,

Do you expect that one of the samples will have very few methylated adenine residues?

What you could do is determine the values on a sampling of both conditions combined. That way I expect you would see enough true 6mA sites to estimate a single threshold value, either automatically or manually with modkit sample-probs. Sampling the reads can be somewhat non-trivial, however. I think that samtools view does a pretty good job: you could sample from both conditions, merge, and pipe the result through modkit summary and/or modkit sample-probs. Happy to advise if you can share some plots.
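The sample-and-merge idea could be sketched roughly as below. This is only an illustration that assembles command lines and does not run them. The file names, read counts, target sample size, and the helper names `subsample_arg` and `build_commands` are all hypothetical; the read counts are assumed to come from something like `samtools view -c`. The only tool options used are `samtools view -b -s SEED.FRACTION -o`, `samtools merge -f`, and a bare `modkit sample-probs` call.

```python
def subsample_arg(seed, n_target, n_total):
    """Value for `samtools view -s` (SEED.FRACTION), or None to keep every read."""
    frac = n_target / n_total
    if frac >= 0.9999:
        return None  # file has no more reads than the target, so use it whole
    frac = max(frac, 0.0001)  # avoid the fraction rounding down to zero
    return f"{seed}." + f"{frac:.4f}"[2:]


def build_commands(bam_a, n_a, bam_b, n_b, n_per_condition, seed=42):
    """Assemble (not run) commands that take a similar number of reads from
    each condition, merge them, and inspect the pooled probabilities."""
    cmds, inputs = [], []
    for label, bam, n_total in (("a", bam_a, n_a), ("b", bam_b, n_b)):
        s = subsample_arg(seed, n_per_condition, n_total)
        if s is None:
            inputs.append(bam)
            continue
        out = f"sample_{label}.bam"  # hypothetical output name
        cmds.append(["samtools", "view", "-b", "-s", s, "-o", out, bam])
        inputs.append(out)
    cmds.append(["samtools", "merge", "-f", "merged_sample.bam", *inputs])
    cmds.append(["modkit", "sample-probs", "merged_sample.bam"])
    return cmds


# Hypothetical inputs: condition A has 100,000 reads, condition B has 20,000.
for cmd in build_commands("cond_a.bam", 100_000, "cond_b.bam", 20_000, 20_000):
    print(" ".join(cmd))
```

Sampling a similar number of reads from each condition, rather than the same fraction, keeps the larger dataset from dominating the pooled distribution. Each list could be passed to `subprocess.run(cmd, check=True)` once the paths are real.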

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Feb 11, 2025
@ngamarra
Author

Thanks @ArtRand !

Estimating on a merged sample of both conditions makes sense as a way to find a threshold that works well across both samples. I just wondered, though, whether a constant threshold could distort things if one condition has very few methylated adenines. We don't know for sure how low the signal will be in an absolute sense, but we expect it to be quite low. I assumed that modkit determines the threshold from the population distribution of ML scores in the BAM file, in which case I feel like noise could dominate the choice for the low-methylation condition. Estimating on a merged sample would seemingly help overcome this issue.
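The concern above can be illustrated with a toy simulation. It assumes, as I understand the modkit documentation, that the automatic threshold is a low percentile (10th by default) of per-call confidence. The Beta distributions, sample sizes, and modification fractions below are made up for illustration and are not fitted to real R10.4 data, and the function names are hypothetical rather than part of modkit.

```python
import random


def simulate_calls(n_calls, mod_fraction, rng):
    """Hypothetical per-call 6mA probabilities for one sample."""
    probs = []
    for _ in range(n_calls):
        if rng.random() < mod_fraction:
            probs.append(rng.betavariate(5, 2))   # truly modified: high but noisy
        else:
            probs.append(rng.betavariate(1, 20))  # canonical: confidently low
    return probs


def confidence(p):
    # Confidence of a two-class call: probability of the more likely class.
    return max(p, 1.0 - p)


def percentile_threshold(probs, q=0.1):
    """Confidence value below which the lowest q fraction of calls fall."""
    conf = sorted(confidence(p) for p in probs)
    return conf[int(q * len(conf))]


def called_fraction(probs, threshold):
    """Fraction of passing calls that are called modified (p > 0.5)."""
    kept = [p for p in probs if confidence(p) >= threshold]
    return sum(p > 0.5 for p in kept) / len(kept)


rng = random.Random(0)
low = simulate_calls(20_000, 0.01, rng)   # assumed low-methylation condition
high = simulate_calls(20_000, 0.30, rng)  # assumed high-methylation condition

thr_low = percentile_threshold(low)
thr_high = percentile_threshold(high)
thr_pooled = percentile_threshold(low + high)

for name, probs, own in (("low", low, thr_low), ("high", high, thr_high)):
    print(name, "own threshold:", round(own, 3),
          "fraction (own):", round(called_fraction(probs, own), 4),
          "fraction (pooled):", round(called_fraction(probs, thr_pooled), 4))
```

In this toy model the two per-sample thresholds differ, so a call with the same probability could pass in one condition and be filtered in the other. The pooled threshold falls between the two and is applied identically to both samples.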
