Does Dorado give high false positive rate for 6mA? #1234

Open
Pombeboy opened this issue Jan 28, 2025 · 6 comments
Labels
mods For issues related to modified base calling

Comments

@Pombeboy

Hi ONT community,

I am doing some 6mA detection after in situ DNA methylation in an organism that is supposed to have no 6mA. However, after peak calling, my control sample (without 6mA enzyme treatment) showed a very high percentage of A bases being called as modified 6mA. Have you experienced something like this before? Thanks.

Best,
J

@malton-ont
Collaborator

Hi @Pombeboy,

What do you mean by "showed very high percentage of A being called as modified 6mA"? Does this simply include all bases marked in the MM/ML tags, or are you applying any filtering?

Dorado will output modification probabilities for any A-base that matches the context (for DRACH models) or for any A-base where the modification probability is above the specified threshold (default 5%) in all-context models. If you want to increase this threshold (for all-context only), you can pass the --modified-bases-threshold [0..1] parameter to the basecaller.
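For example, something like this (a sketch only - the model complex and paths are placeholders; check dorado basecaller --help for your version):

```bash
# Basecall with a 6mA all-context model and only emit modification
# probabilities above 0.10 in the MM/ML tags (model/paths are placeholders).
dorado basecaller hac,6mA pod5s/ \
    --modified-bases-threshold 0.10 \
    > calls.bam
```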

@malton-ont added the mods label Jan 29, 2025
@Pombeboy
Author

Hi Malton-ONT,

Thanks for your message. I will use different thresholds for basecalling.

But for my sample that shouldn't contain 6mA, individual reads still contain quite a lot of A bases showing a high likelihood of methylation, such as >80% or even 1.0. I am new to this - is this normal? The top track is the sample treated with the 6mA enzyme, while the bottom one is the sample that shouldn't contain any 6mA. But as you can see, with the default settings a lot of 6mA is called in the untreated sample.

I assume that we would usually need high coverage so that, in a uniform sample, we can tell whether a given base is modified or not. In this case, though, I assume the efficiency of the 6mA methylation is not 100%, so I can't rely on coverage to gain confidence about whether a particular A is modified. Instead, I would rely on the basecall likelihood on individual reads to tell whether a region is frequently modified. But under the current circumstances, because the unmodified sample shows a high likelihood of 6mA in each read, the background becomes really high. I wonder whether you have any suggestions for dealing with this? Thank you!

Threshold: 0.05 (Top: modified sample, Bottom: unmodified)

Image

Threshold: 0.7 (Top: modified sample, Bottom: unmodified. As you can see, the background is very high for individual reads)

Image

Best,
Pombeboy

@ArtRand

ArtRand commented Feb 1, 2025

Hello @Pombeboy,

Could you tell me which basecaller/modification models you used?

A couple things you might want to try:

  1. In IGV, use the 2-color mode so that the unmodified A bases are also indicated. I've found that zooming can make the visualization misleading if the unmodified bases aren't highlighted as well.

  2. Try using modkit call-mods (see the modkit docs) with a high(er) modification threshold for 6mA (something like --mod-threshold a:0.95); there's a sketch below. You can also filter the modification probabilities in IGV.
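For example, something along these lines (a sketch only - file names are placeholders, and the exact flag spelling may differ between modkit versions, so check modkit call-mods --help):

```bash
# Re-call modification status from the existing basecalls, requiring a
# probability of at least 0.95 before a 6mA call is accepted.
# Paths are placeholders; threshold flag spelling may vary by modkit version.
modkit call-mods calls.bam calls.filtered.bam --mod-threshold a:0.95
```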

The --modified-bases-threshold in Dorado (@malton-ont correct me if I'm wrong) only elides the confidently canonical calls as a space-saving optimization; it will not remove low-confidence modified calls.

If you have treated/untreated samples, the DMR function in Modkit will handle the fact that your exogenous 6mA won't be 100% at any given position.
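Roughly, a treated-vs-untreated comparison might look like this (a sketch only - paths are placeholders, and the flags should be checked against the modkit dmr documentation for your version):

```bash
# Aggregate per-site modification counts for each sample into bedMethyl files,
# then compare the two conditions with modkit's DMR machinery.
# All file names are placeholders.
modkit pileup treated.bam treated.bed
modkit pileup untreated.bam untreated.bed
bgzip treated.bed && tabix -p bed treated.bed.gz
bgzip untreated.bed && tabix -p bed untreated.bed.gz

modkit dmr pair \
    -a untreated.bed.gz \
    -b treated.bed.gz \
    --ref reference.fasta \
    --base A \
    -o dmr_results.bed
```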

@malton-ont
Collaborator

> The --modified-bases-threshold in Dorado (@malton-ont correct me if I'm wrong) only elides the confidently canonical calls as a space-saving optimization; it will not remove low-confidence modified calls.

I'm not clear on the difference between "space-saving optimization" and "remove low confidence calls" - if the value is below threshold then that base is marked as a skip in the MM tags and is therefore absent from the ML tag - I don't know how modkit interprets that.

dorado generates a probability for each modification in the model and therefore, implicitly, a probability for the canonical base as well. Bases are "skipped" if the modification probability is below the specified threshold - for a single-mod model like 6mA a low modification probability is equivalent to a high-confidence canonical base, but for multi-mod models (like 5mC_5hmC, for example) it would be possible for one modification to have a very low probability (and therefore be skipped) while the other remains above the threshold (e.g. m = 0.03, h = 0.85, C = 0.12 would still include the h mod but would drop the m at a 5% threshold).

@ArtRand

ArtRand commented Feb 6, 2025

Hey @malton-ont,

I think there is a potential sharp edge here.

> Bases are "skipped" if the modification probability is below the specified threshold - for a single-mod model like 6mA a low modification probability is equivalent to a high-confidence canonical base

Correct, a low probability of 6mA corresponds to a high unmodified probability, so concretely, say you have:

$$\begin{gather} P_{\text{6mA}} = 0.3 \\ P_{\text{A}} = 0.7 \end{gather}$$

Under the default settings, the probability for 6mA (0.3) would be emitted in the ML tag, but if you set --modified-bases-threshold 0.3, it will be skipped. You could say that these cases are equivalent, since the status with the highest probability is "unmodified" either way, but Modkit interprets the two cases slightly differently. If the "pass threshold" in Modkit is 0.75, which for a binary classification model is pretty normal or even a little low, the conclusion you get from this read differs:

$$\begin{gather} \textbf{modified bases threshold} \leftarrow 0.05 \\ \textbf{pass threshold} \leftarrow 0.75 \\ P_{\text{6mA}} = 0.3 \\ P_{\text{A}} = 0.7 \\ \textbf{call} \rightarrow \textbf{FAIL, below threshold} \\ \\ \textbf{modified bases threshold} \leftarrow 0.3 \\ P_{\text{A}} = 1.0 \\ \textbf{call} \rightarrow \textbf{UNMODIFIED} \end{gather}$$

Remember that when a call FAILs, we don't use it at all in the calculation of the %-modification at a genomic location.
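Roughly, as I understand the per-position aggregation, only calls that pass the threshold enter the fraction-modified calculation at a reference position $x$:

$$\%\text{modified}_x = \frac{N_{\text{mod}, x}}{N_{\text{mod}, x} + N_{\text{canonical}, x}}$$

so filtered (FAIL) calls contribute to neither the numerator nor the denominator.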

So I suppose you could use --modified-bases-threshold as a way to filter calls, e.g. if you set --modified-bases-threshold 0.7 then $P_{\text{6mA}} = 0.7$ becomes $P_{\text{A}} = 1.0$. I would argue against doing this for a couple of reasons:

  1. A low-confidence modification probability may indicate that the signal/read is noisier or of lower quality, and the best thing to do is discard that call (which is what Modkit would do). If instead you decide to clamp the probability to 0% (100% unmodified), it may skew the estimation of $p_{\text{modification}, x}$ (the probability of a read being methylated at a genomic location $x$).

  2. You can perform a similar manipulation with modkit call-mods after basecalling and modification detection, whereas if you perform the thresholding during basecalling you can't go back - the information is gone. Since basecalling tends to be more computationally expensive, I try to keep as much information as possible and then filter/transform it with simpler steps downstream (see the sketch below).
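In other words, something like the following keeps the full probability distribution in the BAM and makes the stringent pass/fail decision downstream (a sketch only - the model name, paths, and threshold are placeholders):

```bash
# Basecall with Dorado's default modified-bases threshold so that
# low-confidence calls are kept in the MM/ML tags (model/paths are placeholders).
dorado basecaller hac,6mA pod5s/ > calls.bam

# (Alignment, sorting, and indexing of calls.bam are omitted here for brevity.)
# Apply the stringent per-call threshold afterwards; this step is cheap to
# re-run with a different threshold, unlike basecalling.
modkit call-mods calls.bam calls.called.bam --mod-threshold a:0.95
modkit pileup calls.called.bam pileup.bed
```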

Lastly, in the example here:

$$\begin{gather} P_{\text{5hmC}} = 0.85 \\ P_{\text{5mC}} = 0.03 \\ P_{\text{C}} = 0.12 \\ \textbf{modified bases threshold} \leftarrow 0.05 \text{, becomes:} \\ P_{\text{5hmC}} = 0.85 \\ P_{\text{5mC}} = 0.00 \\ P_{\text{C}} = 0.15 \end{gather}$$

$P_{\text{5hmC}} = 0.85$ would still be emitted in the MM and ML tags and would likely still be the final call; eliding $P_{\text{5mC}} = 0.03$ wouldn't have much effect, other than that you wouldn't see it when doing something like modkit extract full.
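For reference, that would be something like (a sketch - paths are placeholders, and the positional arguments may differ between modkit versions):

```bash
# Dump the per-read, per-position modification probabilities to a TSV so you
# can see exactly which probabilities were emitted (paths are placeholders).
modkit extract full calls.bam per_read_probs.tsv
```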

@malton-ont
Collaborator

Thanks @ArtRand, that's a nice explanation! So I guess the advice to users here is "leave the dorado threshold alone if you're planning to filter in modkit".
