Filtering bedmethyl file and DMR analysis #364

Open
baibhav-bioinfo opened this issue Feb 3, 2025 · 6 comments
Labels
DMR (modkit dmr), question (Looking for clarification on inputs and/or outputs)

Comments

baibhav-bioinfo commented Feb 3, 2025

Hello,
I am using modkit to analyse the results from Dorado.

(1) I have generated the bedMethyl file from the BAM file. Now I need filter criteria for "coverage" and "mod_rate" to get rid of noisy predictions.

Can we directly filter on the "Nvalid_cov" column at >= 20 reads, or do we need to normalise it per million reads?
(2) For differential methylation analysis between conditions I am using dmr pair with the following command:
modkit dmr pair -a c6_r1.bed.gz -a c6_r2.bed.gz -a c6_r3.bed.gz -b dr6_r1.bed.gz -b dr6_r2.bed.gz -b dr6_r3.bed.gz -o dmr_result --ref Genome.fa --base A --threads 96 --log-filepath dmr_result.log

(i) I wonder how modkit builds the unified list of sites from both conditions with replicates.
(ii) How does modkit handle sites that are present in one condition and not in the other?
(iii) Also, what kind of test does modkit apply to get the DMR sites?

Thanks

baibhav-bioinfo changed the title from "Filtering bedmethyl file result based on coverage" to "Filtering bedmethyl file and DMR analysis" Feb 4, 2025
ArtRand (Contributor) commented Feb 4, 2025

Hello @baibhav-bioinfo,

> (1) I have generated the bedMethyl file from the BAM file. Now I need filter criteria for "coverage" and "mod_rate" to get rid of noisy predictions.

You don't actually need to filter your input data for DMR. The model won't assign a high score or significant p-value to sites with very low coverage. You can find details about the model in the documentation. That being said, you may want to simply ignore positions with low valid coverage so you don't have them in the output; there is a --min-valid-coverage option for that.
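
If you would still like to pre-filter a bedMethyl file yourself instead of relying on --min-valid-coverage, here is a minimal Python sketch. It assumes the usual modkit pileup column layout, in which the 10th column is Nvalid_cov; the function name `filter_bedmethyl` and the example rows are made up for illustration, so please check the column order against your own files.

```python
def filter_bedmethyl(lines, min_valid_cov=20):
    """Yield bedMethyl records whose Nvalid_cov is at least min_valid_cov."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()  # works for tab- or space-delimited columns
        n_valid_cov = int(fields[9])  # assumption: column 10 is Nvalid_cov
        if n_valid_cov >= min_valid_cov:
            yield line


# Two illustrative rows (not real data): valid coverage 25 and 5.
records = [
    "chr1\t100\t101\ta\t25\t+\t100\t101\t255,0,0\t25\t40.00\t10\t15\t0\t0\t0\t0\t0",
    "chr1\t200\t201\ta\t5\t+\t200\t201\t255,0,0\t5\t20.00\t1\t4\t0\t0\t0\t0\t0",
]
kept = list(filter_bedmethyl(records))
```

For a gzipped file, the same function can be fed lines from `gzip.open(path, "rt")`.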

> Can we directly filter on the "Nvalid_cov" column at >= 20 reads, or do we need to normalise it per million reads?

You do not have to perform any normalization; however, there are --max-coverages and --cap-coverages options if you have very imbalanced data. With your command, the replicates are matched (meaning you have 3 of each), so you will see the balanced output as well.

> (i) I wonder how modkit builds the unified list of sites from both conditions with replicates.

A site must be present in at least 1 replicate from each condition.

> (ii) How does modkit handle sites that are present in one condition and not in the other?

If a site is not present in any of the replicates in one condition, it will not be scored (there's nothing to compare!).

> (iii) Also, what kind of test does modkit apply to get the DMR sites?

You can find the details of the model in the documentation.
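
The site-selection rule above (a site is scored only if it appears in at least one replicate of each condition) can be pictured as a union within each condition followed by an intersection across conditions. The sketch below only illustrates that rule and is not modkit's actual code; `scored_sites` and the site tuples are hypothetical.

```python
def scored_sites(condition_a, condition_b):
    """Return sites seen in >= 1 replicate of A and >= 1 replicate of B.

    Each condition is a list of sets, one set of site keys per replicate.
    """
    sites_a = set().union(*condition_a)  # present in any replicate of A
    sites_b = set().union(*condition_b)  # present in any replicate of B
    return sites_a & sites_b  # only these sites can be compared


# Illustrative site keys: (contig, position, strand).
a_reps = [{("chr1", 100, "+"), ("chr1", 200, "+")}, {("chr1", 100, "+")}]
b_reps = [{("chr1", 100, "+")}, {("chr1", 300, "+")}]
shared = scored_sites(a_reps, b_reps)
```

In this toy input, position 200 is seen only in condition A and position 300 only in condition B, so neither would be scored.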

ArtRand added the question (Looking for clarification on inputs and/or outputs) and DMR (modkit dmr) labels Feb 4, 2025
baibhav-bioinfo (Author) commented Feb 4, 2025

(1) What if we have more reads in one sample than in the other? Then the --min-valid-coverage cutoff might be biased towards the sample with more overall depth. So isn't it better to normalise?

(2) As you said, the comparison is only done if a site is present in at least one replicate of both conditions. What about the sites that are present in only one condition? Those should be interesting to see too.

ArtRand (Contributor) commented Feb 7, 2025

Hello @baibhav-bioinfo,

> (1) What if we have more reads in one sample than in the other? Then the --min-valid-coverage cutoff might be biased towards the sample with more overall depth. So isn't it better to normalise?

I think the "balanced MAP-based p-value" and "balanced effect size" are similar to what you're looking for. I've described how this works in another issue. If one replicate has low valid coverage, you don't really want it to have an equal influence on the overall scoring of a position since there's likely always going to be some sampling bias. By comparing the two values as @kylepalos has done here, you may find some positions that should be investigated.
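
To make the difference concrete, here is a small sketch of how a low-coverage replicate can pull the two kinds of estimates apart. It rests on assumptions that are not spelled out in this thread: that the default values pool counts across replicates (so deeper replicates weigh more), that the balanced values give every replicate equal weight, and that the effect size is a difference in modified fraction. The function names are hypothetical and this is not modkit's implementation.

```python
def pooled_fraction(replicates):
    """Modified fraction after summing (n_mod, n_valid) counts over replicates."""
    n_mod = sum(m for m, _ in replicates)
    n_valid = sum(v for _, v in replicates)
    return n_mod / n_valid


def balanced_fraction(replicates):
    """Mean of per-replicate modified fractions (each replicate weighs the same)."""
    fractions = [m / v for m, v in replicates if v > 0]
    return sum(fractions) / len(fractions)


# Illustrative (n_mod, n_valid) counts at one site; the second replicate
# of each condition has low valid coverage.
cond_a = [(18, 20), (2, 4)]
cond_b = [(5, 20), (1, 4)]

pooled_effect = pooled_fraction(cond_a) - pooled_fraction(cond_b)
balanced_effect = balanced_fraction(cond_a) - balanced_fraction(cond_b)
```

Sites where the two numbers disagree strongly are the ones where a single replicate is driving the result, which is the kind of position worth a closer look.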

> (2) As you said, the comparison is only done if a site is present in at least one replicate of both conditions. What about the sites that are present in only one condition? Those should be interesting to see too.

These positions aren't output right now. But I agree that you may want to see them. For example, maybe there is a C>D event that drops a site out of a replicate or condition. I'll see about adding these sites to the output.
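
In the meantime, the condition-specific sites can be pulled out of the bedMethyl files directly. The sketch below is one possible way, assuming the usual bedMethyl layout (chrom, start, end, modification code, score, strand, ..., with Nvalid_cov in column 10); `read_sites`, `condition_exclusive_sites`, and the `row` helper are hypothetical names used only for this example.

```python
def read_sites(lines, min_valid_cov=1):
    """Return the set of (chrom, start, strand, mod_code) keys in one bedMethyl file."""
    sites = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split()
        if int(f[9]) >= min_valid_cov:  # assumption: column 10 is Nvalid_cov
            sites.add((f[0], int(f[1]), f[5], f[3]))
    return sites


def condition_exclusive_sites(reps_a, reps_b, min_valid_cov=1):
    """Return (only_in_a, only_in_b), pooling all replicates of each condition."""
    sites_a = set().union(*(read_sites(r, min_valid_cov) for r in reps_a))
    sites_b = set().union(*(read_sites(r, min_valid_cov) for r in reps_b))
    return sites_a - sites_b, sites_b - sites_a


def row(chrom, start, cov):
    # Hypothetical helper: builds a minimal 10-column bedMethyl-like row.
    return f"{chrom}\t{start}\t{start + 1}\ta\t{cov}\t+\t{start}\t{start + 1}\t255,0,0\t{cov}"


reps_a = [[row("chr1", 100, 12), row("chr1", 200, 8)], [row("chr1", 100, 9)]]
reps_b = [[row("chr1", 100, 15)], [row("chr1", 300, 7)]]
only_a, only_b = condition_exclusive_sites(reps_a, reps_b)
```

Note that raising `min_valid_cov` also treats low-coverage sites as absent, so a site can show up as condition-specific simply because the other condition was shallow there.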

baibhav-bioinfo (Author) commented

Thank you so much for the detailed reply.

So, if I want to analyse the sites that are present in only one of the conditions, can I use another method manually, such as edgeR, in a way that supports our modification data?

ArtRand (Contributor) commented Feb 7, 2025

@baibhav-bioinfo

> if I want to analyse the sites that are present in only one of the conditions

Do you mean looking for intra-condition variability? I.e. differentially methylated regions between replicates? You can use dmr multi for that.

ArtRand (Contributor) commented Feb 18, 2025

Hello @baibhav-bioinfo,

Another user discovered a bug where, if some samples don't have alignments to a contig, the whole contig fails. I have posted a build on that issue.
