Dorado for basecalling vs modification detection #354

Open
baibhav-bioinfo opened this issue Jan 27, 2025 · 2 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@baibhav-bioinfo

Hello everyone,
I am new to nanopore direct RNA sequencing (DRS) data and am still figuring out the different file formats and tools.

"Dorado basecaller" can be used to basecall sequence reads from the raw pod5 files produced by nanopore sequencing machines.

"Dorado basecaller" can also be used to detect modifications in the sequences with some extra command arguments.

  1. Since both commands produce a "calls.bam" file, what is the difference between the two outputs?
    Is the only difference the presence or absence of modification information?

  2. If I convert the BAM files into FASTQ files to get the actual sequence reads, will the two be the same?

@marcus1487
Contributor

Modified bases are output as a set of BAM/SAM tags. Details about these tags and how to run Dorado for modified base detection can be found in the Dorado documentation.
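
For orientation, the calls live in two tags defined by the SAM tags specification: MM lists which bases carry a call (as skip counts over occurrences of the canonical base), and ML gives one 0-255 scaled probability per call. Below is a minimal Python sketch of decoding them, assuming an unaligned, forward-strand read and a single modification code per MM entry. The function name `decode_mm_ml` and the example values are made up for illustration and are not part of Dorado.

```python
def decode_mm_ml(seq, mm, ml):
    """Map each modification code to a list of (position in seq, ML value).

    seq: read sequence as stored in the BAM record
    mm:  value of the MM tag, e.g. "A+a,0,1;"
    ml:  values of the ML tag as a list of ints (0-255)
    Assumes forward-strand entries with one modification code each.
    """
    calls = {}
    ml_cursor = 0  # ML values are consumed in the same order as MM calls
    for entry in mm.split(";"):
        if not entry:
            continue
        head, *deltas = entry.split(",")
        base, strand = head[0], head[1]
        code = head[2:].rstrip(".?")  # drop the optional '.' or '?' flag
        if strand != "+":
            raise ValueError("this sketch only handles '+' strand entries")
        # Positions of the canonical base that the skip counts refer to.
        base_positions = [i for i, b in enumerate(seq) if b == base]
        index = -1
        decoded = []
        for delta in deltas:
            index += int(delta) + 1  # skip `delta` occurrences, take the next
            decoded.append((base_positions[index], ml[ml_cursor]))
            ml_cursor += 1
        calls[code] = decoded
    return calls


# Made-up example: calls on the first and third A of the read.
example_calls = decode_mm_ml("ACGATA", "A+a,0,1;", [230, 12])
print(example_calls)
```

An ML value of 230 corresponds to a probability of roughly 230/256, so the first call here would be a confident one and the second would not.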

Here are the key points to address your question:

  • Canonical Basecalling vs. Modified Base Detection: Modified base calls are generated after canonical basecalling is complete, meaning the sequence field in the BAM file will be identical in both cases.
  • FASTQ Conversion: The FASTQ format does not natively support BAM/SAM tags, so modified base calls are lost in a plain conversion. However, you can carry the tags over into the FASTQ header lines with the samtools fastq -T "*" command (quote the * so the shell does not expand it). Note that downstream support for such files depends on the tool in question.
  • Recommendation: Whenever possible, we recommend using tools within the Dorado ecosystem that can directly process BAM files with modified base calls.
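
To make the first two points concrete, here is a small Python sketch, assuming SAM text records (as printed by samtools view) and a FASTQ header written by samtools fastq -T "*", where the tags follow the read name as tab-separated SAM-style fields. The records and helper names (`parse_tag`, `split_sam_record`, `split_fastq_header`) are invented and heavily simplified; real Dorado output carries more tags.

```python
def parse_tag(raw):
    """Split a SAM-style tag such as 'ML:B:C,230,12' into (tag, (type, value))."""
    tag, typ, value = raw.split(":", 2)
    return tag, (typ, value)


def split_sam_record(line):
    """Return (read name, sequence, tags) from one SAM text line."""
    fields = line.rstrip("\n").split("\t")
    name, seq = fields[0], fields[9]  # column 10 holds the sequence
    tags = dict(parse_tag(raw) for raw in fields[11:])  # optional tags follow
    return name, seq, tags


def split_fastq_header(header):
    """Return (read name, tags) from a FASTQ header with tab-separated tags."""
    name, *raw_tags = header.rstrip("\n")[1:].split("\t")  # [1:] drops '@'
    return name, dict(parse_tag(raw) for raw in raw_tags)


# Made-up unaligned records for the same read from the two kinds of run.
core = ["read1", "4", "*", "0", "0", "*", "*", "0", "0", "ACGATA", "(((((("]
canonical = "\t".join(core)
modified = "\t".join(core + ["MM:Z:A+a,0,1;", "ML:B:C,230,12"])

_, seq_canonical, tags_canonical = split_sam_record(canonical)
_, seq_modified, tags_modified = split_sam_record(modified)

# Same sequence field; only the tags differ.
print(seq_canonical == seq_modified)
print(sorted(set(tags_modified) - set(tags_canonical)))

# Header as it might look after a tag-preserving FASTQ conversion.
fastq_name, fastq_tags = split_fastq_header("@read1\tMM:Z:A+a,0,1;\tML:B:C,230,12")
print(fastq_name, fastq_tags)
```

A tool that reads only the name, sequence, and quality lines sees the same reads either way; the modification calls matter only to tools that look at the tags.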

@marcus1487 marcus1487 added the question Looking for clarification on inputs and/or outputs label Jan 27, 2025
@baibhav-bioinfo
Author

baibhav-bioinfo commented Jan 27, 2025

I have run the "modified basecalling" because I am conducting an m6A analysis.

  1. If I want FASTQ files of just the sequences (for other analyses such as DEG analysis), can I use the BAM files from the same run, or do I have to run canonical Dorado basecalling separately?
  2. How does Dorado basecalling behave with the poly(A) tail at the end of each read? I want either to keep the poly(A) tail exactly as it was in the DRS read or to remove it completely. I do not want the poly(A) tails to be basecalled partially. Is there any way to do that?
  3. Are there any papers that have used Dorado for m6A calling that I could use as a template for my study? Please let me know.
