Add documentation for reference databases (SMR >= v4.3) #329

Open
RAWWiberg opened this issue May 13, 2022 · 11 comments

@RAWWiberg

Hi,

I'm using SortMeRNA to "clean" some RNA-seq data. I'm wondering about the "new" databases, i.e. those whose names start with smr_*. The documentation for SortMeRNA v4.3.3 is not clear on how these were generated; they seem to contain both SILVA and RFAM sequences. Could this be clarified?

Best wishes,
Axel

@ekopylova
Contributor

Hello @RAWWiberg ,

The main difference between the original databases distributed with SMR and the new ones is that updated SILVA and RFAM releases were used (e.g. SILVA 138). We took SILVA 138 SSURef NR99, SILVA 138 LSURef and the latest RFAM, and clustered them at different identity thresholds to produce the new SMR databases:

fast: bac-16S 85%, 5S & 5.8S RFAM seeds, rest 90%
default: bac-16S 90%, 5S & 5.8S RFAM seeds, rest 95%
sensitive: all 97%

All SMR databases have a minimum accuracy of 99.8%, so we normally suggest the fast or default versions.

Best,
Jenya
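
For readers who want the three tiers in machine-readable form, the list above can be written as a small lookup table. This is only a sketch: the names `TIERS`, `RFAM_SEEDS` and `clustering_rule` are made up for illustration, and reading "all 97%" as covering the 5S and 5.8S groups too is an assumption, not something the comment states explicitly.

```python
# Identity thresholds (percent) per tier, transcribed from the comment above.
# RFAM_SEEDS marks groups taken from RFAM seed sequences instead of clustering.
RFAM_SEEDS = "RFAM seeds"

TIERS = {
    "fast": {"bac-16S": 85, "5S": RFAM_SEEDS, "5.8S": RFAM_SEEDS, "rest": 90},
    "default": {"bac-16S": 90, "5S": RFAM_SEEDS, "5.8S": RFAM_SEEDS, "rest": 95},
    # Assumption: "all 97%" is read literally, so every group falls under "rest".
    "sensitive": {"rest": 97},
}


def clustering_rule(tier, group):
    """Return the rule for a group in a tier, falling back to the tier's 'rest'."""
    rules = TIERS[tier]
    return rules.get(group, rules["rest"])
```

For example, `clustering_rule("default", "bac-16S")` gives 90, while any group not listed for a tier gets that tier's "rest" value.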

@RAWWiberg
Author

Hi @ekopylova,
Wonderful! Thanks so much for the clarification!
Best,
Axel

@NicoleGruenheit

Hi,

I just tested the new database using version 4.3.6. However, I get vastly different results compared to the old database (all SILVA + RFAM). The sample I tested (Arabidopsis RNA) had 5% rRNA with the old database, but with the new one 19% of the reads align. Should the values be that different?

Also, with the new database I get a single overall value, whereas before it was very useful to see which taxonomic domain the rRNA stems from. Is there an option to get this output as well, or would I have to parse the blast output to get these values again?

Cheers,
Nicole

@ekopylova
Contributor

Hello Nicole,
Did you use version 4.3.6 to test the new database and the old database? Or did you use another SortMeRNA version for the old database?
Are you also testing on the exact same input reads file? The 5% rRNA is closer to what one would expect from an rRNA-depleted experiment, whereas 19% is closer to what one would expect when rRNA was not depleted beforehand.
Regarding the taxonomic domain, since SortMeRNA was optimized to recognize whether a sequence is rRNA or not rather than search for optimal alignments / assignments, we removed the output of "classification" as it could change depending on the order of databases passed. You can read more on this topic here.
Thanks,
Jenya
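
If a rough breakdown is still wanted, one option is to tally the blast output by reference ID, as Nicole suggests. The sketch below assumes BLAST-like tabular lines (tab-separated, read ID in column 1, reference ID in column 2) and groups hits by the part of the reference ID before the first ";", which matches the RFAM headers quoted later in this thread but is only a guess for other references. The function name `count_hits_by_source` is hypothetical. As noted above, the reported hit is not necessarily the optimal one and can depend on database order, so the counts are indicative at best.

```python
from collections import Counter


def count_hits_by_source(blast_lines):
    """Count aligned reads per reference-ID prefix in BLAST-like tabular lines."""
    first_hit = {}
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a tabular alignment record
        read_id, ref_id = fields[0], fields[1]
        # Keep only the first hit reported for each read (an arbitrary choice).
        first_hit.setdefault(read_id, ref_id)
    # Assumption: the text before the first ';' identifies the source/group.
    return Counter(ref_id.split(";", 1)[0] for ref_id in first_hit.values())
```

In practice the function can be fed an open file handle, e.g. `count_hits_by_source(open(path))`, where `path` points at the blast output file.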

@ltalignani

ltalignani commented Aug 24, 2022

Hello @ekopylova,

Where can we find these new smr_* databases? In the sortmerna directory?
Thanks in advance,

@ekopylova
Contributor

Hello,

The latest databases are here.

Best,
Jenya

@ltalignani

Thank you for the link!

Best regards,

@sghignone

sghignone commented Sep 8, 2022

Are there also separate taxonomy files for the new databases, or do I have to extract the taxonomy from the FASTA files?

I'm finding some missing taxonomies, e.g. in the file smr_v4.3_sensitive_db_rfam_seeds.fasta.

For example,

RFAM_14.1_RF00001_5S_rRNA;X01556.1/3-118
CUUGACGAUCAUAGAGCGUUGGAACCACCUGAUCCCUUCCCGAACUCAGAAGUGAAACGA
CGCAUCGCCGAUGGUAGUGUGGGGUUUCCCCAUGUGAGAGUAGGUCAUCGUCAAGC
RFAM_14.1_RF00001_5S_rRNA;X55260.1/3-119
UACGGCGGCCAUAGCGAAGGGGAAAUACCCGGUCCCAUCCCGAACCCGGAAGUCAAGCCC
UUCAGCGCCGAUGGUACUGCAACCGAGAGGCUGUGGGAGAGUAGGACGCCGCCGGAC
RFAM_14.1_RF00001_5S_rRNA;M16174.1/3-119
UACGGCGGCCAUAGCGGCGGGGAAACACCCGGUCCCAUGCCGAACCCGGAAGUUAAGCCU
GCCAGCGCCGAUGGUACUGCAACCGAGAGGCUGUGGGAGAGUAGGACGCCGCCGGAC
RFAM_14.1_RF00001_5S_rRNA;X55267.1/3-119
UACGGCGGCCAUAGCGGAGGGGAAACGCCCGGUCCCAUUCCGAACCCGGAAGCUAAGCCC

And in practice, the taxonomies for RFAM_14.1_RF00001_5S_rRNA are not reported in the fasta definition line.

thanks
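
One way to see which records lack extra annotation is to group the definition lines by how many ";"-separated fields they carry. This is a sketch under two assumptions: that header lines in the actual file start with ">" (the marker is not visible in the excerpt above), and that any taxonomy, where present, would show up as additional ";"-separated fields. The thread does not confirm the second point. The function name `headers_by_field_count` is hypothetical.

```python
def headers_by_field_count(fasta_lines, sep=";"):
    """Group FASTA definition lines by their number of sep-delimited fields."""
    groups = {}
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            n_fields = len(header.split(sep))
            groups.setdefault(n_fields, []).append(header)
    return groups
```

Headers such as `RFAM_14.1_RF00001_5S_rRNA;X01556.1/3-118` land in the two-field group, so a large two-field group would point at records with only a family name and an accession.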

@dangchenyuan

Thank you for the update.
I now have a question: can I use a single database path, "--ref smr_v4.3_default_db.fasta", instead of the several paths used previously, such as "--ref silva-bac-16s-id90.fasta,silva-bac-23s-id98.fasta,silva-arc-16s-id95.fasta,silva-arc-23s-id98.fasta"?
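
For scripting either variant, the reference arguments can be built from a list of paths. This sketch assumes SortMeRNA v4 expects one `--ref` flag per reference file, which should be checked against `sortmerna -h` for the installed version. The helper name `sortmerna_ref_args` is hypothetical, and the thread does not confirm that the single combined file is a drop-in replacement for the older set.

```python
def sortmerna_ref_args(ref_paths):
    """Return ['--ref', path, ...] with one --ref flag per reference file."""
    args = []
    for path in ref_paths:
        args += ["--ref", path]
    return args


# A single combined database versus several per-domain databases.
single = sortmerna_ref_args(["smr_v4.3_default_db.fasta"])
several = sortmerna_ref_args(
    ["silva-bac-16s-id90.fasta", "silva-bac-23s-id98.fasta"]
)
```

The resulting lists can be appended to the rest of the command line before passing it to `subprocess.run`.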

@cjfields

> Are there also separate taxonomy files for the new databases or I have to extract it from the fasta files?
>
> I'm finding some missing taxonomies, e.g. in the file smr_v4.3_sensitive_db_rfam_seeds.fasta.
>
> For example,

I just want to +1 this. We have a few complex projects where we're QC'ing RNA data to check for rRNA background in mixed samples (metatrx-like), so even having a rough idea on the taxonomic breakdown would be great.

@ppericard changed the title from "new databases documented" to "Add documentation for reference databases (SMR >= v4.3)" on May 31, 2023
@ppericard
Contributor

ppericard commented May 31, 2023

We need to add documentation on the following:

  • How the "new" reference databases were constructed (cleaning, clustering, ...)
  • What data sources were used (SILVA, RFAM, ...)
  • Basic statistics describing each database (number of sequences per kingdom, per rRNA type, ...)
  • How and when to best use each database (fast, default, sensitive)
  • How the taxonomic information for each sequence is encoded in those reference databases
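
Until such statistics are documented, a rough per-group count can be computed directly from a database FASTA file. The sketch below assumes header lines start with ">" and uses the text before the first ";" as the group key, as in the RFAM headers quoted earlier in this thread; whether that prefix maps cleanly onto kingdom or rRNA type for every record is not established here. The function name `fasta_stats` is hypothetical.

```python
from collections import defaultdict


def fasta_stats(fasta_lines):
    """Count sequences and total bases per header prefix in FASTA lines."""
    stats = defaultdict(lambda: {"sequences": 0, "bases": 0})
    group = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Assumption: the text before the first ';' names the group.
            group = line[1:].split(";", 1)[0]
            stats[group]["sequences"] += 1
        elif group is not None:
            stats[group]["bases"] += len(line)
    return dict(stats)
```

Dividing `bases` by `sequences` for a group gives its mean sequence length, which is one of the simpler figures such documentation could report.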
