SILVA reference #21

Open
sjanssen2 opened this issue Nov 18, 2017 · 40 comments

Comments

@sjanssen2
Collaborator

sjanssen2 commented Nov 18, 2017

Improvement Description
It should be possible to download the QIIME-compatible version of SILVA and construct a reference phylogeny and alignment for SEPP to enable 18S analyses.

Questions

  1. @josenavas @wasade do you know if release 128 is the latest?

  2. How and where would we host SEPP-compatible references? Within this plugin (which is already 130 MB), or in the GitHub repo?

@josenavas

We could store the references on the FTP server.

@rachaellappan

Hi @sjanssen2,

Would it be possible in the near future to also create and make available in QIIME2 a pre-compiled SILVA v132 database? I note your comment here that making the database ready for use in q2-fragment-insertion takes around 2 weeks, which is my main reason for not attempting the steps outlined here by @smirarab.

It's great that a pre-compiled SILVA v128 database comes packaged with this plugin in QIIME! I've simply already done some analysis with SILVA v132 and am on a tight schedule, so don't have the time to re-analyse with 128 - at the moment this unfortunately prevents me from using the fragment insertion method to build trees.

Cheers,
Rachael

@thermokarst
Contributor

Hey there @rachaellappan --- we would love to get some help with this task - are you interested? If you don't have the bandwidth, maybe you could cross-post this request to the QIIME 2 Forum, that way more eyes see this? Thanks!

@antgonza
Member

Just adding to the discussion: for the GG release we did a lot of benchmarks, and these are basically what was used in the fragment insertion paper. However, AFAIK, such benchmarks have not been done for SILVA, so it would be great if someone actually ran them, in case @rachaellappan is interested.

@sjanssen2
Collaborator Author

Regarding benchmarks: there is already a lot of infrastructure in place, for example the wonderful repo https://github.com/caporaso-lab/tax-credit-data/ which I used a couple of months ago to add SEPP as another tool to assign taxonomy, and of course all the notebooks I used for our paper https://msystems.asm.org/content/3/3/e00021-18

I think we should first provide the necessary changes for SEPP to deal with different references before we think too hard about benchmark results.

@antgonza
Member

I'd argue that having them at the same time would be great. As you can imagine, once it's out there, it's out there, and if there is a bug or something wrong that wasn't caught because there were no benchmarks, it can get ugly ... my 2 pesos!

@rachaellappan

Hi @thermokarst, I will post to the QIIME2 forum. I would like to help out but I'm not very familiar with what is being done here and whether these steps are all that's required.

If I understand correctly, I agree that benchmarking SILVA (to demonstrate/confirm the improvement that fragment insertion offers over de novo trees in the case of SILVA?) would be ideal to do around the same time as providing v132 for SEPP. The SILVA aligned rep set doesn't specify whether it's 16S or 18S - does it contain both? - so the results may be different to GG.

I'm probably not the person to do this - no experience with benchmarking =)

@smirarab

smirarab commented Jan 18, 2019 via email

@adityabandla

adityabandla commented Feb 27, 2019

> Hey there @rachaellappan --- we would love to get some help with this task - are you interested? If you don't have the bandwidth, maybe you could cross-post this request to the QIIME 2 Forum, that way more eyes see this? Thanks!

In case this hasn't been done yet, I would be glad to pitch in. But I would need the scripts required to process the QIIME-formatted SILVA file (SILVA_132_QIIME_release/rep_set_aligned/99/99_alignment.fna).

@adityabandla

Can anyone confirm if these modified steps would be right (taken from https://github.com/smirarab/sepp-refs/tree/master/silva)?

99_alignment.fna has 425098 sequences
run_seqtools.py -masksites 2125 -infile 99_alignment.fna -outfile 99_alignment_masked.fna
nw_topology -bI 99_otus.tre > 99_otus_nice.tree
raxmlHPC-PTHREADS -s 99_alignment_masked.fna -m GTRCAT -n scoreF-99_alignment_masked.fna -g 99_otus_nice.tree -F -T 24 -p 8956
raxmlHPC-PTHREADS -s 99_alignment_masked.fna -m GTRCAT -n score-bl-99_alignment_masked.fna -F -f e -t RAxML_result.scoreF-99_alignment_masked.fna -T 24 -p 10625

@adityabandla

Is this issue still alive?

@sjanssen2
Collaborator Author

Hi Aditya,
Yes, it is still current, but maybe not too active at the moment. I am very busy meeting important deadlines until mid-March. After that, this is on my to-do list, and help is extremely welcome, since I think this issue is a showstopper for many application scenarios.

@adityabandla

Hi Stefan

Sure. I was wondering if I can get started on this at my end, since it's a heavy compute job. All I need is for someone to confirm the steps that need to be run.

Of course, I will share the files for review once done, and perhaps that would be mid-March already.

@sjanssen2
Collaborator Author

All I know about Silva is what Siavash did to convert / prepare the data for Silva 128: https://github.com/smirarab/sepp-refs/tree/master/silva Maybe you can infer from that whether you are dealing with the correct files?

@adityabandla

Yes, Stefan, I went through what Siavash had done and am sure I have the correct files with me. I wasn't entirely clear, though, on how the masksites parameter was chosen for the first step. That's where I need some advice, as the total number of sequences is different for v132.

Perhaps @smirarab can pitch in?
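
On the question of how the masksites value might scale with the number of sequences: the value 2125 proposed above is about 0.5% of the 425,098 sequences in the v132 alignment. Whether a fixed fraction is how the original value was chosen is not confirmed anywhere in this thread, so the sketch below only shows the arithmetic. The helper `masksites_threshold` and its default fraction are assumptions for illustration and are not part of SEPP.

```python
def masksites_threshold(n_sequences: int, min_fraction: float = 0.005) -> int:
    """Return a candidate -masksites value as a fraction of the sequence count.

    ASSUMPTION: the threshold scales linearly with the number of sequences.
    The default of 0.005 is inferred from 2125 / 425098 and is not documented
    in this thread.
    """
    if n_sequences <= 0:
        raise ValueError("n_sequences must be positive")
    if not 0 < min_fraction <= 1:
        raise ValueError("min_fraction must be in (0, 1]")
    # Truncate towards zero: 425098 * 0.005 = 2125.49 becomes 2125.
    return int(n_sequences * min_fraction)


print(masksites_threshold(425098))  # 2125
```

If the fraction turns out to be something else, only `min_fraction` needs to change; the sequence count can be taken from the number of header lines in the alignment.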

@sjanssen2
Collaborator Author

Oops, now I see that you already pointed to this link. Sorry for not paying enough attention :-/

@adityabandla

Any updates on this? We are well past mid-March.

@sjanssen2
Collaborator Author

Hi Aditya,

Fair point. Sorry for the delay. I started working on SEPP itself to add the ability to change the reference in a convenient way for QIIME2 users. This work should include a) adding SEPP to a CI system (Travis), b) updating the code style, c) adding the ability to pass info files to the sepp binaries, and d) packaging SEPP as a bioconda recipe. I am happy to receive code reviews on smirarab/sepp#41 to increase visibility and quality.

I just downloaded the 3 GB of Silva's QIIME-compatible version 132: https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip I am pretty confident that the alignment file is SILVA_132_QIIME_release/rep_set_aligned/99/99_alignment.fna.zip and the matching phylogeny is SILVA_132_QIIME_release/trees/99/99_otus.tre. Both hold the very same 425,098 identifiers.

I figure you already know the right computational steps to perform, but I am not totally sure whether the numeric parameters will also work for the slightly larger 132 release. Guess we will learn that the hard way :-/
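
The identifier check mentioned above (alignment and tree holding the same identifiers) can be sketched roughly as follows. This is a minimal standard-library sketch, not the procedure actually used here: `fasta_ids` and `newick_leaf_names` are hypothetical helpers, and the Newick parsing assumes unquoted leaf names without spaces or special characters.

```python
import re
from typing import Iterable, Set


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the first whitespace-delimited token of every FASTA header."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


# A leaf name directly follows "(" or ","; internal node labels follow ")"
# and are therefore not captured. ASSUMPTION: leaf names are unquoted.
_LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def newick_leaf_names(newick: str) -> Set[str]:
    """Return the set of leaf names found in a Newick string."""
    return set(_LEAF.findall(newick))


# Tiny made-up example; real input would be 99_alignment.fna and 99_otus.tre.
alignment = [">A some description", "AC-GU", ">B", "ACGGU", ">C", "A--GU"]
tree = "((A:0.1,B:0.2)0.9:0.3,C:0.4);"

only_in_alignment = fasta_ids(alignment) - newick_leaf_names(tree)
only_in_tree = newick_leaf_names(tree) - fasta_ids(alignment)
print(len(only_in_alignment), len(only_in_tree))  # 0 0
```

For the real files, an open file handle can be passed to `fasta_ids` directly, and the tree file can be read into a single string before calling `newick_leaf_names`.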

@smirarab

smirarab commented Mar 25, 2019 via email

@sjanssen2
Collaborator Author

I am trying to create a bioconda recipe for Siavash's SEPP program (without the large reference files) to support, in the long run, different references like Silva or others.
Currently, the recipe fails linting, since I don't know how to properly deal with the situation that Python is in principle platform-independent, but SEPP ships pre-compiled, platform-dependent binaries. Can someone please help, maybe @thermokarst or @ebolyen?

@adityabandla

Is this still being considered?

@sjanssen2
Collaborator Author

The bioconda package has been created: https://anaconda.org/bioconda/sepp (without reference files), but is not yet integrated into Qiime2.

@adityabandla

Stefan, that's great to hear. Are the updated reference files for SILVA available as well?

@sjanssen2
Collaborator Author

Hi @adityabandla,

Files for Silva 128 (phylogeny, alignment and info) are shipped with the default Qiime2 install and should be located in $CONDA_PREFIX/share/fragment-insertion/ref (activate your conda environment first so that CONDA_PREFIX points to the right directory).
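
As a rough way to check that location from Python, the sketch below resolves the directory from `CONDA_PREFIX` and lists what is in it. The helper names `reference_dir` and `list_reference_files` are made up for illustration, and the prefix in the example call is a hypothetical path, not a real install location.

```python
import os
from pathlib import Path
from typing import List, Mapping


def reference_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Build <CONDA_PREFIX>/share/fragment-insertion/ref from the environment."""
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        raise RuntimeError("CONDA_PREFIX is not set; activate the conda environment first")
    return Path(prefix) / "share" / "fragment-insertion" / "ref"


def list_reference_files(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the sorted file names in the reference directory (empty if it is missing)."""
    ref = reference_dir(env)
    if not ref.is_dir():
        return []
    return sorted(p.name for p in ref.iterdir() if p.is_file())


# Hypothetical prefix, only to show the path that would be inspected.
print(reference_dir({"CONDA_PREFIX": "/opt/conda/envs/qiime2"}))
```

Inside an activated environment, calling `list_reference_files()` without arguments uses the real `CONDA_PREFIX`.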

Did you succeed in creating a reference for Silva 132? If so, would you be willing to share those files with me / the Qiime community?

My PR #32 contains the necessary updates for the qiime2 wrapper to cope with the new parameter for the info file, but it is not yet merged into master. Thus, to use references other than Greengenes 13.8 you either have to overwrite the info file each time or use the run-sepp.sh script directly.

Best,
Stefan

@adityabandla

Hi Stefan

Sorry, I never managed to get to it. I just started and ran into this error at the very first step:

Traceback (most recent call last):
  File "run_seqtools.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "run_seqtools.py", line 36, in <module>
    alg.read_file_object(args.infile,args.informat)
  File "alignment.py", line 1335, in read_file_object
    for name, seq in read_func(file_obj):
  File "alignment.py", line 75, in read_fasta
    raise Exception("Error: illegal characeters in sequence at line %d" % line_number)
Exception: Error: illegal characeters in sequence at line 1

@sjanssen2
Collaborator Author

Hi @adityabandla, I would need much more information about what you are trying to execute to be able to help with debugging.

@adityabandla

I get that error when I try to run the following command:
run_seqtools.py -masksites 2125 -infile 99_alignment.fna -outfile 99_alignment_masked.fna

Please let me know if you need additional details

@smirarab

smirarab commented Jun 27, 2019 via email

@adityabandla

@smirarab Siavash, it's the file I downloaded from the SILVA website, https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip, specifically SILVA_132_QIIME_release/rep_set_aligned/99/99_alignment.fna.zip

sjanssen2 mentioned this issue Sep 2, 2019

@Rkubinski

Rkubinski commented Nov 5, 2019

@adityabandla @smirarab is there any progress on using SILVA 132?

@smirarab

smirarab commented Nov 18, 2019 via email

@smirarab

smirarab commented Dec 5, 2019 via email

@zhanxw

zhanxw commented Jan 21, 2020

@smirarab Your question is also related to mine: smirarab/sepp-refs#2.
In SILVA 128, the FASTA file has dots too. Do you know a solution to get run_seqtools.py working?
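
One possible workaround for the dots, sketched under two assumptions the thread does not confirm: that the dots are the characters run_seqtools.py rejects, and that treating them as ordinary gaps is acceptable for building the reference. As far as I know, dots in SILVA/ARB alignments mark missing data rather than true gaps, so this conversion discards that distinction. `dots_to_gaps` is a hypothetical helper, not part of SEPP.

```python
from typing import Iterable, Iterator


def dots_to_gaps(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with '.' replaced by '-' in sequence lines only."""
    for line in lines:
        if line.startswith(">"):
            # Header lines are left untouched, since identifiers may contain dots.
            yield line
        else:
            yield line.replace(".", "-")


# Made-up two-line record, only to show the effect.
example = [">seq.1\n", "....ACGU--AC..\n"]
print("".join(dots_to_gaps(example)), end="")
```

For a real alignment the same generator can be streamed from one open file into another with `writelines`, which avoids loading the whole file into memory.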

@smirarab

smirarab commented Jan 21, 2020 via email

@ETaSky

ETaSky commented Jun 30, 2020

Any updates on this issue? Thanks!

@smirarab

smirarab commented Jul 2, 2020 via email

@jgerken

jgerken commented Nov 25, 2020

@smirarab the first sequence seems anomalous at first glance, so it might be good to exclude it. For the other sequences, I checked some of the accession numbers, and they are from genome or WGS sequence set entries. Those entries sometimes contain contamination from different domains, and I am pretty sure that this is the case here. I think we should discuss how the sequences that are included in the tree are selected and whether that can be optimised to leave these problematic sequences out. By the way, the current SILVA release is 138.1.
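
If specific sequences do end up being excluded, dropping them from the alignment could look roughly like the sketch below. The identifiers are made up, `drop_records` is a hypothetical helper, and the matching leaves would also have to be pruned from the tree, which is not shown here.

```python
from typing import Iterable, Iterator, Set


def drop_records(lines: Iterable[str], excluded_ids: Set[str]) -> Iterator[str]:
    """Yield FASTA lines, skipping every record whose identifier is excluded."""
    keep = True  # lines before the first header are passed through unchanged
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            keep = seq_id not in excluded_ids
        if keep:
            yield line


# Made-up records; "suspect.2" stands in for a sequence flagged as problematic.
records = [
    ">good.1\n", "AC-GU\n",
    ">suspect.2 some description\n", "ACGGU\n", "AC--U\n",
    ">good.3\n", "A--GU\n",
]
kept = list(drop_records(records, {"suspect.2"}))
print("".join(kept), end="")
```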

I am not familiar with QIIME, the fragment placing plugin or SEPP. I think the easiest approach would be for you to send an email to our support address (contact(at)arb-silva.de) with a short summary of what data is needed, how it is compiled, and which issues you have (maybe there are more than just the rooting of the trees?). With that information we will try to help you solve the issues you are facing. We would also like to host the reference files on the SILVA website and see whether we can find a way to generate them automatically with new SILVA releases, if possible.

All the best
Jan from the SILVA team

@smirarab

smirarab commented Dec 10, 2020 via email

@lisa55asil

Any update on a SILVA reference database formatted for SEPP through qiime2?

@sjanssen2
Collaborator Author

Not that I am aware of, unfortunately.
