-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MMseqs2 clustering and taxonomy #6574
base: main
Are you sure you want to change the base?
Conversation
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* && | ||
mmseqs createtaxdb | ||
'$createtaxdb.database_type.mmseqs2_db_select.fields.path' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try:
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* && | |
mmseqs createtaxdb | |
'$createtaxdb.database_type.mmseqs2_db_select.fields.path' | |
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* && | |
ln -s '$createtaxdb.database_type.mmseqs2_db_select.fields.path' taxdb && | |
mmseqs createtaxdb | |
'taxdb' |
If needed add an extension to taxdb
.
It seems that in the end this script
is executed and I guess that the trick should work.
Also wondering if we should download all the databases with every job or if we should create (or reuse existing reference data)? At least the ncbi taxonomy dump files should already be handled by a another data manager. Not sure about the mappings to uniprot.
Of course the option to allow users to provide their own mapping needs to be preserved.
Wondering if createdb
and createtaxdb
should be separate tools / data managers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also wondering if we should download all the databases with every job or if we should create (or reuse existing reference data)? At least the ncbi taxonomy dump files should already be handled by a another data manager. Not sure about the mappings to uniprot.
I don't know enough about this to be able to answer your question.
Wondering if createdb and createtaxdb should be separate tools / data managers.
We've already thought about it, but we thought it would be more interesting to be able to perform an end-to-end taxonomy analysis (fasta file to contig taxonomy) with a single galaxy module. But it can be discussed if you think it might be useful for other uses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the main question is if the output of createdb
and createtaxdb
can be used by multiple tools / reused later.
There are also disadvantages wrt reproducibility when the current DB is used. But of course it also has disadvantages.
My intuition tells me to split it. Also because in jobs that mainly download data need to be handled differently on compute systems. This is not possible if compute and download tasks are mixed in a tool.
But this is only my intuition and I leave this to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, one of the major problems with separating createdb and createtaxdb is the output from these 2 modules. There are several reasons for this:
- The output format is not supported by Galaxy
- createdb creates several files which must be used together (in a collection at the very most)
- createtaxdb creates a file which must be integrated with the other files in order to be able to use the DB as a taxDB, it can take as input either the files generated by createdb, or the files from a DB which doesn't have a taxo.
In other words, createtaxdb might not be in the actual xml because the databases I'm giving it as input at the moment are only the taxonomy DBs proposed by MMseqs, but there are other DM mmseqs2 data_tables that don't have a taxonomy and for which createtaxdb could be useful (https://github.com/soedinglab/mmseqs2/wiki#downloading-databases).
This is not possible if compute and download tasks are mixed in a tool
Could this be why I'm having a problem in my test?
Finally, I don't really know what to do. On the one hand, I think it's a good idea to split it up, as this allows more flexibility for the user, but on the other hand, it seems more complicated to set up and less intuitive for the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be why I'm having a problem in my test?
Unlikely. The problem seems to be just that the tool tries to write next to the input files (the databases) .. which is not possible. If the symlink trick does not work the only option seems to be to copy the database to the working directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try later by putting the database in a directory and making a symlink of that directory. Because the way the symlink was created only took into account the Swiss-Prot file in the database, but there are others (Swiss-Prot_h, Swiss-Prot_index...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The directory datatype might be a solution for the datatype questions.
Finally, I don't really know what to do.
Follow your intuition. Your arguments are perfectly reasonable. And we can still improve this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a first quick review of the tools. Thanks already for the massive amount of work.
Ensure that the _taxonomy file is not written to the test data section Co-authored-by: M Bernt <[email protected]>
This reverts commit 940613d. Back before macros when tests passed
It seems that the tests are not currently running because of problems with MMseqs2, in particular when downloading the taxonomy part for Uniprot (as the SILVA test is running correctly). I'll wait until the problem is solved, if it takes too long then I'll use other test databases |
This reverts commit 09e0d94. Back to macros parameters since tool issue has been resolved
…N parameters (automatically set by tool)
Hi @bernt-matthias, does the PR require further, more in-depth reviews (I guess so), perhaps by other reviewers ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had another look, mainly on the data manager.
Is there any relation between the nt and the tax-nt databases? For instance are there corresponding entries?
data_managers/data_manager_mmseqs2_database/data_manager/data_manager_mmseqs2_download.xml
Outdated
Show resolved
Hide resolved
data_managers/data_manager_mmseqs2_database/data_manager/data_manager_mmseqs2_download.xml
Outdated
Show resolved
Hide resolved
data_managers/data_manager_mmseqs2_database/data_manager/data_manager_mmseqs2_download.xml
Show resolved
Hide resolved
data_managers/data_manager_mmseqs2_database/data_manager/data_manager_mmseqs2_download.xml
Outdated
Show resolved
Hide resolved
data_managers/data_manager_mmseqs2_database/data_manager/data_manager_mmseqs2_download.xml
Outdated
Show resolved
Hide resolved
data_managers/data_manager_mmseqs2_database/data_manager_conf.xml
Outdated
Show resolved
Hide resolved
data_managers/data_manager_mmseqs2_database/data_manager_conf.xml
Outdated
Show resolved
Hide resolved
data_managers/data_manager_mmseqs2_database/tool_data_table_conf.xml.sample
Outdated
Show resolved
Hide resolved
What do you mean with relation ? If the databases in mmseqs2_nucleotide_databases and mmseqs2_nucleotide_taxonomy_databases are the same ? For example SILVA which is in both datatable ? |
One suggestion: If we can agree on a set of data tables and their columns we can also split the PR in two parts: one for the tool and one for the data manager. To me it seems that the datamanager will need a while and will slow down the progreee of the tool. |
No problem if you think it's the best way to do it. I'm not an expert and I don't know much concerning all data tables already present in galaxy and what we can use and how to do it, but I'm curious to learn.
|
FOR CONTRIBUTOR: