Add MMseqs2 clustering and taxonomy #6574
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
ToolVersionPEP404 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
name: data_manager_mmseqs2_database | ||
owner: iuc | ||
description: "MMseqs2 is an ultra fast and sensitive sequence search and clustering suite" | ||
homepage_url: "https://github.com/soedinglab/MMseqs2" | ||
long_description: | | ||
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. | ||
MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. | ||
The software is designed to run on multiple cores and servers and exhibits very good scalability. | ||
MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. | ||
It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed. | ||
remote_repository_url: "https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_mmseqs2_database" | ||
type: unrestricted | ||
categories: | ||
- Data Managers |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
<tool id="data_manager_mmseqs2_download" name="Download MMseqs2 databases" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" tool_type="manage_data" profile="22.05"> | ||
<description></description> | ||
<macros> | ||
<token name="@TOOL_VERSION@">15.6f452</token> | ||
<token name="@VERSION_SUFFIX@">0</token> | ||
</macros> | ||
<requirements> | ||
<requirement type="package" version="@TOOL_VERSION@">mmseqs2</requirement> | ||
</requirements> | ||
<command detect_errors="exit_code"><![CDATA[ | ||
#set $database_name = str($database).split('/')[-1] if '/' in str($database) else str($database) | ||
mkdir -p '$database_name' && | ||
mkdir -p '$out_file.extra_files_path' && | ||
mmseqs databases | ||
'$database' '$database_name'/database | ||
'tmp' | ||
--threads "\${GALAXY_SLOTS:-1}" && | ||
mv '$database_name' '$out_file.extra_files_path' && | ||
cp '$dmjson' '$out_file' | ||
]]></command> | ||
<configfiles> | ||
<configfile name="dmjson"><![CDATA[ | ||
#from datetime import date | ||
#set $database_name = str($database).split('/')[-1] if '/' in str($database) else str($database) | ||
{ | ||
"data_tables":{ | ||
"$db_name.type":[ | ||
{ | ||
"value": "${database}-@TOOL_VERSION@-#echo date.today().strftime('%d%m%Y')#", | ||
"name": "${database} #echo date.today().strftime('%d%m%Y')#", | ||
"path": "$database_name", | ||
"version": "@TOOL_VERSION@" | ||
} | ||
] | ||
} | ||
}]]> | ||
</configfile> | ||
</configfiles> | ||
<inputs> | ||
<conditional name="db_name"> | ||
<param argument="type" type="select" label="Type of Databases"> | ||
<option value="mmseqs2_aminoacid_databases" selected="true">Aminoacid databases</option> | ||
<option value="mmseqs2_aminoacid_taxonomy_databases">Aminoacid databases that can be used for taxonomy</option> | ||
<option value="mmseqs2_nucleotide_databases">Nucleotide databases</option> | ||
<option value="mmseqs2_nucleotide_taxonomy_databases">Nucleotide databases that can be used for taxonomy</option> | ||
<option value="mmseqs2_profile_databases">Profile databases</option> | ||
</param> | ||
<when value="mmseqs2_aminoacid_databases"> | ||
<param name="database" type="select" label="MMseqs2 aminoacid databases"> | ||
<option value="UniRef100" selected="true">UniRef100</option> | ||
<option value="UniRef90">UniRef90</option> | ||
<option value="UniRef50">UniRef50</option> | ||
<option value="UniProtKB">UniProtKB</option> | ||
<option value="UniProtKB/TrEMBL">TrEMBL (UniProtKB)</option> | ||
<option value="UniProtKB/Swiss-Prot">Swiss-Prot (UniProtKB)</option> | ||
<option value="NR">NR (Non-redundant protein sequences from GenPept, Swiss-Prot, PIR, PRF, PDB, and NCBI RefSeq)</option> | ||
<option value="GTDB">GTDB (Genome Taxonomy Database)</option> | ||
<option value="PDB">PDB (The Protein Data Bank)</option> | ||
</param> | ||
</when> | ||
<when value="mmseqs2_aminoacid_taxonomy_databases"> | ||
<param name="database" type="select" label="MMseqs2 aminoacid databases that can be used for taxonomy"> | ||
<option value="UniRef100" selected="true">UniRef100</option> | ||
<option value="UniRef90">UniRef90</option> | ||
<option value="UniRef50">UniRef50</option> | ||
<option value="UniProtKB">UniProtKB</option> | ||
<option value="UniProtKB/TrEMBL">TrEMBL (UniProtKB)</option> | ||
<option value="UniProtKB/Swiss-Prot">Swiss-Prot (UniProtKB)</option> | ||
<option value="NR">NR (Non-redundant protein sequences from GenPept, Swiss-Prot, PIR, PRF, PDB, and NCBI RefSeq)</option> | ||
<option value="GTDB">GTDB (Genome Taxonomy Database)</option> | ||
What are these databases that are downloaded (FASTA, some index, or something else)? For instance, is GTDB the full GTDB?
I'm pretty sure that the GTDB MMseqs2 database is not the full GTDB. I'm downloading it with
OK. So one thing that we should probably take care of (if we use FASTA from other data tables) is that the MMseqs2 files are installed to a separate folder.
It doesn't have to be in a separate folder, but I prefer to work that way, as several files make up the output.
</param> | ||
</when> | ||
<when value="mmseqs2_nucleotide_databases"> | ||
<param name="database" type="select" label="MMseqs2 nucleotide databases"> | ||
<option value="SILVA">SILVA</option> | ||
<option value="Kalamari">Kalamari</option> | ||
<option value="NT">NT (Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS)</option> | ||
<option value="Resfinder">Resfinder</option> | ||
</param> | ||
</when> | ||
<when value="mmseqs2_nucleotide_taxonomy_databases"> | ||
<param name="database" type="select" label="MMseqs2 nucleotide databases that can be used for taxonomy"> | ||
<option value="SILVA">SILVA</option> | ||
<option value="Kalamari">Kalamari</option> | ||
</param> | ||
</when> | ||
<when value="mmseqs2_profile_databases"> | ||
<param name="database" type="select" label="MMseqs2 profile databases"> | ||
<option value="PDB70">PDB70 (PDB clustered to 70% sequence identity)</option> | ||
<option value="Pfam-A.full">Pfam-A.full</option> | ||
<option value="Pfam-A.seed">Pfam-A.seed</option> | ||
<option value="Pfam-B">Pfam-B</option> | ||
<option value="CDD">CDD (Conserved Domain Database)</option> | ||
<option value="VOGDB">VOGDB (Virus Orthologous Groups)</option> | ||
<option value="dbCAN2">dbCAN2 (database of carbohydrate-active enzymes)</option> | ||
</param> | ||
</when> | ||
</conditional> | ||
</inputs> | ||
<outputs> | ||
<data name="out_file" format="data_manager_json" label="${tool.name}"/> | ||
</outputs> | ||
<tests> | ||
<test expect_num_outputs="1"> | ||
<conditional name="db_name"> | ||
<param name="type" value="mmseqs2_nucleotide_taxonomy_databases" /> | ||
<param name="database" value="SILVA" /> | ||
</conditional> | ||
<output name="out_file"> | ||
<assert_contents> | ||
<has_text text='"mmseqs2_nucleotide_taxonomy_databases":'/> | ||
<has_text text='"version": "15.6f452"'/> | ||
<has_text_matching expression='"value": "SILVA-15.6f452-[0-9]{8}"'/> | ||
<has_text_matching expression='"name": "SILVA [0-9]{8}"'/> | ||
<has_text text='"path": "SILVA"'/> | ||
</assert_contents> | ||
</output> | ||
</test> | ||
<test expect_num_outputs="1"> | ||
<conditional name="db_name"> | ||
<param name="type" value="mmseqs2_aminoacid_taxonomy_databases" /> | ||
<param name="database" value="UniProtKB/Swiss-Prot" /> | ||
</conditional> | ||
<output name="out_file"> | ||
<assert_contents> | ||
<has_text text='"mmseqs2_aminoacid_taxonomy_databases":'/> | ||
<has_text text='"version": "15.6f452"'/> | ||
<has_text_matching expression='"value": "UniProtKB/Swiss-Prot-15.6f452-[0-9]{8}"'/> | ||
<has_text_matching expression='"name": "UniProtKB/Swiss-Prot [0-9]{8}"'/> | ||
<has_text text='"path": "Swiss-Prot"'/> | ||
</assert_contents> | ||
</output> | ||
</test> | ||
</tests> | ||
<help><![CDATA[ | ||
This tool downloads databases that can be used with MMseqs2. | ||
]]></help> | ||
<citations> | ||
<citation type="doi">10.1038/nbt.3988</citation> | ||
</citations> | ||
</tool> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
<data_managers> | ||
<data_manager tool_file="data_manager/data_manager_mmseqs2_download.xml" id="mmseqs2_download_databases"> | ||
<data_table name="mmseqs2_aminoacid_databases"> | ||
<output> | ||
<column name="value"/> | ||
<column name="name"/> | ||
<column name="path" output_ref="out_file"> | ||
<move type="directory"> | ||
<source>${path}</source> | ||
<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">mmseqs2/${path}</target> | ||
</move> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/mmseqs2/${path}</value_translation> | ||
<value_translation type="function">abspath</value_translation> | ||
</column> | ||
<column name="version"/> | ||
</output> | ||
</data_table> | ||
<data_table name="mmseqs2_aminoacid_taxonomy_databases"> | ||
<output> | ||
<column name="value"/> | ||
<column name="name"/> | ||
<column name="path" output_ref="out_file"> | ||
<move type="directory"> | ||
<source>${path}</source> | ||
<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">mmseqs2/${path}</target> | ||
</move> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/mmseqs2/${path}</value_translation> | ||
<value_translation type="function">abspath</value_translation> | ||
</column> | ||
<column name="version"/> | ||
</output> | ||
</data_table> | ||
<data_table name="mmseqs2_nucleotide_databases"> | ||
<output> | ||
<column name="value"/> | ||
<column name="name"/> | ||
<column name="path" output_ref="out_file"> | ||
<move type="directory"> | ||
<source>${path}</source> | ||
<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">mmseqs2/${path}</target> | ||
</move> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/mmseqs2/${path}</value_translation> | ||
<value_translation type="function">abspath</value_translation> | ||
</column> | ||
<column name="version"/> | ||
</output> | ||
</data_table> | ||
<data_table name="mmseqs2_nucleotide_taxonomy_databases"> | ||
<output> | ||
<column name="value"/> | ||
<column name="name"/> | ||
<column name="path" output_ref="out_file"> | ||
<move type="directory"> | ||
<source>${path}</source> | ||
<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">mmseqs2/${path}</target> | ||
</move> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/mmseqs2/${path}</value_translation> | ||
<value_translation type="function">abspath</value_translation> | ||
</column> | ||
<column name="version"/> | ||
</output> | ||
</data_table> | ||
<data_table name="mmseqs2_profile_databases"> | ||
<output> | ||
<column name="value"/> | ||
<column name="name"/> | ||
<column name="path" output_ref="out_file"> | ||
<move type="directory"> | ||
<source>${path}</source> | ||
<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">mmseqs2/${path}</target> | ||
</move> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/mmseqs2/${path}</value_translation> | ||
<value_translation type="function">abspath</value_translation> | ||
</column> | ||
<column name="version"/> | ||
</output> | ||
</data_table> | ||
</data_manager> | ||
</data_managers> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
UniProtKB/Swiss-Prot-15.6f452-02122024 UniProtKB/Swiss-Prot 02122024 /tmp/tmphqvxgt7v/galaxy-dev/tool-data/mmseqs2/Swiss-Prot 15.6f452 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
SILVA-15.6f452-02122024 SILVA 02122024 /tmp/tmphqvxgt7v/galaxy-dev/tool-data/mmseqs2/SILVA 15.6f452 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
#This is a sample file distributed with Galaxy that enables tools | ||
#to use a directory of metagenomics files. | ||
#file has this format (white space characters are TAB characters) | ||
#UniRef100-16102024 UniRef100 (MMseqs2) UniRef100.15.6f452 /path/to/data 15.6f452 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
<tables> | ||
<table name="mmseqs2_aminoacid_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="tool-data/mmseqs2_aminoacid_databases.loc"/> | ||
</table> | ||
<table name="mmseqs2_aminoacid_taxonomy_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="tool-data/mmseqs2_aminoacid_taxonomy_databases.loc"/> | ||
</table> | ||
<table name="mmseqs2_nucleotide_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="tool-data/mmseqs2_nucleotide_databases.loc"/> | ||
</table> | ||
<table name="mmseqs2_nucleotide_taxonomy_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="tool-data/mmseqs2_nucleotide_taxonomy_databases.loc"/> | ||
</table> | ||
<table name="mmseqs2_profile_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="tool-data/mmseqs2_profile_databases.loc"/> | ||
</table> | ||
</tables> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
<tables> | ||
<table name="mmseqs2_aminoacid_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="${__HERE__}/test-data/mmseqs2_aminoacid_databases.loc.test"/> | ||
</table> | ||
<table name="mmseqs2_aminoacid_taxonomy_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="${__HERE__}/test-data/mmseqs2_aminoacid_taxonomy_databases.loc.test"/> | ||
</table> | ||
<table name="mmseqs2_nucleotide_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="${__HERE__}/test-data/mmseqs2_nucleotide_databases.loc.test"/> | ||
</table> | ||
<table name="mmseqs2_nucleotide_taxonomy_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="${__HERE__}/test-data/mmseqs2_nucleotide_taxonomy_databases.loc.test"/> | ||
</table> | ||
<table name="mmseqs2_profile_databases" comment_char="#"> | ||
<columns>value, name, path, version</columns> | ||
<file path="${__HERE__}/test-data/mmseqs2_profile_databases.loc.test"/> | ||
</table> | ||
</tables> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
ToolVersionPEP404 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
name: mmseqs2 | ||
owner: iuc | ||
description: MMseqs2 is an ultra fast and sensitive sequence search and clustering suite | ||
long_description: | | ||
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. | ||
MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. | ||
The software is designed to run on multiple cores and servers and exhibits very good scalability. | ||
MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. | ||
It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed. | ||
categories: | ||
- Sequence Analysis | ||
- Metagenomics | ||
homepage_url: https://github.com/soedinglab/MMseqs2 | ||
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/mmseqs2 | ||
type: unrestricted | ||
auto_tool_repositories: | ||
name_template: "{{ tool_id }}" | ||
description_template: "Wrapper for the MMseqs2 tool suite: {{ tool_name }}" | ||
suite: | ||
name: "suite_mmseqs2" | ||
description: "MMseqs2 is an ultra fast and sensitive sequence search and clustering suite" |

One thing that I do not like about this is that most of the downloaded databases are not properly versioned. For instance, GTDB is downloaded from latest: https://github.com/soedinglab/MMseqs2/blob/c2c3ad9c2956fac691d5a6041a9a4affa7fa27ad/data/workflow/databases.sh#L148
So we cannot guarantee that GTDB on one Galaxy is the same as GTDB on another Galaxy.
That is not your fault, but upstream's. I would suggest asking upstream whether they could provide versioned downloads.
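
To make the versioning concern concrete, here is a minimal Python sketch of what the `dmjson` config file in this PR writes for each download. It mirrors the Cheetah template (same `split('/')` rule, same `%d%m%Y` date format); the function name `build_entry` is hypothetical and only used for illustration. Note that the resulting `value` records the MMseqs2 tool version and the download date, but nothing about the upstream database release.

```python
from datetime import date


def build_entry(database: str, tool_version: str, today: date) -> dict:
    """Mirror the fields written by the dmjson config file in this PR."""
    # Same rule as the Cheetah #set line: keep only the part after the last '/'.
    database_name = database.split("/")[-1]
    # Download date only; this says nothing about the upstream release.
    stamp = today.strftime("%d%m%Y")
    return {
        "value": f"{database}-{tool_version}-{stamp}",
        "name": f"{database} {stamp}",
        "path": database_name,
        "version": tool_version,
    }


entry = build_entry("UniProtKB/Swiss-Prot", "15.6f452", date(2024, 12, 2))
print(entry["value"])  # UniProtKB/Swiss-Prot-15.6f452-02122024
```

Two Galaxy servers that download the same database on different days get different `value` strings, and two servers that download on the same day get identical strings even if upstream changed the files in between.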

Maybe it would be better to have a data manager that takes FASTA as input and calls `mmseqs createdb`, like here: https://github.com/soedinglab/MMseqs2/blob/c2c3ad9c2956fac691d5a6041a9a4affa7fa27ad/data/workflow/databases.sh#L388
The FASTA could be taken from other data tables, but that is difficult because it would involve multiple databases.
Or is the FASTA removed before it is added to the data table? They call `rmdb` ....
The main question for me is: what is actually stored in the output folder, and how big is it?
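
As a rough illustration of the FASTA-based alternative, the sketch below only assembles the `mmseqs createdb <fasta> <sequenceDB>` command line; it does not run it. The helper name `createdb_command` and the example file names are hypothetical, and the `database` basename is an assumption borrowed from the `'$database_name'/database` target used by the download tool in this PR.

```python
from pathlib import Path
from typing import List


def createdb_command(fasta: str, out_dir: str, db_basename: str = "database") -> List[str]:
    """Return the argv for `mmseqs createdb <fasta> <sequenceDB>` without running it."""
    # The sequence DB is a set of files sharing this path as a common prefix.
    target = Path(out_dir) / db_basename
    return ["mmseqs", "createdb", fasta, str(target)]


# Hypothetical input and output names, for illustration only.
print(" ".join(createdb_command("swissprot.fasta", "Swiss-Prot")))
```

Returning the argument list rather than a shell string keeps quoting out of the picture, which matters for database names that contain a `/`, such as `UniProtKB/Swiss-Prot`.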

Yes, I agree with you about the database versions. Do I need to create an issue on the MMseqs2 repository to ask them?
If I understand correctly, the idea would be to take the FASTA files that already exist in Galaxy for the databases offered by MMseqs2 and use `createdb` to create a database that can be used in the command suite without having to download the `mmseqs databases`. I wonder how easy it is to find out which FASTA files are used to build the `mmseqs databases`?
The output folder of the `createdb` command has the same composition as the output of `mmseqs databases` (you can find an example in the Swiss-Prot directory in the test files). There is a text file with sequences representing the database (not in FASTA format), index files, and files containing general information (lookup file, identifiers assigned by MMseqs2 and their correspondence with the original sequences).
For the test file, which is 568K, the `createdb` output directory is 836K (I don't know whether such a small file is a useful reference).
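
To answer the "how big is it?" question for a real database rather than the small test file, the output directory can be measured directly. The sketch below is a plain Python helper (the name `directory_size` is hypothetical) that sums the sizes of all regular files under a directory; the demo builds a throwaway directory so that it runs without any MMseqs2 data.

```python
import os
import tempfile


def directory_size(path: str) -> int:
    """Sum the sizes in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


# Demo on a temporary directory standing in for a database folder.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "database"), "wb") as handle:
        handle.write(b"x" * 1024)
    with open(os.path.join(demo_dir, "database.index"), "wb") as handle:
        handle.write(b"y" * 256)
    print(directory_size(demo_dir))  # 1280
```

Pointing the same function at the data manager's target directory after a download would give the figure that can be reported to users.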

This would be perfect.
This would be my idea.
This would be good anyway; otherwise we cannot answer this question for users.