Commit 4a8d78a

Add tests, use ena-webin-cli handler, refactor modules.conf and update docs (#39)
Features:

* replace `ena-webin-cli` with `ena-webin-cli handler`
* added `webin-cli.jar` download to reuse it for every upload
* webin-cli-wrapper outputs TSV with accessions which resolves #31
* FASTAVALIDATOR step added to genomesubmit (previously we only validated fasta in assemblysubmit)

Tests:

* multiple tests added along with snapshots and profiles to test `--mode mag` and `--mode metagenomic_assembly`, test data pushed to nf-datasets
* more info about tests #32 (comment)

Bug fixes:

* added `triggers` to only download DBs if there are data to process and no local DB provided
* create output folder for metadata CSV/TSV if it doesn't exist
* resolved OUT_OF_MEM problem in webin-cli-wrapper
* path in metadata CSV/TSV leading to http locations for remote files
* solves missing secrets issue #45
* solves problem with nf-tests on GitHub actions #45

Other:

* massive docs update
* refactor `modules.config` and clean up published results
* WEBIN_ACCOUNT renamed to ENA_WEBIN, WEBIN_PASSWORD to ENA_WEBIN_PASSWORD
1 parent 8c3e48d commit 4a8d78a

83 files changed

Lines changed: 2479 additions & 381 deletions

.github/actions/nf-test/action.yml

Lines changed: 6 additions & 0 deletions
@@ -56,6 +56,12 @@ runs:
         channel-priority: strict
         conda-remove-defaults: true
 
+    - name: Configure Nextflow secrets
+      shell: bash
+      run: |
+        nextflow secrets set ENA_WEBIN "$WEBIN_ACCOUNT"
+        nextflow secrets set ENA_WEBIN_PASSWORD "$WEBIN_PASSWORD"
+
     - name: Run nf-test
       shell: bash
       env:

CITATIONS.md

Lines changed: 34 additions & 0 deletions
@@ -14,6 +14,40 @@
 
 > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
 
+- [CoverM](https://github.com/wwood/CoverM)
+
+> Aroney ST, Newell RJ, Nissen JN, Camargo AP, Tyson GW, Woodcroft BJ. CoverM: Read alignment statistics for metagenomics. Bioinformatics. 2025;41(4):btaf147. doi: 10.1093/bioinformatics/btaf147. PubMed PMID: 40193404; PubMed Central PMCID: PMC11993303.
+
+- [CheckM2](https://github.com/chklovski/CheckM2)
+
+> Chklovski A, Parks DH, Woodcroft BJ, Tyson GW. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods. 2023;20(8):1203-1212. doi: 10.1038/s41592-023-01940-w. PubMed PMID: 37500759; PubMed Central PMCID: not available.
+
+- [CAT and BAT](https://doi.org/10.1186/s13059-019-1817-x)
+
+> von Meijenfeldt FAB, Arkhipova K, Cambuy DD, Coutinho FH, Dutilh BE. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biol. 2019;20(1):217. doi: 10.1186/s13059-019-1817-x. PubMed PMID: 31640809; PubMed Central PMCID: PMC6805573.
+
+- [tRNAscan-SE 2.0](https://doi.org/10.1093/nar/gkab688)
+
+> Chan PP, Lin BY, Mak AJ, Lowe TM. tRNAscan-SE 2.0: Improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 2021;49(16):9077-9096. doi: 10.1093/nar/gkab688. PubMed PMID: 34417604; PubMed Central PMCID: PMC8450103.
+
+- [barrnap](https://github.com/tseemann/barrnap)
+
+> Seemann T. Barrnap: rapid ribosomal RNA prediction. GitHub repository. https://github.com/tseemann/barrnap
+
+## Submission and helper tools
+
+- [ENA Webin-CLI](https://github.com/enasequence/webin-cli)
+
+> European Nucleotide Archive. Webin command line submission interface (Webin-CLI). GitHub repository. https://github.com/enasequence/webin-cli
+
+- [assembly_uploader](https://github.com/EBI-Metagenomics/assembly_uploader)
+
+> EBI Metagenomics. ENA Metagenome Assembly uploader. GitHub repository. https://github.com/EBI-Metagenomics/assembly_uploader
+
+- [genome_uploader](https://github.com/EBI-Metagenomics/genome_uploader)
+
+> EBI Metagenomics. ENA public Bins and MAGs uploader. GitHub repository. https://github.com/EBI-Metagenomics/genome_uploader
+
 ## Software packaging/containerisation tools
 
 - [Anaconda](https://anaconda.com)

README.md

Lines changed: 30 additions & 14 deletions
@@ -38,9 +38,9 @@ Currently, the pipeline supports three submission modes, each routed to a dedica
 
 Setup your environment secrets before running the pipeline:
 
-`nextflow secrets set WEBIN_ACCOUNT "Webin-XXX"`
+`nextflow secrets set ENA_WEBIN "Webin-XXX"`
 
-`nextflow secrets set WEBIN_PASSWORD "XXX"`
+`nextflow secrets set ENA_WEBIN_PASSWORD "XXX"`
 
 Make sure you update commands above with your authorised credentials.

@@ -55,43 +55,52 @@ The input must follow `assets/schema_input_genome.json`.
 Required columns:
 
 - `sample`
-- `fasta` (must end with `.fa.gz` or `.fasta.gz`)
+- `fasta` (must end with `.fa.gz`, `.fasta.gz`, or `.fna.gz`)
 - `accession`
 - `assembly_software`
 - `binning_software`
 - `binning_parameters`
-- `stats_generation_software`
 - `metagenome`
 - `environmental_medium`
 - `broad_environment`
 - `local_environment`
 - `co-assembly`
 
-Columns that required for now, but will be optional in the nearest future:
+At least one of the following must be provided per row:
 
+- reads (`fastq_1`, optional `fastq_2` for paired-end)
+- `genome_coverage`
+
+Additional supported columns:
+
+- `stats_generation_software`
 - `completeness`
 - `contamination`
-- `genome_coverage`
 - `RNA_presence`
 - `NCBI_lineage`
 
-Those fields are metadata required for [genome_uploader](https://github.com/EBI-Metagenomics/genome_uploader) package.
+If `genome_coverage`, `stats_generation_software`, `completeness`, `contamination`, `RNA_presence`, or `NCBI_lineage` are missing, the workflow can calculate or infer them when the required inputs are available.
+
+Those fields are metadata required for the [genome_uploader](https://github.com/EBI-Metagenomics/genome_uploader) package.
 
-Example `samplesheet_genome.csv`:
+Example `samplesheet_genomes.csv`:
 
 ```csv
-sample,fasta,accession,assembly_software,binning_software,binning_parameters,stats_generation_software,completeness,contamination,genome_coverage,metagenome,co-assembly,broad_environment,local_environment,environmental_medium,RNA_presence,NCBI_lineage
-lachnospira_eligens,data/bin_lachnospira_eligens.fa.gz,SRR24458089,spades_v3.15.5,metabat2_v2.6,default,CheckM2_v1.0.1,61.0,0.21,32.07,sediment metagenome,No,marine,cable_bacteria,marine_sediment,No,d__Bacteria;p__Proteobacteria;s_unclassified_Proteobacteria
+sample,fasta,accession,fastq_1,fastq_2,assembly_software,binning_software,binning_parameters,stats_generation_software,completeness,contamination,genome_coverage,metagenome,co-assembly,broad_environment,local_environment,environmental_medium,RNA_presence,NCBI_lineage
+lachnospira_eligens,data/bin_lachnospira_eligens.fa.gz,SRR24458089,,,spades_v3.15.5,metabat2_v2.6,default,CheckM2_v1.0.1,61.0,0.21,32.07,sediment metagenome,No,marine,cable_bacteria,marine_sediment,No,d__Bacteria;p__Proteobacteria;s__unclassified_Proteobacteria
 ```
 
+> [!IMPORTANT]
+> **Samplesheet column requirements**: All columns shown in the example above must be present in your samplesheet, even if some values are empty. Columns must be in exactly the same order as shown.
+
 ### `metagenomic_assemblies` mode (`ASSEMBLYSUBMIT`)
 
 The input must follow `assets/schema_input_assembly.json`.
 
 Required columns:
 
 - `sample`
-- `fasta` (must end with `.fa.gz` or `.fasta.gz`)
+- `fasta` (must end with `.fa.gz`, `.fasta.gz`, or `.fna.gz`)
 - `run_accession`
 - `assembler`
 - `assembler_version`
@@ -111,6 +120,9 @@ assembly_1,data/contigs_1.fasta.gz,data/reads_1.fastq.gz,data/reads_2.fastq.gz,,
 assembly_2,data/contigs_2.fasta.gz,,,42.7,ERR011323,MEGAHIT,1.2.9
 ```
 
+> [!IMPORTANT]
+> **Samplesheet column requirements**: All columns shown in the example above must be present in your samplesheet, even if some values are empty. Columns must be in exactly the same order as shown.
+
 ## Usage
 
 > [!NOTE]
@@ -122,6 +134,10 @@ All data submitted through this pipeline must be associated with an ENA study (p
 
 See the [usage documentation](docs/usage.md#submission-study) for more details.
 
+### Database setup (`CheckM2` and `CAT_pack`)
+
+The `mags`/`bins` workflow requires databases for completeness/contamination estimation and taxonomy assignment. See [Usage documentation](usage.md) for details.
+
 ### Required parameters:
 
 | Parameter | Description |
@@ -137,7 +153,7 @@ See the [usage documentation](docs/usage.md#submission-study) for more details.
 | Parameter | Description |
 | ------------------- | ---------------------------------------------------------------------------------------- |
 | `--upload_tpa` | Flag to control the type of assembly study (third party assembly or not). Default: false |
-| `--test_upload` | Upload to TEST ENA server instead of LIVE. Default: false |
+| `--test_upload` | Upload to TEST ENA server instead of LIVE. Default: true |
 | `--webincli_submit` | If set to false, submissions will be validated, but not submitted. Default: true |
 
 General command template:
@@ -202,8 +218,8 @@ For more details and further functionality, please refer to the [usage documenta
 
 Key output locations in `--outdir`:
 
-- `upload/manifests/`: generated manifest files for submission
-- `upload/webin_cli/`: ENA Webin CLI reports
+- `mags/` or `bins/`: genome metadata, manifests, and per-sample submission support files
+- `metagenomic_assemblies/`: assembly metadata CSVs and per-sample coverage files
 - `multiqc/`: MultiQC summary report
 - `pipeline_info/`: execution reports, trace, DAG, and software versions

assets/samplesheet_genomes.csv

Lines changed: 3 additions & 3 deletions
@@ -1,3 +1,3 @@
-sample,fasta,accession,fastq_1,fastq_2,assembly_software,binning_software,binning_parameters,stats_generation_software,completeness,contamination,genome_coverage,metagenome,co-assembly,broad_environment,local_environment,environmental_medium,rRNA_presence,NCBI_lineage
-lachnospira_eligens,https://github.com/nf-core/test-datasets/raw/seqsubmit/test_data/bins/bin_lachnospira_eligens.fa.gz,SRR24458089,spades_v3.15.5,mags_v1,default,CheckM2_v1.0.1,61.0,0.21,32.07,sediment metagenome,False,marine,cable bacteria,marine sediment,False,d__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobulbaceae;g__Candidatus Electrothrix;s__
-lachnospiraceae,https://github.com/nf-core/test-datasets/raw/seqsubmit/test_data/bins/bin_lachnospiraceae.fa.gz,SRR24458087,spades_v3.15.5,mags_v1,default,CheckM2_v1.0.1,92.81,1.09,66.04,sediment metagenome,False,marine,cable bacteria,marine sediment,False,d__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobulbaceae;g__Candidatus Electrothrix;s__Candidatus Electrothrix marina
+sample,fasta,accession,fastq_1,fastq_2,assembly_software,binning_software,binning_parameters,stats_generation_software,completeness,contamination,genome_coverage,metagenome,co-assembly,broad_environment,local_environment,environmental_medium,RNA_presence,NCBI_lineage
+lachnospira_eligens,https://github.com/nf-core/test-datasets/raw/seqsubmit/test_data/bins/bin_lachnospira_eligens.fa.gz,SRR24458089,,,spades_v3.15.5,mags_v1,default,CheckM2_v1.0.1,61.0,0.21,32.07,sediment metagenome,No,marine,cable bacteria,marine sediment,No,d__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobulbaceae;g__Candidatus Electrothrix;s__
+lachnospiraceae,https://github.com/nf-core/test-datasets/raw/seqsubmit/test_data/bins/bin_lachnospiraceae.fa.gz,SRR24458087,,,spades_v3.15.5,mags_v1,default,CheckM2_v1.0.1,92.81,1.09,66.04,sediment metagenome,No,marine,cable bacteria,marine sediment,No,d__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobulbaceae;g__Candidatus Electrothrix;s__Candidatus Electrothrix marina
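
Review note: the removed rows carried 17 values under a 19-column header because the two empty `fastq_1`/`fastq_2` fields were missing, which is the kind of mismatch the README's new column-order note guards against. A minimal Python sketch of such a check follows. It is not part of the pipeline, which validates against the JSON schema; `check_samplesheet` is a hypothetical helper, the rows are shortened placeholders, and the "reads or `genome_coverage`" rule is taken from the README text in this commit.

```python
import csv
import io

# Column order copied from the updated samplesheet header in this commit.
EXPECTED_COLUMNS = [
    "sample", "fasta", "accession", "fastq_1", "fastq_2",
    "assembly_software", "binning_software", "binning_parameters",
    "stats_generation_software", "completeness", "contamination",
    "genome_coverage", "metagenome", "co-assembly", "broad_environment",
    "local_environment", "environmental_medium", "RNA_presence", "NCBI_lineage",
]


def check_samplesheet(text):
    """Return a list of problems found in a genome samplesheet (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    if header != EXPECTED_COLUMNS:
        return ["header does not match the expected columns and order"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        record = dict(zip(header, row))
        # README rule: reads (fastq_1) or genome_coverage must be given per row.
        if not record["fastq_1"].strip() and not record["genome_coverage"].strip():
            problems.append(
                f"line {line_no}: neither fastq_1 nor genome_coverage is set"
            )
    return problems


header_line = ",".join(EXPECTED_COLUMNS)
tail = "sediment metagenome,No,marine,cable_bacteria,marine_sediment,No,d__Bacteria"
# Placeholder rows: a valid one, one missing the two empty fastq fields,
# and one with neither reads nor genome_coverage.
good_row = f"bin_a,data/bin_a.fa.gz,ACC_A,,,spades,metabat2,default,CheckM2,61.0,0.21,32.07,{tail}"
short_row = f"bin_b,data/bin_b.fa.gz,ACC_B,spades,metabat2,default,CheckM2,92.81,1.09,66.04,{tail}"
no_cov_row = f"bin_c,data/bin_c.fa.gz,ACC_C,,,spades,metabat2,default,CheckM2,80.0,0.5,,{tail}"

problems = check_samplesheet("\n".join([header_line, good_row, short_row, no_cov_row]))
for problem in problems:
    print(problem)
```

The helper stops at the first header mismatch because per-row checks are not meaningful once the columns are misaligned.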

assets/schema_input_assembly.json

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@
     "type": "string",
     "format": "file-path",
     "exists": true,
-    "pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.f(ast)?a\\.gz$",
-    "errorMessage": "FASTA file must be provided and have extension '.fa', '.fasta', '.fas', '.fna' (optionally gzipped)",
+    "pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.(fa|fasta|fna)\\.gz$",
+    "errorMessage": "FASTA file must be provided and have extension '.fa.gz', '.fasta.gz', '.fna.gz'",
     "description": "Metagenomic assembly FASTA file"
 },
 "fastq_1": {

assets/schema_input_genome.json

Lines changed: 3 additions & 2 deletions
@@ -17,8 +17,8 @@
     "type": "string",
     "format": "file-path",
     "exists": true,
-    "pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.f(ast)?a\\.gz$",
-    "errorMessage": "FASTA file for sequences 1 must be provided, cannot contain spaces and must have extension '.fa.gz' or '.fasta.gz'",
+    "pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.(fa|fasta|fna)\\.gz$",
+    "errorMessage": "FASTA file for sequences 1 must be provided, cannot contain spaces and must have extension '.fa.gz', '.fasta.gz', or '.fna.gz'",
     "description": "MAG/bin sequence file"
 },
 "accession": {
@@ -117,6 +117,7 @@
 "required": [
     "sample",
     "fasta",
+    "accession",
     "assembly_software",
     "co-assembly",
     "binning_software",
conf/base.config

Lines changed: 0 additions & 3 deletions
@@ -10,7 +10,6 @@
 
 process {
 
-    // TODO nf-core: Check the defaults for all processes
     cpus = { 1 * task.attempt }
     memory = { 6.GB * task.attempt }
     time = { 4.h * task.attempt }
@@ -24,8 +23,6 @@ process {
     // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
     // If possible, it would be nice to keep the same label naming convention when
     // adding in your local modules too.
-    // TODO nf-core: Customise requirements for specific processes.
-    // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
     withLabel:process_single {
         cpus = { 1 }
         memory = { 6.GB * task.attempt }
