Merge pull request #1 from sanger-pathogens/PAT-2300_add_mixed_input
PAT-2300 Add mixed input
Lfulcrum authored Feb 5, 2025
2 parents a98bd89 + 76d750a commit c119b09
Showing 7 changed files with 156 additions and 18 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "assorted-sub-workflows"]
path = assorted-sub-workflows
url = https://github.com/sanger-pathogens/assorted-sub-workflows.git
85 changes: 75 additions & 10 deletions README.md
@@ -85,28 +85,92 @@ It is recommended to have at least 16GB of RAM and 100GB of free storage
> - The pipeline generates ~1.8GB intermediate files for each sample on average
> - These files can be removed when the pipeline run is completed, please refer to [Clean Up](#clean-up)
> - To further reduce storage requirement by sacrificing the ability to resume the pipeline, please refer to [Experimental](#experimental)
## Accepted Inputs
- Only Illumina paired-end short reads are supported
- Any combination of the following input options is supported:
1. `--reads`:
Specify a directory of per-sample paired (gzipped) fastq files containing reads, named according to the pattern `*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}`:
- example 1: `SampleName_R1_001.fastq.gz`, `SampleName_R2_001.fastq.gz`
- example 2: `SampleName_1.fastq.gz`, `SampleName_2.fastq.gz`
- example 3: `SampleName_R1.fq`, `SampleName_R2.fq`
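A hypothetical invocation (the `./run_pipeline` wrapper is taken from the pipeline's help message; the directory path is illustrative):
```
./run_pipeline --reads /path/to/fastq_directory
```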

2. `--manifest_of_reads` or `--manifest`:
Specify the paths to (gzipped) fastq files containing reads via a CSV manifest that lists the pair of read files pertaining to each sample, one sample per row, as sketched below.
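A minimal sketch of such a manifest, assuming the `ID,R1,R2` headings given in the pipeline's help message (file paths hypothetical):
```
ID,R1,R2
SampleA,/data/SampleA_R1.fastq.gz,/data/SampleA_R2.fastq.gz
SampleB,/data/SampleB_1.fq.gz,/data/SampleB_2.fq.gz
```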

3. **iRODS attribute parameters** (Sanger HPC only):
Specify a combination of iRODS attributes to search for reads to use as pipeline input.

The selected set of data files is defined by a combination of parameters: `--studyid`, `--runid`, `--laneid`, `--plexid`, `--target` and `--type` (these refer to specifics of the sequencing experiment and data to be retrieved).

Each parameter restricts the set of data files that match and will be downloaded. With the exception of `--type` and `--target`, omitting an option causes samples for all possible values of the parameter to be retrieved.

Either `--studyid` or `--runid` is required, while `--laneid`, `--plexid`, `--target` and `--type` are optional; requiring a study or run ID avoids indiscriminately and unintentionally downloading thousands of files.
```
--studyid
default: -1
Sequencing Study ID
--runid
default: -1
Sequencing Run ID
--laneid
default: -1
Sequencing Lane ID
--plexid
default: -1
Sequencing Plex ID
--target
default: 1
Marker of key data product likely to be of interest to customer
--type
default: cram
File type
```
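For example, a hypothetical search restricted to a single lane of one run (IDs illustrative, reusing the run ID from the manifest example below):
```
./run_pipeline --runid 37822 --laneid 2
```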
4. `--manifest_of_lanes` (Sanger HPC only):
Specify a CSV manifest listing a batch of iRODS parameter combinations.
Valid column headings include the individual parameter options described above: `studyid`, `runid`, `laneid`, `plexid`, or any other iRODS metadata attribute, e.g. `sample_common_name`, `sample_supplier_name`.
Corresponding fields in the CSV manifest file can be left blank.
`laneid` and `plexid` are only considered when provided alongside a `studyid` or `runid`.
- example 1:
```
studyid,runid,laneid,plexid
,37822,2,354
5970,37822,,332
5970,37822,2,
```
- example 2:
```
sample_common_name,type,target
Romboutsia lituseburensis,cram,1
Romboutsia lituseburensis,cram,0
```
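A hypothetical launch using such a manifest (file name illustrative):
```
./run_pipeline --manifest_of_lanes lanes.csv
```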
## Setup
> [!WARNING]
> - Docker or Singularity must be running
> - An Internet connection is required
1. Clone the repository (`git` must be installed on your system)
```
git clone --recurse-submodules https://github.com/GlobalPneumoSeq/gps-pipeline.git
```
> Note: The pipeline depends on git submodules. If you don't clone with `--recurse-submodules`, you can correct this with `git submodule update --init`.
To use a particular version of this pipeline, navigate into the root directory of the cloned repository and check out a particular branch or tag:
```
cd gps-pipeline
git checkout <tag/branch>
```
Alternatively, download and unzip/extract the [latest release](https://github.com/GlobalPneumoSeq/gps-pipeline/releases). See [Releases/Tags](./releases) and [Branches](./branches) for available versions.
2. Go into the local directory of the pipeline; it is ready to use without installation (the directory name might differ):
```
cd gps-pipeline
```
3. (Optional) Perform an initialisation to download all required additional files and container images, so that the pipeline can subsequently be used with or without an Internet connection.
- Using Docker as the container engine
@@ -153,6 +217,7 @@ It is recommended to have at least 16GB of RAM and 100GB of free storage
| `standard`<br> (Default) | Docker is used as the container engine. <br> Processes are executed locally. |
| `singularity` | Singularity is used as the container engine. <br> Processes are executed locally. |
| `lsf` | **The pipeline should be launched from an LSF cluster head node with this profile.** <br>Singularity is used as the container engine. <br> Processes are submitted to your LSF cluster via `bsub` by the pipeline. <br> (Tested on Wellcome Sanger Institute farm5 LSF cluster only) <br> (Option `--kraken2_memory_mapping` default changes to `false`.) |
| `sanger` | **Only required for Sanger HPC cluster.** <br>Intended to be used in combination with `lsf` profile. |
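For instance, assuming the launcher forwards Nextflow's standard `-profile` flag, the Sanger HPC combination might be selected as follows (hypothetical invocation):
```
./run_pipeline -profile lsf,sanger
```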
## Resume
> [!TIP]
1 change: 1 addition & 0 deletions assorted-sub-workflows
Submodule assorted-sub-workflows added at e966ed
9 changes: 5 additions & 4 deletions modules/messages.nf
@@ -28,10 +28,11 @@ void helpMessage() {
|./run_pipeline [option] [value]
|
|All options are optional, some common options:
|--reads [PATH] Path to the input directory that contains the reads to be processed
|--manifest [PATH] Path to the input CSV (headings: ID,R1,R2), listing a pair of (gzipped) fastq files pertaining to a sample, one per row
|--output [PATH] Path to the output directory where the results are saved
|--init Alternative workflow for initialisation
|--version Alternative workflow for getting versions of pipeline, container images, tools and databases
|
|For all available options, please refer to README.md
'''.stripMargin()
40 changes: 40 additions & 0 deletions modules/validate.nf
@@ -1,3 +1,41 @@
// Map of valid parameters for which to skip validation
skipValidationParams = [
// From common config
input: 'skip',
tracedir: 'skip',
max_memory: 'skip',
max_cpus: 'skip',
max_time: 'skip',
max_retries: 'skip',
retry_strategy: 'skip',
queue_size: 'skip',
submit_rate_limit: 'skip',
// From mixed input config
outdir: 'skip',
manifest_of_reads: 'skip',
manifest_of_lanes: 'skip',
manifest: 'skip',
save_metadata: 'skip',
combine_same_id_crams: 'skip',
dehumanising_method: 'skip',
cleanup_intermediate_files_irods_extractor: 'skip',
save_fastqs: 'skip',
save_method: 'skip',
raw_reads_prefix: 'skip',
preexisting_fastq_tag: 'skip',
split_sep_for_ID_from_fastq: 'skip',
lane_plex_sep: 'skip',
start_queue: 'skip',
irods_subset_to_skip: 'skip',
short_metacsv_name: 'skip',
studyid: 'skip',
runid: 'skip',
laneid: 'skip',
plexid: 'skip',
target: 'skip',
type: 'skip'
]

// Map of valid parameters and their value types
validParams = [
help: 'boolean',
@@ -31,6 +69,8 @@ validParams = [
lite: 'boolean'
]

validParams += skipValidationParams

// Validate whether all provided parameters are valid
void validate(Map params) {
// Ensure only one or none of the alternative workflows is selected
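A minimal sketch of how a validator might honour these `'skip'` markers, assuming a type-checking loop over `validParams` (illustrative only; this is not the pipeline's actual `validate` implementation):
```groovy
// Hypothetical sketch, not part of this commit: treat 'skip' entries as
// always valid, reject unknown names, and type-check the rest.
void validateSketch(Map params) {
    def invalid = []
    params.each { name, value ->
        def expectedType = validParams[name]
        if (expectedType == null) {
            invalid << name   // unknown parameter
        } else if (expectedType == 'skip') {
            return            // declared valid; no type check performed
        } else if (expectedType == 'boolean' && !(value instanceof Boolean)) {
            invalid << name   // wrong value type
        }
    }
    if (invalid) {
        log.error("Invalid parameter(s): ${invalid.join(', ')}")
        System.exit(1)
    }
}
```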
18 changes: 17 additions & 1 deletion nextflow.config
@@ -1,5 +1,10 @@
nextflow.enable.dsl=2

// Import mixed input params
includeConfig "https://raw.githubusercontent.com/sanger-pathogens/nextflow-commons/refs/heads/master/configs/nextflow.config"
includeConfig "$projectDir/assorted-sub-workflows/irods_extractor/subworkflows/irods.config"
includeConfig "$projectDir/assorted-sub-workflows/mixed_input/subworkflows/mixed_input.config"

// Default parameters that can be overridden
params {
// Show help message
@@ -13,6 +18,8 @@ params {
reads = "$projectDir/input"
// Default output directory
output = "$projectDir/output"
// To allow mixed input to work without warnings
outdir = output

// Default databases directory for saving all the required databases
db = "$projectDir/databases"
@@ -63,8 +70,11 @@ params {
lite = false
}

process {
// Avoid use of `-o pipefail` as this fails version info collation
shell = ['/bin/bash', '-eu']

// Set auto-retry and process container images
maxRetries = 2
errorStrategy = { task.attempt <= process.maxRetries ? 'retry' : 'ignore' }

@@ -192,4 +202,10 @@
}
}

// Profile for Sanger
sanger {
singularity {
runOptions = '--bind /lustre,/nfs,/data,/software,/tmp'
}
}
}
18 changes: 15 additions & 3 deletions workflows/pipeline.nf
@@ -10,6 +10,9 @@ include { MLST } from "$projectDir/modules/mlst"
include { PBP_RESISTANCE; PARSE_PBP_RESISTANCE; GET_ARIBA_DB; OTHER_RESISTANCE; PARSE_OTHER_RESISTANCE } from "$projectDir/modules/amr"
include { GENERATE_SAMPLE_REPORT; GENERATE_OVERALL_REPORT } from "$projectDir/modules/output"

// Import subworkflows
include { MIXED_INPUT } from "$projectDir/assorted-sub-workflows/mixed_input/mixed_input"

// Main pipeline workflow
workflow PIPELINE {
main:
@@ -29,8 +32,17 @@ workflow PIPELINE {
// Get path to ARIBA database, generate from reference sequences and metadata if necessary
GET_ARIBA_DB(params.ariba_ref, params.ariba_metadata, params.db)

// Obtain input from manifests and iRODS params
MIXED_INPUT
| map { meta, R1, R2 -> [meta.ID, [R1, R2]] }
| set { raw_read_pairs_ch }

// Get read pairs into Channel raw_read_pairs_ch
if (params.reads) {
Channel.fromFilePairs("$params.reads/*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}", checkIfExists: true)
| mix(raw_read_pairs_ch)
| set { raw_read_pairs_ch }
}

// Basic input files validation
// Output into Channel FILE_VALIDATION.out.result
@@ -114,7 +126,7 @@
// Merge Channels FILE_VALIDATION.out.result & READ_QC.out.result & ASSEMBLY_QC.out.result & MAPPING_QC.out.result & TAXONOMY_QC.out.result to provide Overall QC Status
// Output into Channel OVERALL_QC.out.result & OVERALL_QC.out.report
OVERALL_QC(
raw_read_pairs_ch.map{ [it[0]] }
.join(FILE_VALIDATION.out.result, failOnDuplicate: true, remainder: true)
.join(READ_QC.out.result, failOnDuplicate: true, remainder: true)
.join(ASSEMBLY_QC.out.result, failOnDuplicate: true, remainder: true)
@@ -161,7 +173,7 @@

// Generate sample reports by merging outputs from all result-generating modules
GENERATE_SAMPLE_REPORT(
raw_read_pairs_ch.map{ [it[0]] }
.join(READ_QC.out.report, failOnDuplicate: true, remainder: true)
.join(ASSEMBLY_QC.out.report, failOnDuplicate: true, remainder: true)
.join(MAPPING_QC.out.report, failOnDuplicate: true, remainder: true)
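For orientation, the mixed-input wiring earlier in this file (`MIXED_INPUT` piped into `raw_read_pairs_ch`, optionally mixed with `Channel.fromFilePairs`) assumes both sources emit tuples shaped `[sample_id, [R1, R2]]`, which is the default shape produced by `Channel.fromFilePairs` and the shape constructed by the `map` on `MIXED_INPUT`. A minimal illustration (sample name and file names hypothetical):
```groovy
// Hypothetical illustration of the merged channel's element shape; not part of this commit
Channel.of(['SampleA', ['SampleA_R1.fastq.gz', 'SampleA_R2.fastq.gz']])
    .view { id, reads -> "id=${id} reads=${reads}" }
```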
