Merge pull request #1 from sanger-pathogens/PAT-2300_add_mixed_input
PAT-2300 Add mixed input
Lfulcrum authored Feb 5, 2025
2 parents a98bd89 + 76d750a commit c119b09
Showing 7 changed files with 156 additions and 18 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "assorted-sub-workflows"]
path = assorted-sub-workflows
url = https://github.com/sanger-pathogens/assorted-sub-workflows.git
85 changes: 75 additions & 10 deletions README.md
@@ -85,28 +85,92 @@ It is recommended to have at least 16GB of RAM and 100GB of free storage
> - The pipeline generates ~1.8GB intermediate files for each sample on average
> - These files can be removed when the pipeline run is completed, please refer to [Clean Up](#clean-up)
> - To further reduce storage requirement by sacrificing the ability to resume the pipeline, please refer to [Experimental](#experimental)
## Accepted Inputs
- Only Illumina paired-end short reads are supported
- Any combination of the following input options is supported:
1. `--reads`:
Specify a directory of per-sample paired (gzipped) fastq files containing reads, named according to the pattern `*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}`:
- example 1: `SampleName_R1_001.fastq.gz`, `SampleName_R2_001.fastq.gz`
- example 2: `SampleName_1.fastq.gz`, `SampleName_2.fastq.gz`
- example 3: `SampleName_R1.fq`, `SampleName_R2.fq`
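A hypothetical invocation (the `./run_pipeline` wrapper is taken from the pipeline's help message; the directory path is illustrative):
```
./run_pipeline --reads /path/to/fastq_directory
```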

2. `--manifest_of_reads` or `--manifest`:
Specify the paths to (gzipped) fastq files containing reads via a CSV manifest that lists the pair of read files pertaining to each sample, one sample per row, as sketched below.
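A minimal sketch of such a manifest, assuming the `ID,R1,R2` headings given in the pipeline's help message (file paths hypothetical):
```
ID,R1,R2
SampleA,/data/SampleA_R1.fastq.gz,/data/SampleA_R2.fastq.gz
SampleB,/data/SampleB_1.fq.gz,/data/SampleB_2.fq.gz
```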

3. **iRODS attribute parameters** (Sanger HPC only):
Specify a combination of iRODS attributes to search for reads to use as pipeline input.

The selected set of data files is defined by a combination of parameters: `--studyid`, `--runid`, `--laneid`, `--plexid`, `--target` and `--type` (these refer to specifics of the sequencing experiment and data to be retrieved).

Each parameter restricts the set of data files that match and will be downloaded. With the exception of `--type` and `--target`, omitting an option causes samples for all possible values of the parameter to be retrieved.

Either `--studyid` or `--runid` is required, while `--laneid`, `--plexid`, `--target` and `--type` are optional; requiring a study or run ID avoids indiscriminately and unintentionally downloading thousands of files.
```
--studyid
default: -1
Sequencing Study ID
--runid
default: -1
Sequencing Run ID
--laneid
default: -1
Sequencing Lane ID
--plexid
default: -1
Sequencing Plex ID
--target
default: 1
Marker of key data product likely to be of interest to customer
--type
default: cram
File type
```
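For example, a hypothetical search restricted to a single lane of one run (IDs illustrative, reusing the run ID from the manifest example below):
```
./run_pipeline --runid 37822 --laneid 2
```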
4. `--manifest_of_lanes` (Sanger HPC only):
Specify a CSV manifest listing a batch of iRODS parameter combinations.
Valid column headings include the individual parameter options described above: `studyid`, `runid`, `laneid`, `plexid`, or any other iRODS metadata attribute, e.g. `sample_common_name`, `sample_supplier_name`.
Corresponding fields in the CSV manifest file can be left blank.
`laneid` and `plexid` are only considered when provided alongside a `studyid` or `runid`.
- example 1:
```
studyid,runid,laneid,plexid
,37822,2,354
5970,37822,,332
5970,37822,2,
```
- example 2:
```
sample_common_name,type,target
Romboutsia lituseburensis,cram,1
Romboutsia lituseburensis,cram,0
```
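A hypothetical launch using such a manifest (file name illustrative):
```
./run_pipeline --manifest_of_lanes lanes.csv
```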
## Setup
> [!WARNING]
> - Docker or Singularity must be running
> - An Internet connection is required
1. Clone the repository (`git` must be installed on your system)
```
git clone --recurse-submodules https://github.com/GlobalPneumoSeq/gps-pipeline.git
```
> Note: The pipeline depends on git submodules. If you don't clone with `--recurse-submodules`, you can correct this with `git submodule update --init`.
To use a particular version of this pipeline, navigate into the root directory of the cloned repository and check out a particular branch or tag:
```
cd gps-pipeline
git checkout <tag/branch>
```
Alternatively, download and unzip/extract the [latest release](https://github.com/GlobalPneumoSeq/gps-pipeline/releases). See [Releases/Tags](./releases) and [Branches](./branches) for available versions.
2. Go into the local directory of the pipeline; it is ready to use without installation (the directory name might differ):
```
cd gps-pipeline
```
3. (Optional) Perform an initialisation to download all required additional files and container images, so that the pipeline can subsequently be used with or without an Internet connection.
- Using Docker as the container engine
@@ -153,6 +217,7 @@ It is recommended to have at least 16GB of RAM and 100GB of free storage
| `standard`<br> (Default) | Docker is used as the container engine. <br> Processes are executed locally. |
| `singularity` | Singularity is used as the container engine. <br> Processes are executed locally. |
| `lsf` | **The pipeline should be launched from an LSF cluster head node with this profile.** <br>Singularity is used as the container engine. <br> Processes are submitted to your LSF cluster via `bsub` by the pipeline. <br> (Tested on Wellcome Sanger Institute farm5 LSF cluster only) <br> (Option `--kraken2_memory_mapping` default changes to `false`.) |
| `sanger` | **Only required for Sanger HPC cluster.** <br>Intended to be used in combination with `lsf` profile. |
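For instance, assuming the launcher forwards Nextflow's standard `-profile` flag, the Sanger HPC combination might be selected as follows (hypothetical invocation):
```
./run_pipeline -profile lsf,sanger
```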
## Resume
> [!TIP]
1 change: 1 addition & 0 deletions assorted-sub-workflows
Submodule assorted-sub-workflows added at e966ed
9 changes: 5 additions & 4 deletions modules/messages.nf
@@ -28,10 +28,11 @@ void helpMessage() {
|./run_pipeline [option] [value]
|
|All options are optional, some common options:
|--reads [PATH] Path to the input directory that contains the reads to be processed
|--manifest [PATH] Path to the input CSV (headings: ID,R1,R2), listing a pair of (gzipped) fastq files pertaining to a sample, one per row
|--output [PATH] Path to the output directory where the results are saved
|--init Alternative workflow for initialisation
|--version Alternative workflow for getting versions of pipeline, container images, tools and databases
|
|For all available options, please refer to README.md
'''.stripMargin()
40 changes: 40 additions & 0 deletions modules/validate.nf
@@ -1,3 +1,41 @@
// Map of valid parameters for which to skip validation
skipValidationParams = [
// From common config
input: 'skip',
tracedir: 'skip',
max_memory: 'skip',
max_cpus: 'skip',
max_time: 'skip',
max_retries: 'skip',
retry_strategy: 'skip',
queue_size: 'skip',
submit_rate_limit: 'skip',
// From mixed input config
outdir: 'skip',
manifest_of_reads: 'skip',
manifest_of_lanes: 'skip',
manifest: 'skip',
save_metadata: 'skip',
combine_same_id_crams: 'skip',
dehumanising_method: 'skip',
cleanup_intermediate_files_irods_extractor: 'skip',
save_fastqs: 'skip',
save_method: 'skip',
raw_reads_prefix: 'skip',
preexisting_fastq_tag: 'skip',
split_sep_for_ID_from_fastq: 'skip',
lane_plex_sep: 'skip',
start_queue: 'skip',
irods_subset_to_skip: 'skip',
short_metacsv_name: 'skip',
studyid: 'skip',
runid: 'skip',
laneid: 'skip',
plexid: 'skip',
target: 'skip',
type: 'skip'
]

// Map of valid parameters and their value types
validParams = [
help: 'boolean',
@@ -31,6 +69,8 @@ validParams = [
lite: 'boolean'
]

validParams += skipValidationParams

// Validate whether all provided parameters are valid
void validate(Map params) {
// Ensure only one or none of the alternative workflows is selected
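A minimal sketch of how a validator might honour these `'skip'` markers, assuming a type-checking loop over `validParams` (illustrative only; this is not the pipeline's actual `validate` implementation):
```groovy
// Hypothetical sketch, not part of this commit: treat 'skip' entries as
// always valid, reject unknown names, and type-check the rest.
void validateSketch(Map params) {
    def invalid = []
    params.each { name, value ->
        def expectedType = validParams[name]
        if (expectedType == null) {
            invalid << name   // unknown parameter
        } else if (expectedType == 'skip') {
            return            // declared valid; no type check performed
        } else if (expectedType == 'boolean' && !(value instanceof Boolean)) {
            invalid << name   // wrong value type
        }
    }
    if (invalid) {
        log.error("Invalid parameter(s): ${invalid.join(', ')}")
        System.exit(1)
    }
}
```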
18 changes: 17 additions & 1 deletion nextflow.config
@@ -1,5 +1,10 @@
nextflow.enable.dsl=2

// Import mixed input params
includeConfig "https://raw.githubusercontent.com/sanger-pathogens/nextflow-commons/refs/heads/master/configs/nextflow.config"
includeConfig "$projectDir/assorted-sub-workflows/irods_extractor/subworkflows/irods.config"
includeConfig "$projectDir/assorted-sub-workflows/mixed_input/subworkflows/mixed_input.config"

// Default parameters that can be overridden
params {
// Show help message
@@ -13,6 +18,8 @@ params {
reads = "$projectDir/input"
// Default output directory
output = "$projectDir/output"
// To allow mixed input to work without warnings
outdir = output

// Default databases directory for saving all the required databases
db = "$projectDir/databases"
@@ -63,8 +70,11 @@ params {
lite = false
}

process {
// Avoid use of `-o pipefail` as this fails version info collation
shell = ['/bin/bash', '-eu']

// Set auto-retry and process container images
maxRetries = 2
errorStrategy = { task.attempt <= process.maxRetries ? 'retry' : 'ignore' }

@@ -192,4 +202,10 @@
}
}

// Profile for Sanger
sanger {
singularity {
runOptions = '--bind /lustre,/nfs,/data,/software,/tmp'
}
}
}
18 changes: 15 additions & 3 deletions workflows/pipeline.nf
@@ -10,6 +10,9 @@ include { MLST } from "$projectDir/modules/mlst"
include { PBP_RESISTANCE; PARSE_PBP_RESISTANCE; GET_ARIBA_DB; OTHER_RESISTANCE; PARSE_OTHER_RESISTANCE } from "$projectDir/modules/amr"
include { GENERATE_SAMPLE_REPORT; GENERATE_OVERALL_REPORT } from "$projectDir/modules/output"

// Import subworkflows
include { MIXED_INPUT } from "$projectDir/assorted-sub-workflows/mixed_input/mixed_input"

// Main pipeline workflow
workflow PIPELINE {
main:
@@ -29,8 +32,17 @@ workflow PIPELINE {
// Get path to ARIBA database, generate from reference sequences and metadata if necessary
GET_ARIBA_DB(params.ariba_ref, params.ariba_metadata, params.db)

// Obtain input from manifests and iRODS params
MIXED_INPUT
| map { meta, R1, R2 -> [meta.ID, [R1, R2]] }
| set { raw_read_pairs_ch }

// Get read pairs into Channel raw_read_pairs_ch
if (params.reads) {
Channel.fromFilePairs("$params.reads/*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}", checkIfExists: true)
| mix(raw_read_pairs_ch)
| set { raw_read_pairs_ch }
}

// Basic input files validation
// Output into Channel FILE_VALIDATION.out.result
@@ -114,7 +126,7 @@
// Merge Channels FILE_VALIDATION.out.result & READ_QC.out.result & ASSEMBLY_QC.out.result & MAPPING_QC.out.result & TAXONOMY_QC.out.result to provide Overall QC Status
// Output into Channel OVERALL_QC.out.result & OVERALL_QC.out.report
OVERALL_QC(
raw_read_pairs_ch.map{ [it[0]] }
.join(FILE_VALIDATION.out.result, failOnDuplicate: true, remainder: true)
.join(READ_QC.out.result, failOnDuplicate: true, remainder: true)
.join(ASSEMBLY_QC.out.result, failOnDuplicate: true, remainder: true)
@@ -161,7 +173,7 @@

// Generate sample reports by merging outputs from all result-generating modules
GENERATE_SAMPLE_REPORT(
raw_read_pairs_ch.map{ [it[0]] }
.join(READ_QC.out.report, failOnDuplicate: true, remainder: true)
.join(ASSEMBLY_QC.out.report, failOnDuplicate: true, remainder: true)
.join(MAPPING_QC.out.report, failOnDuplicate: true, remainder: true)
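For orientation, the mixed-input wiring earlier in this file (`MIXED_INPUT` piped into `raw_read_pairs_ch`, optionally mixed with `Channel.fromFilePairs`) assumes both sources emit tuples shaped `[sample_id, [R1, R2]]`, which is the default shape produced by `Channel.fromFilePairs` and the shape constructed by the `map` on `MIXED_INPUT`. A minimal illustration (sample name and file names hypothetical):
```groovy
// Hypothetical illustration of the merged channel's element shape; not part of this commit
Channel.of(['SampleA', ['SampleA_R1.fastq.gz', 'SampleA_R2.fastq.gz']])
    .view { id, reads -> "id=${id} reads=${reads}" }
```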
