nf-mirna is a complete pipeline to process, align and analyse deep sequencing miRNA reads. It is based on the nf-core pipeline smrnaseq v2.4.0 and is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures. It can run with Docker or Singularity containers, which makes installation straightforward and results highly reproducible.
- Quality check and trimming
  - Raw read QC (fastqc)
  - Adapter trimming, miRNA QC, and FASTQ to FASTA conversion (miRTrace)
- miRNA quantification
  - Alignment against miRNA mature reference (bowtie)
  - Quantification of miRNA counts from mature alignment (samtools idxstats)
  - Alignment of unmapped reads to genome reference (bowtie) (Optional)
  - Quantification of miRNA counts from genome alignment (htseq-count) (Optional)
- Novel miRNAs discovery
  - Mapping against reference genome and novel miRNA discovery with the miRDeep2 module (miRDeep2) (Optional)
- Summary results and QCs (multiqc)
- Summary of pipeline execution (nextflow)
You can test the pipeline as follows:
nextflow run nf-mirna \
-profile <test,test_genome>,singularity \
--outdir <OUTDIR>
To use the pipeline with your own data, first prepare a samplesheet.csv with your input data that looks as follows:
sample,fastq_1
sample_1,10004_S37_R1_001.fastq.gz
sample_2,1006_S18_R1_001.fastq.gz
sample_3,4025_S11_R1_001.fastq.gz
sample_4,2001_S25_R1_001.fastq.gz
Each row represents a fastq file (single-end). Now, you can run the pipeline using:
nextflow run nf-mirna \
-profile <singularity,docker>,<protocol> ... \
--input samplesheet.csv \
--genome 'path/to/genome[.fa|.fa.gz]' \
--genome_index 'path/to/genome_index[dir|.tar.gz]' \
--mirna_gtf 'path/to/mirna.gtf' \
--outdir <OUTDIR>
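The samplesheet described above can also be generated instead of written by hand. The following is a minimal Python sketch, not part of the pipeline: the function name `build_samplesheet`, the `*.fastq.gz` file pattern and the `sample_N` naming scheme are assumptions made for illustration.

```python
import csv
from pathlib import Path


def build_samplesheet(fastq_dir, out_csv="samplesheet.csv"):
    """Write one samplesheet row per FASTQ file and return the row count."""
    # Assumption: single-end reads named *.fastq.gz, all in one directory.
    fastqs = sorted(Path(fastq_dir).glob("*.fastq.gz"))
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["sample", "fastq_1"])  # header shown in this README
        for index, fastq in enumerate(fastqs, start=1):
            # Hypothetical naming scheme: sample_1, sample_2, ...
            writer.writerow([f"sample_{index}", str(fastq)])
    return len(fastqs)
```

Calling `build_samplesheet("path/to/fastqs")` writes `samplesheet.csv` in the current directory, with the FASTQ paths exactly as they were found. Rename the samples afterwards if you need meaningful identifiers.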
For an extended summary of all pipeline parameters, run nextflow run nf-mirna --help.
Parameter | Description | Type | Defaults |
---|---|---|---|
Input / Output options | | | |
--input | Path to comma-separated file containing information about the samples in the experiment. | string | - |
--outdir | The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure. | string | ./results/ |
--save_intermediates | Save all intermediate files (e.g. fastq, bams) of all steps of the pipeline to the output directory. | boolean | false |
miRTrace options | | | |
--mirtrace_protocol | Protocol to use for miRTrace QC. Must be one of illumina or nextflex. | string | - |
--mirtrace_species | Species to use for miRTrace QC, see mirtrace --list-species for available options. | string | hsa |
--three_prime_adapter | 3' adapter sequence to use for trimming in miRTrace QC. | string | - |
--mirtrace_title | Custom title for miRTrace report. | string | - |
--mirtrace_comment | Custom comment for miRTrace report. | string | - |
Alignment options | | | |
--mature | Path to FASTA file with mature miRNAs. Typically this will be the mature.fa file from miRBase. Can be given either as a plain text .fa file or a compressed .gz file. | string | miRBase.org/mature.fa |
--hairpin | Path to FASTA file with miRNA precursors. Typically this will be the hairpin.fa file from miRBase. Can be given either as a plain text .fa file or a compressed .gz file. Only required for the miRDeep2 module. | string | miRBase.org/hairpin.fa |
--mirna_gtf | Path to GTF file with miRNA genomic coordinates. Only required for the genome alignment step. Usually a miRBase .gff3 file, typically downloaded from miRBase.org. | string | - |
--genome_index | Path to the genome Bowtie1 index. This should either be a directory containing the genome index files generated by bowtie-build or its .tar.gz compressed version. | string | - |
--genome | Path to the genome FASTA file. Can be given either as a plain text .fa file or a compressed .gz file. Will be used to create a genome index if none is provided and to run the miRDeep2 module. | string | - |
miRDeep2 options | | | |
--mirdeep_mirna_other | Path to FASTA file with other miRNAs. This file should be the pooled known mature sequences for 1-5 species closely related to the species being analyzed. Can be given either as a plain text .fa file or a compressed .gz file. | string | - |
--mirdeep_randfold | Whether to run miRDeep2 with randfold analysis. | boolean | true |
--mirdeep_mirbase_v18 | Whether the mature reference files contain miRBase v18 identifiers (5p and 3p) instead of previous ids from v17. | boolean | true |
--mirdeep_pdfs | Whether to generate report PDFs. | boolean | false |
Skipping pipeline steps | | | |
--skip_fastqc | Skips FastQC module. | boolean | false |
--skip_genome | Skips genome alignment step. | boolean | false |
--skip_mirdeep | Skips miRDeep2 module. | boolean | false |
--skip_multiqc | Skips MultiQC module. | boolean | false |
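A few of the constraints in the parameter table can be checked before a run is launched. The sketch below is only an illustration: `build_command` is a hypothetical helper, the requirement checks reflect my reading of the table ("Only required for the genome alignment step") rather than the pipeline's own validation, and it relies on Nextflow reading `--flag true` and `--flag false` as booleans.

```python
ALLOWED_PROTOCOLS = ("illumina", "nextflex")


def build_command(params, profile="singularity"):
    """Return the argument list for a `nextflow run nf-mirna` call."""
    protocol = params.get("mirtrace_protocol")
    if protocol is not None and protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(f"mirtrace_protocol must be one of {ALLOWED_PROTOCOLS}")
    if not params.get("skip_genome", False):
        # Assumption: the genome step needs a GTF plus a FASTA or an index.
        if not params.get("mirna_gtf"):
            raise ValueError("mirna_gtf is required unless skip_genome is set")
        if not (params.get("genome") or params.get("genome_index")):
            raise ValueError("genome or genome_index is required unless skip_genome is set")
    argv = ["nextflow", "run", "nf-mirna", "-profile", profile]
    for name, value in params.items():
        if value is None:
            continue  # leave the pipeline default untouched
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv += [f"--{name}", str(value)]
    return argv
```

The returned list can be passed to `subprocess.run(..., check=True)`. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.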
A normal run of the pipeline will generate a results directory structure similar to the following:
results/
├── fastqc # raw reads QC
├── mirtrace # miRNA QC
├── bowtie
│ ├── mature # results of bowtie alignment against mature ref
│ └── genome # results of bowtie alignment against genome
├── mirna_quant # miRNA raw counts of mature alignment
├── genome_quant # miRNA raw counts of genome alignment
├── mirdeep2 # novel miRNA discovery results
├── multiqc # summary reports of pipeline steps
└── pipeline_info # nextflow pipeline execution reports
The directory fastqc will contain the QC of the raw FASTQ files.
Output directory: results/fastqc/
- {sample.id}_fastqc.html: FastQC report containing quality metrics.
- {sample.id}_fastqc.zip: Zip archive containing the FastQC reports, tab-delimited data and plot images.
The directory mirtrace will contain an output directory for every sample provided to the pipeline (named after the sample column in samplesheet.csv).
Output directory: results/mirtrace/{sample.id}/
- mirtrace.log: The log of the miRTrace command run.
- mirtrace-report.html: An interactive HTML report summarizing all output statistics from miRTrace.
- mirtrace-results.json: A JSON file with all output statistics from miRTrace.
- mirtrace-stats*.tsv: Tab-separated statistic files.
- qc_passed_reads.all.uncollapse/{sample.id}.mirtrace.fa.gz: Compressed FASTA file per sample with sequence reads that passed QC in miRTrace.
The directory bowtie will contain one subdirectory for the mature reference alignment and one for the genome alignment (if not skipped).
Output directory: results/bowtie/{mature|genome}/{sample.id}/
- {sample.id}.flagstats: The flagstats output of the alignment.
- {sample.id}.stats: The stats output of the alignment.
- {sample.id}.out: The log of the Bowtie1 alignment. Will only be generated if the --save_intermediates parameter is set to true.
- {sample.id}.bam: Aligned BAM file results. Will only be generated if the --save_intermediates parameter is set to true.
- {sample.id}_unaligned.fa.gz: The unaligned reads in compressed FASTA format resulting from the mature reference alignment. This file is not generated during the genome alignment and will only be generated if the --save_intermediates parameter is set to true.
The directory mirna_quant will contain the quantification of the resulting BAM alignment files against the mature reference.
Output directory: results/mirna_quant/
- {sample.id}.mature.idxstats.tsv: Tab-separated file containing the miRNA counts from the mature reference alignment.
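The per-sample files in mirna_quant can be combined into a single count matrix for downstream analysis. The sketch below assumes each file is unmodified, headerless `samtools idxstats` output (reference name, sequence length, mapped reads, unmapped reads). The helper names `read_idxstats` and `merge_counts` are hypothetical and not part of the pipeline.

```python
import csv
from pathlib import Path

SUFFIX = ".mature.idxstats.tsv"


def read_idxstats(path):
    """Return {reference_name: mapped_reads} for one idxstats file."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 3 or row[0] == "*":
                continue  # skip blank lines and the '*' line for unplaced reads
            counts[row[0]] = int(row[2])  # third column: mapped reads
    return counts


def merge_counts(quant_dir):
    """Return (sample_ids, {mirna: [count for each sample]})."""
    files = sorted(Path(quant_dir).glob(f"*{SUFFIX}"))
    samples = [f.name[: -len(SUFFIX)] for f in files]
    per_sample = [read_idxstats(f) for f in files]
    mirnas = sorted(set().union(*per_sample))
    # A miRNA absent from a sample's file is reported as 0 for that sample.
    matrix = {m: [c.get(m, 0) for c in per_sample] for m in mirnas}
    return samples, matrix
```

`merge_counts("results/mirna_quant")` returns the sample identifiers in sorted file order together with one list of counts per miRNA, which is straightforward to write out as a CSV.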
The directory genome_quant will contain the quantification of the resulting BAM alignment files against the genome reference. This output directory will not be generated if the genome alignment is skipped.
Output directory: results/genome_quant/
- {sample.id}.genome.htseq.tsv: Tab-separated file containing the miRNA counts from the genome reference alignment.
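When reading the genome counts, note that htseq-count appends special counters whose names start with `__` (for example `__no_feature` and `__ambiguous`) after the per-feature rows. The sketch below separates the two, assuming the file is plain htseq-count output with the feature identifier in the first column and the count in the last. `read_htseq` is a hypothetical helper name.

```python
def read_htseq(path):
    """Return (feature_counts, summary_counters) from one htseq-count file."""
    counts, summary = {}, {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            feature, value = fields[0], int(fields[-1])
            # Names starting with '__' are htseq-count bookkeeping rows.
            target = summary if feature.startswith("__") else counts
            target[feature] = value
    return counts, summary
```

Keeping the summary counters separate prevents them from being mistaken for miRNA features when the counts are normalised or merged across samples.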
The directory mirdeep2 will contain the results of the novel miRNA discovery run, with an output directory for every sample provided to the pipeline (named after the sample column in samplesheet.csv). This output directory will not be generated if the miRDeep2 module is skipped.
Output directory: results/mirdeep2/{sample.id}/
- {sample.id}_mirdeep2.log: The log of the miRDeep2 run.
- {sample.id}_mirdeep2.bed: File with the known miRNAs in BED format.
- {sample.id}_mirdeep2.csv: File with an overview of all detected miRNAs (known and novel) in CSV format.
- {sample.id}_mirdeep2.html: An HTML report with an overview of all detected miRNAs (known and novel).
- miRNAs_expressed_all_samples.csv: File with the known miRNAs in CSV format.
- {sample.id}.genome.mirdeep.arf: Intermediate file containing the alignment results of the miRDeep2 mapper module. Will only be generated if the --save_intermediates parameter is set to true.
- {sample.id}.genome.mirdeep.fa: Intermediate file containing the mapped reads from the miRDeep2 mapper module. Will only be generated if the --save_intermediates parameter is set to true.
The directory multiqc will contain the pipeline QC from the supported tools (e.g., FastQC, bowtie1), which cover most of this pipeline's steps.
Output directory: results/multiqc/
- multiqc_report.html: An interactive HTML report of all compatible pipeline steps.
- multiqc_data/: Directory containing summarised data from all compatible pipeline steps generated by MultiQC.
The directory pipeline_info will contain various reports relevant to the running and execution of the pipeline. These reports allow you to troubleshoot errors with the running of the pipeline, and also provide other information such as launch commands, run times and resource usage.
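After a run, the rules above about which directories are generated can be turned into a quick completeness check. This is a sketch under stated assumptions: `expected_dirs` and `missing_dirs` are hypothetical helpers, the skip flags are treated as independent of each other, and the fastqc and multiqc directories are assumed to be absent when their modules are skipped (the text only states this explicitly for the genome and miRDeep2 outputs).

```python
from pathlib import Path


def expected_dirs(skip_fastqc=False, skip_genome=False,
                  skip_mirdeep=False, skip_multiqc=False):
    """List the result subdirectories a run with these flags should produce."""
    dirs = ["mirtrace", "bowtie/mature", "mirna_quant", "pipeline_info"]
    if not skip_fastqc:
        dirs.append("fastqc")
    if not skip_genome:
        dirs += ["bowtie/genome", "genome_quant"]
    if not skip_mirdeep:
        dirs.append("mirdeep2")
    if not skip_multiqc:
        dirs.append("multiqc")
    return dirs


def missing_dirs(results_dir, **skip_flags):
    """Return the expected subdirectories that do not exist under results_dir."""
    root = Path(results_dir)
    return [d for d in expected_dirs(**skip_flags) if not (root / d).is_dir()]
```

For example, `missing_dirs("results", skip_genome=True)` returns an empty list when every directory expected for a run without genome alignment is present.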