This workflow performs RNA-seq quantification using Salmon, a fast and accurate tool for transcript expression estimation. The workflow is designed to be simple to use while implementing best practices for RNA-seq analysis.
- Builds a Salmon Index from your reference transcriptome
- Quantifies Transcripts for each of your RNA-seq samples
- Generates Expression Matrices combining results from all samples
- Cromwell or another WDL-compatible workflow engine
- Docker (the workflow uses the `combinelab/salmon` container)
- Input files:
  - Reference transcriptome (FASTA format)
  - RNA-seq reads (FASTQ format, can be gzipped)
- Download the WDL file from this repository.
- Create an inputs JSON file (e.g., `inputs.json`):

      {
        "SalmonRnaSeq.transcriptome_fasta": "path/to/transcriptome.fa",
        "SalmonRnaSeq.fastq_r1_files": [
          "path/to/sample1_R1.fastq.gz",
          "path/to/sample2_R1.fastq.gz"
        ],
        "SalmonRnaSeq.fastq_r2_files": [
          "path/to/sample1_R2.fastq.gz",
          "path/to/sample2_R2.fastq.gz"
        ]
      }

- Run the workflow with Cromwell:

      java -jar cromwell.jar run salmon_rnaseq.wdl -i inputs.json
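For projects with many samples, writing the inputs JSON by hand gets tedious. The sketch below builds the same structure from a list of FASTQ paths. It assumes a naming convention in which mates end in `_R1` and `_R2` before the `.fastq`/`.fq` extension; the function name `build_inputs` and that convention are illustrative assumptions, not part of the workflow.

```python
import json
import re

# Assumed naming convention: "_R1" directly before .fastq/.fq (optionally .gz).
R1_TAG = re.compile(r"_R1(?=\.(?:fastq|fq)(?:\.gz)?$)")


def build_inputs(transcriptome, fastq_paths):
    """Return a workflow inputs dict, pairing each _R1 file with its _R2 mate."""
    names = sorted(str(p) for p in fastq_paths)
    present = set(names)
    r1_files, r2_files = [], []
    for name in names:
        if not R1_TAG.search(name):
            continue  # skip R2 files and anything that does not match
        mate = R1_TAG.sub("_R2", name)
        if mate not in present:
            raise ValueError(f"no R2 mate found for {name}")
        r1_files.append(name)
        r2_files.append(mate)
    return {
        "SalmonRnaSeq.transcriptome_fasta": str(transcriptome),
        "SalmonRnaSeq.fastq_r1_files": r1_files,
        "SalmonRnaSeq.fastq_r2_files": r2_files,
    }


def to_json(inputs):
    """Serialize the inputs dict in the layout Cromwell reads with -i."""
    return json.dumps(inputs, indent=2) + "\n"
```

Writing the result of `to_json(build_inputs(...))` to `inputs.json` gives a file that can be passed to the Cromwell command above. Because the two arrays are built in the same loop, read 1 and read 2 stay in matching order.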
Parameter | Description | Required? |
---|---|---|
`transcriptome_fasta` | Reference transcriptome in FASTA format | Yes |
`fastq_r1_files` | Array of FASTQ files for read 1 (or single-end reads) | Yes |
`fastq_r2_files` | Array of FASTQ files for read 2 (for paired-end data) | No |
`salmon_docker` | Docker image for Salmon (default: "combinelab/salmon:latest") | No |
Output | Description |
---|---|
`salmon_index_tar` | Compressed Salmon index (can be reused for future analyses) |
`salmon_quant_dirs` | Compressed quantification results for each sample |
`merged_tpm_matrix` | Combined TPM values matrix for all samples |
`merged_counts_matrix` | Combined read counts matrix for all samples |
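To make the shape of a merged matrix concrete, here is a minimal sketch that combines per-sample values into one table. The exact layout of the workflow's matrices is not described here, so the layout used (one row per transcript, one column per sample, tab-separated, a `Name` header) and the helper names `merge_columns` and `to_tsv` are assumptions for illustration.

```python
def merge_columns(per_sample):
    """Merge {sample: {transcript: value}} into (header, rows), one column per sample."""
    samples = sorted(per_sample)
    transcripts = sorted({t for values in per_sample.values() for t in values})
    header = ["Name"] + samples
    # A transcript missing from a sample is filled with 0.0.
    rows = [[t] + [per_sample[s].get(t, 0.0) for s in samples] for t in transcripts]
    return header, rows


def to_tsv(header, rows):
    """Render the merged table as tab-separated text."""
    lines = ["\t".join(header)]
    lines += ["\t".join([row[0]] + [f"{v:g}" for v in row[1:]]) for row in rows]
    return "\n".join(lines) + "\n"
```

The same function works for TPM values and for read counts, since it only aligns values by transcript name.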
Example inputs for paired-end data:
{
"SalmonRnaSeq.transcriptome_fasta": "references/gencode.v38.transcripts.fa",
"SalmonRnaSeq.fastq_r1_files": [
"samples/sample1_R1.fastq.gz",
"samples/sample2_R1.fastq.gz"
],
"SalmonRnaSeq.fastq_r2_files": [
"samples/sample1_R2.fastq.gz",
"samples/sample2_R2.fastq.gz"
]
}
Example inputs for single-end data (omit `fastq_r2_files`):
{
"SalmonRnaSeq.transcriptome_fasta": "references/gencode.v38.transcripts.fa",
"SalmonRnaSeq.fastq_r1_files": [
"samples/sample1.fastq.gz",
"samples/sample2.fastq.gz"
]
}
You can download reference transcriptomes from:
- GENCODE (human/mouse)
- Ensembl (many species)
- UCSC Genome Browser
For human, a common choice is the GENCODE reference:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.transcripts.fa.gz
gunzip gencode.v38.transcripts.fa.gz
The workflow provides compressed output directories for each sample. To extract a specific sample's results:
tar -xzf sample1_quant.tar.gz
This will create a directory `sample1_quant` containing Salmon's output files, including:
- `quant.sf`: The main quantification results file
- `logs/`: Directory containing Salmon log files
- `lib_format_counts.json`: Information about the library type
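Once extracted, `quant.sf` is a tab-separated text file with the columns `Name`, `Length`, `EffectiveLength`, `TPM`, and `NumReads`. A minimal sketch for reading it with the Python standard library follows; the function name `read_quant_sf` is a hypothetical helper, not something the workflow provides.

```python
import csv


def read_quant_sf(lines):
    """Return {transcript_name: (tpm, num_reads)} from the lines of a quant.sf file.

    `lines` can be an open file handle or any iterable of text lines.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        row["Name"]: (float(row["TPM"]), float(row["NumReads"]))
        for row in reader
    }
```

For example, `with open("sample1_quant/quant.sf") as handle: quant = read_quant_sf(handle)` loads the results for the sample extracted above.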
- TPM (Transcripts Per Million): Normalized expression values suitable for comparing expression levels between samples
- counts: Estimated number of fragments/reads from each transcript, suitable for differential expression analysis
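The relationship between the two measures can be shown with a small worked example. This sketch uses the standard TPM definition (each transcript's count divided by its effective length, rescaled so the values sum to one million); Salmon computes its values as part of its own inference procedure, so this is only an illustration, and `counts_to_tpm` is a hypothetical helper name.

```python
def counts_to_tpm(counts, effective_lengths):
    """Convert estimated counts to TPM using the standard definition."""
    # Reads per base of effective length for each transcript.
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Rescale so all transcripts in the sample sum to one million.
    return [r / total * 1e6 for r in rates]
```

With counts of 100 and 300 on two transcripts of equal effective length, the TPM values split one million in a 1:3 ratio. Because TPM is rescaled within each sample, raw counts rather than TPM are the usual input to differential expression tools.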
This workflow:
- Creates a Salmon index from your transcriptome
- Processes each sample with the following settings enabled:
  - Automatic library type detection
  - GC bias correction
  - Sequence-specific bias correction
  - Mapping validation
- Combines results into unified matrices with properly labeled sample names
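The per-sample settings listed above correspond to standard `salmon quant` options: `-l A` for automatic library type detection, `--gcBias`, `--seqBias`, and `--validateMappings`. The sketch below assembles such a command as an argument list. The exact command inside the WDL may differ; the function name, the default thread count, and the directory arguments are assumptions for illustration.

```python
def salmon_quant_command(index_dir, out_dir, r1, r2=None, threads=4):
    """Build a `salmon quant` argument list for one sample (single- or paired-end)."""
    cmd = ["salmon", "quant", "-i", index_dir, "-l", "A"]
    if r2 is None:
        cmd += ["-r", r1]            # single-end reads
    else:
        cmd += ["-1", r1, "-2", r2]  # paired-end mates
    cmd += [
        "--validateMappings",  # mapping validation
        "--gcBias",            # GC bias correction
        "--seqBias",           # sequence-specific bias correction
        "-p", str(threads),
        "-o", out_dir,
    ]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Salmon is installed, which is a convenient way to reproduce a single sample outside the workflow.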
Error: "Docker image not found"
- Solution: Ensure Docker is installed and running, and that the image named in `salmon_docker` is spelled correctly
Error: "File not found"
- Solution: Check the paths in your inputs.json file
Error: "Memory allocation failed"
- Solution: Adjust the `memory_gb` parameters in the WDL file
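"File not found" errors are easier to catch before the workflow starts. This sketch lists every path in an inputs dict that does not exist on the local filesystem. It assumes local paths (not cloud URLs) and the three file inputs documented in this README; `missing_input_files` is a hypothetical helper.

```python
from pathlib import Path

# File-valued inputs documented for this workflow.
FILE_KEYS = (
    "SalmonRnaSeq.transcriptome_fasta",
    "SalmonRnaSeq.fastq_r1_files",
    "SalmonRnaSeq.fastq_r2_files",
)


def missing_input_files(inputs):
    """Return the paths named in the inputs dict that are not existing files."""
    missing = []
    for key in FILE_KEYS:
        value = inputs.get(key, [])
        # The transcriptome is a single string; the FASTQ inputs are lists.
        candidates = [value] if isinstance(value, str) else value
        missing += [p for p in candidates if not Path(p).is_file()]
    return missing
```

Loading `inputs.json` with `json.load` and passing the result to this function returns an empty list when every path resolves.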
If you need to modify the workflow for advanced settings:
- Edit the runtime parameters at the task level:

      runtime {
        docker: docker_image
        memory: "~{memory_gb} GB"
        cpu: cpu
        disks: "local-disk ~{disk_size_gb} SSD"
        preemptible: 1
      }

- Add additional Salmon parameters in the command sections if needed