After you log in to Hoffman2 and request a compute node, and assuming your FASTQ files are in your current working directory:
- Install FastQC (here we create a new conda env to install fastqc)
conda create -n fastqc fastqc
conda activate fastqc
- Inside a directory with the raw data files, run FastQC.
Interactive: creates a directory called FastQC_output and stores the FastQC reports there.
mkdir FastQC_output/
fastqc *.fastq.gz -o FastQC_output/
Job submission (recommended). Note, you may need to provide the full filepath to 1-FastQC.sh
qsub ../rna_scripts/FastQC.sh
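After a batch job, it can be worth confirming that every input file produced a report before moving on. The sketch below assumes FastQC's default output naming, where `sample.fastq.gz` yields `sample_fastqc.html`. The function name `missing_fastqc_reports` is hypothetical and not part of this repo.

```shell
#!/usr/bin/env bash
# List raw FASTQ files that have no matching FastQC HTML report.
# Assumes FastQC's default naming: sample.fastq.gz -> sample_fastqc.html
missing_fastqc_reports() {
    local raw_dir="$1" report_dir="$2" f base
    for f in "$raw_dir"/*.fastq.gz; do
        [ -e "$f" ] || continue              # the glob matched nothing
        base=$(basename "$f" .fastq.gz)
        if [ ! -f "$report_dir/${base}_fastqc.html" ]; then
            echo "$base"
        fi
    done
}

# Example: run from the directory that holds the raw data.
missing_fastqc_reports . FastQC_output
```

An empty result means every `*.fastq.gz` file has a report; any names printed are samples to rerun.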
- Aggregate the quality reports for all samples with MultiQC, which takes as input a directory containing the FastQC output. (Note: I had issues forcing MultiQC to use Python 3.10, so I used the workaround below.)
Deactivate the old conda environment and create a new one:
conda deactivate
conda create -n multiqc
Do not install MultiQC with conda, because it downloads an outdated version. Instead, I used pip to install the development version, and I forced the install into $PROJECT, which has enough space, rather than the default $HOME location:
pip install --upgrade --force-reinstall git+https://github.com/MultiQC/MultiQC.git -t /u/project/jpjacobs/jpjacobs/rna_seq/
You may need to find the exact filepath to multiqc via the following command:
which multiqc
To run interactively, replace ~/.local/bin/multiqc with the exact filepath:
python ~/.local/bin/multiqc ./
Job submission (recommended). Do this within the directory where your outputs from FastQC are located.
cd FastQC_output
qsub multiqc.sh
- Copy the .html report to your local machine with scp, or push it to GitHub from Hoffman2. Open report.html in a browser. For help interpreting MultiQC results, see the following resources:
- Trim adapters and low-quality reads with Trimmomatic. Since Trimmomatic is already installed in the kneaddata env, activate that env. Note that you can append additional parameters for Trimmomatic; the command embedded in trimmomatic.sh uses very gentle trimming parameters and removes adapters assuming an Illumina HiSeq was the sequencer.
conda activate kneaddata
Interactive:
trimmomatic PE JJ1715_393_S43_R1_001.fastq.gz JJ1715_393_S43_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:/u/home/j/jpjacobs/project-jpjacobs/software_rna_seq/Trimmomatic/trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36
Job submission for many files (assumes you are in the directory where your raw FASTQ files are located). You may need to change the filepath to point to run_trimmomatic.sh.
for f in *R1_001.fastq.gz; do name=$(basename $f R1_001.fastq.gz); qsub ../../../software_rna_seq/rna_scripts/3-trimmomatic.sh ${name}R1_001.fastq.gz ${name}R2_001.fastq.gz; done
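The loop above uses `basename` to strip the R1 suffix and recover the prefix shared by both mates. A dry-run sketch of that derivation follows; it only prints the command it would run. The output naming (`output_<input without .gz>_paired.fq.gz`) is an assumption inferred from the file names used in the Salmon step, so the actual job script may differ. `ADAPTERS` and `trimmomatic_cmd_for` are hypothetical names.

```shell
#!/usr/bin/env bash
# Dry run: print (do not execute) a Trimmomatic PE command for one R1 file.
# ADAPTERS is a placeholder; point it at your copy of TruSeq3-PE.fa.
ADAPTERS="${ADAPTERS:-TruSeq3-PE.fa}"

trimmomatic_cmd_for() {
    local r1="$1" prefix r2
    # Strip the R1 suffix to get the shared prefix, e.g. "JJ1715_393_S43_"
    prefix=$(basename "$r1" R1_001.fastq.gz)
    r2="${prefix}R2_001.fastq.gz"
    # Assumed output naming: output_<input without .gz>_{paired,unpaired}.fq.gz
    echo trimmomatic PE "$r1" "$r2" \
        "output_${r1%.gz}_paired.fq.gz" "output_${r1%.gz}_unpaired.fq.gz" \
        "output_${r2%.gz}_paired.fq.gz" "output_${r2%.gz}_unpaired.fq.gz" \
        "ILLUMINACLIP:${ADAPTERS}:2:30:10:2:True" LEADING:3 TRAILING:3 MINLEN:36
}

# Run from the directory that holds the raw files.
trimmomatic_cmd_for JJ1715_393_S43_R1_001.fastq.gz
```

Printing the command first makes it easy to spot a wrong R2 name or adapter path before submitting many jobs.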
- Install Salmon (I downloaded salmon-1.10.0_linux_x86_64.tar.gz from https://github.com/COMBINE-lab/salmon/releases to the software_rna_seq folder, then unpacked it with tar):
tar xzvf salmon-1.10.0_linux_x86_64.tar.gz
- Use Salmon to index a mouse transcriptome
Download transcriptome file (I tried gencode first but had a lot of warnings, so I switched to ensembl). Note I've provided these for you in this repo:
wget http://ftp.ensembl.org/pub/release-111/fasta/mus_musculus_c57bl6nj/cdna/Mus_musculus_c57bl6nj.C57BL_6NJ_v1.cdna.all.fa.gz
Download annotation file. Note I've provided this in the repo:
wget http://ftp.ensembl.org/pub/release-111/gtf/mus_musculus_c57bl6nj/Mus_musculus_c57bl6nj.C57BL_6NJ_v1.111.gtf.gz
Index transcriptome file. Note, I've provided it in this repo but feel free to build your own or update as new releases come out:
cd /u/home/j/jpjacobs/project-jpjacobs/software_rna_seq/salmon/salmon-latest_linux_x86_64
bin/salmon index -t Mus_musculus_c57bl6nj.C57BL_6NJ_v1.cdna.all.fa.gz -i Mus_musculus_c57bl6nj_index -p 8
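If you build your own index, a truncated download is an easy thing to miss. As a minimal sketch (the function name `count_fasta_records` is hypothetical), the following tests the gzip integrity of the transcriptome file and counts its FASTA header lines before you spend time indexing it:

```shell
#!/usr/bin/env bash
# Verify a gzipped FASTA and report how many sequences (header lines) it holds.
count_fasta_records() {
    local fa="$1"
    gzip -t "$fa" || { echo "corrupt or truncated: $fa" >&2; return 1; }
    # grep -c prints 0 but exits 1 when nothing matches; keep the count, drop the status
    gzip -dc "$fa" | grep -c '^>' || true
}

fa=Mus_musculus_c57bl6nj.C57BL_6NJ_v1.cdna.all.fa.gz
if [ -f "$fa" ]; then
    count_fasta_records "$fa"
fi
```

A count of zero, or a gzip error, suggests the file should be downloaded again.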
- Run salmon on trimmed fastq files:
../salmon/salmon-latest_linux_x86_64/bin/salmon quant -i ../salmon/salmon-latest_linux_x86_64/Mus_musculus_c57bl6nj_index -l A -1 output_JJ1715_393_S43_R1_001.fastq_paired.fq.gz -2 output_JJ1715_393_S43_R2_001.fastq_paired.fq.gz -p 8 --gcBias --validateMappings -o JJ1715_393_quant
Job submission (Recommended)
for f in *R1_001.fastq_paired.fq.gz; do name=$(basename $f R1_001.fastq_paired.fq.gz); qsub ../rna_scripts/salmon.sh ${name}R1_001.fastq_paired.fq.gz ${name}R2_001.fastq_paired.fq.gz; done
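To preview what each submitted job would do, the name handling can be sketched as a dry run that prints a `salmon quant` command instead of running it. The flags mirror the interactive command above. Deriving the output directory by dropping the `output_` prefix and the `_S43_`-style suffix is an assumption based on that example, and `SALMON`, `INDEX`, and `salmon_cmd_for` are hypothetical names; the repo's job script may do this differently.

```shell
#!/usr/bin/env bash
# Dry run: print (do not execute) a salmon quant command for one trimmed R1 file.
# SALMON and INDEX are placeholders; adjust them to your install and index path.
SALMON="${SALMON:-salmon}"
INDEX="${INDEX:-Mus_musculus_c57bl6nj_index}"

salmon_cmd_for() {
    local r1="$1" prefix r2 sample
    # Shared prefix of both mates, e.g. "output_JJ1715_393_S43_"
    prefix=$(basename "$r1" R1_001.fastq_paired.fq.gz)
    r2="${prefix}R2_001.fastq_paired.fq.gz"
    sample="${prefix#output_}"       # drop the leading "output_"
    sample="${sample%_S[0-9]*_}"     # drop the trailing "_S43_"-style tag
    echo "$SALMON" quant -i "$INDEX" -l A -1 "$r1" -2 "$r2" \
        -p 8 --gcBias --validateMappings -o "${sample}_quant"
}

salmon_cmd_for output_JJ1715_393_S43_R1_001.fastq_paired.fq.gz
```

Checking the printed `-2` and `-o` values for one or two samples catches naming mistakes before the whole batch is queued.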
- Follow the instructions in tximport.R and txmeta.R to generate TPM/count matrices and gene-level annotations.
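Before importing into R, it may help to confirm that every Salmon job finished. Salmon writes its estimates to `quant.sf` inside each output directory, so a missing or empty file points to a failed job. The sketch below assumes output directories end in `_quant`, as in the interactive example; `incomplete_quant_dirs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List *_quant directories that lack a non-empty quant.sf file.
incomplete_quant_dirs() {
    local root="${1:-.}" d
    for d in "$root"/*_quant; do
        [ -d "$d" ] || continue              # the glob matched nothing
        [ -s "$d/quant.sf" ] || basename "$d"
    done
}

# Example: run from the directory that holds the Salmon outputs.
incomplete_quant_dirs .
```

Any directory it prints should be rerun before building the count matrices.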
- Walkthrough of an entire preprocessing workflow: https://bookdown.org/jean_souza/PreProcSEQ/quality-control.html#fastqc-1
- Walkthrough of an entire preprocessing workflow: https://github.com/hbctraining/Intro-to-rnaseq-hpc-gt/blob/master/lessons/08_rnaseq_workflow.md
- Walkthrough: https://h3abionet.github.io/H3ABionet-SOPs/RNA-Seq
- Documentation for Trimmomatic: https://github.com/usadellab/Trimmomatic
- Documentation for Salmon: https://combine-lab.github.io/salmon/getting_started/
- Making a decoys.txt file: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2021/RNAseq/Markdowns/05_Quantification_with_Salmon_practical.html