transcriptomicsonhoffman

After you log in to Hoffman2 and request a computational node:

Preprocessing the data

Assuming your fastq files are in your current working directory:

Install FastQC (here we create a new conda env to install fastqc)

conda create -n fastqc fastqc

conda activate fastqc

Inside a directory with the raw data files, run FastQC.

Interactive: create a directory called FastQC_output and store the FastQC reports in it.

mkdir FastQC_output/
fastqc *.fastq.gz -o FastQC_output/
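After the interactive run, it can be worth confirming that every input file produced a report. A minimal sketch, assuming FastQC's default naming (`sample.fastq.gz` yields `sample_fastqc.html`); the function name `missing_fastqc_reports` is made up for illustration:

```shell
# Print every *.fastq.gz in input_dir that has no matching FastQC HTML report
# in out_dir. Assumes FastQC's default naming: <name>.fastq.gz -> <name>_fastqc.html.
# missing_fastqc_reports is a hypothetical helper, not part of FastQC.
missing_fastqc_reports() {
  local input_dir=$1 out_dir=$2 f base
  for f in "$input_dir"/*.fastq.gz; do
    [ -e "$f" ] || continue            # the glob matched nothing
    base=$(basename "$f" .fastq.gz)
    [ -f "$out_dir/${base}_fastqc.html" ] || echo "$f"
  done
}

# Usage (from the directory holding the raw reads):
#   missing_fastqc_reports . FastQC_output
```

An empty result means every file has a report.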

Job submission (recommended). Note: you may need to provide the full filepath to the FastQC job script (FastQC.sh below; it may be named 1-FastQC.sh in your copy).

qsub ../rna_scripts/FastQC.sh

Aggregate the quality reports for all samples with MultiQC, which takes as input a directory containing the FastQC reports. (Note: I had trouble forcing MultiQC to use Python 3.10, so I used the workaround below.)

Deactivate the old conda environment and create a new one:

conda deactivate
conda create -n multiqc

Do not install MultiQC with conda, since it downloads an outdated version. Instead, I used pip to install the development version and forced the installation into $PROJECT, which has enough space, rather than the default $HOME location:

pip install --upgrade --force-reinstall git+https://github.com/MultiQC/MultiQC.git -t /u/project/jpjacobs/jpjacobs/rna_seq/
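Because `pip install -t` installs into a custom directory, Python will not look there by default, and the `multiqc` launcher should land in that directory's `bin/` subfolder. A minimal sketch of the environment setup, assuming the target directory from the command above; `MULTIQC_TARGET` is a variable name chosen here for illustration:

```shell
# Directory passed to `pip install -t` above; export MULTIQC_TARGET first to override.
MULTIQC_TARGET=${MULTIQC_TARGET:-/u/project/jpjacobs/jpjacobs/rna_seq}

# Let Python import packages from the target directory.
export PYTHONPATH="$MULTIQC_TARGET${PYTHONPATH:+:$PYTHONPATH}"

# pip places console scripts such as `multiqc` in <target>/bin.
export PATH="$MULTIQC_TARGET/bin:$PATH"
```

These settings last only for the current shell session, so they would need to be repeated in any job script that calls MultiQC.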

You may need to find the exact filepath to multiqc via the following command:

which multiqc

To run interactively, replace ~/.local/bin/multiqc with the exact filepath:

python ~/.local/bin/multiqc ./

Job submission (recommended). Do this within the directory where your outputs from FastQC are located.

cd FastQC_output
qsub multiqc.sh

Copy the .html report to your local machine with scp, or push it to GitHub from Hoffman2, then open the report in a browser. For help interpreting MultiQC results, see the walkthroughs listed under References.

Trim adapters and low-quality bases with Trimmomatic. Trimmomatic is already installed in the kneaddata env, so activate that env. Note that you can append additional Trimmomatic parameters; the command embedded in trimmomatic.sh uses very gentle trimming parameters and removes adapters on the assumption that the sequencer was an Illumina HiSeq.

conda activate kneaddata

Interactive:

trimmomatic PE JJ1715_393_S43_R1_001.fastq.gz JJ1715_393_S43_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:/u/home/j/jpjacobs/project-jpjacobs/software_rna_seq/Trimmomatic/trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36
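The `ILLUMINACLIP` argument packs several settings into one colon-separated string. The sketch below spells out each field of the value used above, following the field order given in the Trimmomatic documentation; the variable names are illustrative only:

```shell
# Fields of ILLUMINACLIP, in order (variable names are illustrative).
adapters=/u/home/j/jpjacobs/project-jpjacobs/software_rna_seq/Trimmomatic/trimmomatic-0.39/adapters/TruSeq3-PE.fa
seed_mismatches=2        # mismatches allowed in the initial seed match
palindrome_threshold=30  # score needed for a paired-end "palindrome" adapter match
simple_threshold=10      # score needed for a simple adapter match
min_adapter_length=2     # shortest adapter detected in palindrome mode
keep_both_reads=True     # keep the reverse read after palindrome clipping

illuminaclip="ILLUMINACLIP:${adapters}:${seed_mismatches}:${palindrome_threshold}:${simple_threshold}:${min_adapter_length}:${keep_both_reads}"
echo "$illuminaclip"
```

The remaining steps are simpler: LEADING:3 and TRAILING:3 trim bases with quality below 3 from the start and end of each read, and MINLEN:36 drops reads shorter than 36 bases after trimming.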

Job submission for many files (assumes you are in the directory where your raw fastq files are located). You may need to change the filepath so that it points to the Trimmomatic job script (3-trimmomatic.sh in the command below):

for f in *R1_001.fastq.gz; do
  name=$(basename "$f" R1_001.fastq.gz)
  qsub ../../../software_rna_seq/rna_scripts/3-trimmomatic.sh "${name}R1_001.fastq.gz" "${name}R2_001.fastq.gz"
done
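Before submitting, it may help to confirm that every R1 file has an R2 mate. The function below is a dry-run sketch: it prints the `qsub` commands instead of running them and warns about unpaired files. `SCRIPT` and `print_trim_jobs` are names introduced here for illustration.

```shell
# Path to the job script used in the loop above; adjust as needed.
SCRIPT=${SCRIPT:-../../../software_rna_seq/rna_scripts/3-trimmomatic.sh}

# Print one qsub command per complete R1/R2 pair in the given directory.
# Nothing is submitted; unpaired R1 files are reported on stderr.
print_trim_jobs() {
  local dir=${1:-.} r1 name r2
  for r1 in "$dir"/*R1_001.fastq.gz; do
    [ -e "$r1" ] || continue                      # no matching files
    name=$(basename "$r1" R1_001.fastq.gz)
    r2="$dir/${name}R2_001.fastq.gz"
    if [ -f "$r2" ]; then
      echo "qsub $SCRIPT ${name}R1_001.fastq.gz ${name}R2_001.fastq.gz"
    else
      echo "warning: no R2 mate for $r1" >&2
    fi
  done
}

# Usage: print_trim_jobs .
```

Reviewing the printed commands first makes it easier to catch a missing or misnamed file before jobs enter the queue.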

Install salmon. I downloaded salmon-1.10.0_linux_x86_64.tar.gz from https://github.com/COMBINE-lab/salmon/releases into the software_rna_seq folder, then unpacked it with tar:

tar xzvf salmon-1.10.0_linux_x86_64.tar.gz

Use salmon to index a mouse transcriptome

Download the transcriptome file (I tried GENCODE first but got a lot of warnings, so I switched to Ensembl). Note that I've provided these files in this repo:

wget http://ftp.ensembl.org/pub/release-111/fasta/mus_musculus_c57bl6nj/cdna/Mus_musculus_c57bl6nj.C57BL_6NJ_v1.cdna.all.fa.gz

Download annotation file. Note I've provided this in the repo:

wget http://ftp.ensembl.org/pub/release-111/gtf/mus_musculus_c57bl6nj/Mus_musculus_c57bl6nj.C57BL_6NJ_v1.111.gtf.gz
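Interrupted downloads are easy to miss. One way to check both files, assuming `gzip` is available on the node (`check_gz` is a hypothetical helper name):

```shell
# Report whether each gzip file passes an integrity test. `gzip -t` reads the
# whole archive and verifies its checksum without writing any output.
check_gz() {
  local f status=0
  for f in "$@"; do
    if gzip -t "$f" 2>/dev/null; then
      echo "OK      $f"
    else
      echo "CORRUPT $f"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_gz Mus_musculus_c57bl6nj.C57BL_6NJ_v1.cdna.all.fa.gz \
#            Mus_musculus_c57bl6nj.C57BL_6NJ_v1.111.gtf.gz
```

The function returns a non-zero status if any file fails, so it can also guard a later step in a script.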

Index transcriptome file. Note, I've provided it in this repo but feel free to build your own or update as new releases come out:

cd /u/home/j/jpjacobs/project-jpjacobs/software_rna_seq/salmon/salmon-latest_linux_x86_64
bin/salmon index -t Mus_musculus_c57bl6nj.C57BL_6NJ_v1.cdna.all.fa.gz -i Mus_musculus_c57bl6nj_index -p 8

Run salmon on trimmed fastq files:

../salmon/salmon-latest_linux_x86_64/bin/salmon quant -i ../salmon/salmon-latest_linux_x86_64/Mus_musculus_c57bl6nj_index -l A -1 output_JJ1715_393_S43_R1_001.fastq_paired.fq.gz -2 output_JJ1715_393_S43_R2_001.fastq_paired.fq.gz -p 8 --gcBias --validateMappings -o JJ1715_393_quant
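Once `salmon quant` finishes, the mapping rate is a quick sanity check. Salmon records it as `percent_mapped` in `aux_info/meta_info.json` inside the output directory. A minimal sketch that pulls the value out with `sed`, assuming the key and its number sit on one line; `mapping_rate` is a hypothetical helper:

```shell
# Print the percent_mapped value recorded by salmon for one quant directory.
mapping_rate() {
  local meta="$1/aux_info/meta_info.json"
  [ -f "$meta" ] || { echo "missing $meta" >&2; return 1; }
  sed -n 's/.*"percent_mapped": *\([0-9.]*\).*/\1/p' "$meta"
}

# Usage: mapping_rate JJ1715_393_quant
```

A value far below the other samples is usually worth investigating before building the count matrix.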

Job submission (recommended)

for f in *R1_001.fastq_paired.fq.gz; do
  name=$(basename "$f" R1_001.fastq_paired.fq.gz)
  qsub ../rna_scripts/salmon.sh "${name}R1_001.fastq_paired.fq.gz" "${name}R2_001.fastq_paired.fq.gz"
done
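The interactive example writes to `JJ1715_393_quant`, a name derived from the input file. How salmon.sh names its outputs is not shown here, but the same convention could be reproduced with shell parameter expansion. This is a sketch, assuming file names of the form `output_<sample>_S<number>_R1_001.fastq_paired.fq.gz`; `quant_dir_name` is a hypothetical helper:

```shell
# Derive an output directory name such as JJ1715_393_quant from a trimmed R1
# file such as output_JJ1715_393_S43_R1_001.fastq_paired.fq.gz.
quant_dir_name() {
  local base sample
  base=$(basename "$1")
  sample=${base#output_}                                  # drop the leading "output_"
  sample=${sample%_S[0-9]*_R1_001.fastq_paired.fq.gz}     # drop the "_S43_R1_001..." suffix
  echo "${sample}_quant"
}

# Usage: quant_dir_name output_JJ1715_393_S43_R1_001.fastq_paired.fq.gz
```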

Generating a count matrix

Follow the instructions in tximport.R and txmeta.R to generate TPM/count matrices and gene-level annotations.
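tximport reads one `quant.sf` file per sample, so a table of sample names and paths is a convenient starting point. The sketch below writes such a table, assuming the `<sample>_quant` directory naming from the interactive salmon example. The function name and the two-column layout are illustrative and may not match what tximport.R expects:

```shell
# Write a tab-separated table (sample, path to quant.sf) for every
# <sample>_quant directory under the given directory.
list_quant_files() {
  local dir=${1:-.} d sample
  printf 'sample\tquant_file\n'
  for d in "$dir"/*_quant; do
    [ -f "$d/quant.sf" ] || continue     # skip incomplete or missing runs
    sample=$(basename "$d" _quant)
    printf '%s\t%s\n' "$sample" "$d/quant.sf"
  done
}

# Usage: list_quant_files . > samples.tsv
```

Directories without a `quant.sf` are skipped, so comparing the row count with the number of samples is a quick way to spot failed jobs.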

References:

Walkthrough of an entire preprocessing workflow: https://bookdown.org/jean_souza/PreProcSEQ/quality-control.html#fastqc-1
Walkthrough of an entire preprocessing workflow: https://github.com/hbctraining/Intro-to-rnaseq-hpc-gt/blob/master/lessons/08_rnaseq_workflow.md
Walkthrough: https://h3abionet.github.io/H3ABionet-SOPs/RNA-Seq
Documentation for Trimmomatic: https://github.com/usadellab/Trimmomatic
Documentation for Salmon: https://combine-lab.github.io/salmon/getting_started/
Making a decoys.txt file: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2021/RNAseq/Markdowns/05_Quantification_with_Salmon_practical.html
