GeneTEFlow: A Nextflow-based one-stop pipeline for differential expression analysis of genes and locus-specific transposable elements from RNA sequencing

1. Introduction

GeneTEFlow is a reproducible, platform-independent workflow for the comprehensive analysis of gene and locus-specific transposable element (TE) expression from RNA-Seq data, built on Nextflow and Docker.

2. Installation

Section 1: Install Docker and Singularity (requires "root" permission)

Step 1:

Install Docker on an Ubuntu Linux system:

# apt-get install docker-ce

# docker --version

Docker version 18.03.1-ce, build 9ee9f40

# which docker

/usr/bin/docker

Step 2:

Install Singularity on an Ubuntu Linux system:

# apt-get install singularity-container

# singularity --version

2.5.1-master.gd6e81547

# which singularity

/usr/local/bin/singularity

Section 2: Get GeneTEFlow from GitHub

# git clone https://github.com/zhongw2/GeneTEFlow

Section 3: Build images (requires "root" permission)

Using the Dockerfile of GeneTEFlow.Process as an example:

# cd GeneTEFlow_Dockerfiles/GeneTEFlow.Process/

# docker build -t rnaseq_pipeline.app .

Ref: https://docs.docker.com/engine/reference/commandline/build/

Optional:

To run containers with Singularity, one more step is required to convert the Docker image to a Singularity image:

# cd /mnt/

# docker run -v /var/run/docker.sock:/var/run/docker.sock -v /mnt:/output --privileged -t --rm singularityware/docker2singularity rnaseq_pipeline.app

Ref: https://github.com/singularityware/docker2singularity

The output is a Singularity image written to the /mnt directory, for example "rnaseq_pipeline.app-2020-3-29-cf77fe9d8630.simg".

You may rename it, for example to "rnaseq_pipeline.hpc.simg", and run it with Singularity on High Performance Computing (HPC) clusters.

Section 4: Testing containers

Testing the Docker container:

$ docker run rnaseq_pipeline.app ls /RANSeq

Ref: https://docs.docker.com/engine/reference/commandline/run/

Testing the Singularity container:

$ singularity exec rnaseq_pipeline.hpc.simg ls /RANSeq

Ref: https://singularity.lbl.gov/docs-run

Section 5: Install Nextflow

Optional:

You might need to create a new user account for running Nextflow. For instance, to create a user account named "geneteflow1":

# useradd -m geneteflow1 -d /mnt/geneteflow1 -s /bin/bash

# passwd geneteflow1

(enter a password at the prompt, for example geneteflow123)

Then:

Log in as user geneteflow1 and install Nextflow on the Ubuntu Linux system:

$ pwd

/home/geneteflow1

$ curl -s https://get.nextflow.io | bash

$ ./nextflow run hello

Ref: https://www.nextflow.io/

3. Running GeneTEFlow

Section 1: Download the reference genome and GTF files

The human reference genome UCSC hg38 and its gene annotation (.gtf) were downloaded from the Illumina iGenomes collection: https://support.illumina.com/sequencing/sequencing_software/igenome.html

$ wget http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Homo_sapiens/UCSC/hg38/Homo_sapiens_UCSC_hg38.tar.gz

$ tar xzvf Homo_sapiens_UCSC_hg38.tar.gz

$ cp Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa .

$ cp Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf .

Section 2: Collect all Illumina raw data (.fastq.gz) into one folder

$ mkdir RAW_DATA/

You may use the "ln -s" command to create soft links that point to the original locations of the raw data.

Here, human RNA sequencing data were downloaded from GEO under accession number GSE30352, including brain, heart, and testis samples with biological replicates.
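As a minimal sketch of the soft-link approach, the function below links every .fastq.gz file from a source directory into RAW_DATA/. The function name link_raw_data and the directory arguments are hypothetical and are not part of GeneTEFlow; adapt the paths to your own layout.

```shell
# Hypothetical helper (not part of GeneTEFlow): link all *.fastq.gz files
# from a source directory into a destination directory (default: RAW_DATA).
link_raw_data() {
    src_dir="$1"                 # directory holding the original FASTQ files
    dest_dir="${2:-RAW_DATA}"    # folder that the pipeline will read from
    mkdir -p "$dest_dir"
    for fq in "$src_dir"/*.fastq.gz; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$fq" ] || continue
        # Use an absolute target so the link stays valid from any directory.
        abs_dir=$(cd "$(dirname "$fq")" && pwd)
        ln -sf "$abs_dir/$(basename "$fq")" "$dest_dir/"
    done
}
```

For example, `link_raw_data ~/original_locations RAW_DATA` would populate RAW_DATA/ with links while leaving the original files in place.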

Samples	GEO number	SRR number
Brain replicate 1	GSM752691	SRR306838
Brain replicate 2	GSM752694	SRR306841
Brain replicate 3	GSM752692	SRR306839
Heart replicate 1	GSM752699	SRR306847
Heart replicate 2	GSM752701	SRR306850
Testis replicate 1	GSM752707	SRR306857
Testis replicate 2	GSM752708	SRR306858

To build small test data sets, the first 1,000,000 reads of each sample were used. A FASTQ record spans four lines, so this corresponds to the first 4,000,000 lines of each file:

$ zcat ~/original_locations/hsa.br.F.1_GSM752691_R1.fastq.gz | head -n 4000000 | gzip > RAW_DATA/hsa.br.F.1_GSM752691_R1.fastq.gz
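The same subsampling can be applied to every sample in a loop. The sketch below assumes that all original files sit in one directory and end in .fastq.gz; the function names subsample_fastq and subsample_all are hypothetical and not part of GeneTEFlow.

```shell
# Hypothetical helpers (not part of GeneTEFlow).

# Keep the first n reads of one gzipped FASTQ file.
subsample_fastq() {
    in_fq="$1"
    out_fq="$2"
    n_reads="${3:-1000000}"
    # One FASTQ record is 4 lines, so n reads correspond to 4*n lines.
    gzip -dc "$in_fq" | head -n "$((n_reads * 4))" | gzip > "$out_fq"
}

# Subsample every *.fastq.gz file of a source directory into dest_dir.
subsample_all() {
    src_dir="$1"
    dest_dir="${2:-RAW_DATA}"
    reads="${3:-1000000}"
    mkdir -p "$dest_dir"
    for fq in "$src_dir"/*.fastq.gz; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$fq" ] || continue
        subsample_fastq "$fq" "$dest_dir/$(basename "$fq")" "$reads"
    done
}
```

For example, `subsample_all ~/original_locations RAW_DATA 1000000` writes a truncated copy of each file into RAW_DATA/. For paired-end data, taking the same number of leading reads from the R1 and R2 files keeps the mates in step, provided the two files list reads in the same order.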

Section 3: Modify the GeneTEFlow configuration file accordingly

Parameters Configuration

The parameters below can be edited in the configuration file, so researchers can adjust the settings to match their type of RNA-Seq data.

Name	Default value	Description
params.reads	`./RAW_DATA/*_R{1,2}.fastq.gz`	The input raw FASTQ files
params.adapter_trim_tag	`Y`	specify whether to run adapter trimming: "Y" (yes) or "N" (no)
params.DESeq_run_tag	`Y`	specify whether to run DESeq2 for differential expression analysis: "Y" (yes) or "N" (no)
params.DESeq_replicates	`Y`	specify whether to run DESeq2 with or without replicates: "Y" (with replicates) or "N" (without replicates)
params.sampleinfoxlsx	`sampledetail.xlsx`	specify the Excel file that stores the RNA-Seq sample information
params.sample.manifest.sheetname	`sample.manifest`	The sheet name in the Excel file that holds the information for each RNA-Seq sample
params.samplecompare.sheetname	`samplecompare`	The sheet name in the Excel file that defines the RNA-Seq sample comparisons
params.deseq.log2FC.gene	`1`	specify the log2 fold-change cutoff for identifying differential gene expression
params.deseq.fdr.gene	`0.05`	specify the adjusted P-value cutoff for identifying differential gene expression
params.deseq.gmean.gene	`50`	specify the cutoff on the maximum group-mean count in the DESeq2 results, used to filter out lowly expressed genes
params.forwardprob	`0.5`	specify the strand-specific information for read alignment (please refer to the RSEM parameter “--forward-prob” at http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html)
params.gtf	`genes.gtf`	specify the GTF annotation file for the analysis of differential gene expression
params.fasta	`genome.fa`	specify the reference genome for the analysis of differential gene expression
params.TE_pipeline_run_tag	`Y`	specify whether to run the analysis for identifying differential TE expression: "Y" (yes) or "N" (no)
params.deseq.log2FC.TE	`1`	specify the log2 fold-change cutoff for identifying differential TE expression
params.deseq.fdr.TE	`0.05`	specify the adjusted P-value cutoff for identifying differential TE expression
params.deseq.gmean.TE	`50`	specify the cutoff on the maximum group-mean count in the DESeq2 results, used to filter out lowly expressed TEs
params.squireFetch.genome	`hg38`	specify the reference genome version: hg38, hg19, mm10, mm9, etc. (please refer to SQuIRE "squire Fetch" at https://github.com/wyang17/SQuIRE#arguments-for-each-step)
params.TrimmedDir	`Trimmed_RawData`	specify the name of the output folder for trimmed reads
params.FastQCdir	`FastQC_Results`	specify the name of the output folder for QC results from FastQC
params.sampleinfoDir	`SampleinfoDir`	specify the name of the output folder for the TXT files that store the RNA-Seq sample information and comparisons
params.ReportDir	`Report_Results`	specify the name of the output folder for results generated by the pipeline
params.AllResultsDir	`All_Results`	specify the name of the output folder for intermediate data and results generated by the pipeline
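Before launching the pipeline, it can help to confirm that the inputs named by params.fasta, params.gtf, and params.reads are actually present. The check below is only a sketch built on the default values from the table; the function name check_inputs is hypothetical and not part of GeneTEFlow.

```shell
# Hypothetical pre-flight check (not part of GeneTEFlow).
# Defaults mirror the default parameter values listed in the table above.
check_inputs() {
    fasta="${1:-genome.fa}"      # value of params.fasta
    gtf="${2:-genes.gtf}"        # value of params.gtf
    raw_dir="${3:-RAW_DATA}"     # folder matched by params.reads
    status=0
    # -s is true only for files that exist and are not empty.
    [ -s "$fasta" ] || { echo "missing or empty reference genome: $fasta" >&2; status=1; }
    [ -s "$gtf" ] || { echo "missing or empty annotation file: $gtf" >&2; status=1; }
    # Both single-end and paired-end runs need at least one *_R1.fastq.gz file.
    ls "$raw_dir"/*_R1.fastq.gz > /dev/null 2>&1 || { echo "no *_R1.fastq.gz files in $raw_dir" >&2; status=1; }
    return "$status"
}
```

Running `check_inputs genome.fa genes.gtf RAW_DATA` returns a non-zero status and prints one message per missing input, so it can be used as a guard in front of the nextflow command.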

Optional 1: configuration file for docker container

To run GeneTEFlow locally, set the executor in the configuration file:

process.executor = 'local'

GeneTEFlow can process both single-end and paired-end reads. Please see "geneTEflow.SE.docker.config" and "geneTEflow.PE.docker.config", respectively.

1. Single-end reads:

For example,

Specify the location of RAW data:

params.reads = "./RAW_DATA/*_R1.fastq.gz"

Specify the details of samples information:

params.sampleinfoxlsx = "SE_Nextflow_pipeline.Human_data.xlsx"

2. Paired-end reads:

For example,

Specify the location of RAW data:

params.reads = "./RAW_DATA/*_R{1,2}.fastq.gz"

Specify the details of samples information:

params.sampleinfoxlsx = "PE_sampledetail.xlsx"

For more details on executor configuration, please refer to https://www.nextflow.io/docs/latest/executor.html

Optional 2: configuration file for Singularity container

To run GeneTEFlow on an HPC cluster managed by LSF, set the executor in the configuration file:

process.executor = 'lsf'

GeneTEFlow can process both single-end and paired-end reads. Please see "geneTEflow.SE.Singularity.config" and "geneTEflow.PE.Singularity.config", respectively.

1. Single-end reads:

For example,

Specify the location of RAW data:

params.reads = "./RAW_DATA/*_R1.fastq.gz"

Specify the details of samples information:

params.sampleinfoxlsx = "SE_Nextflow_pipeline.Human_data.xlsx"

2. Paired-end reads:

For example,

Specify the location of RAW data:

params.reads = "./RAW_DATA/*_R{1,2}.fastq.gz"

Specify the details of samples information:

params.sampleinfoxlsx = "PE_sampledetail.xlsx"

For more details on executor configuration, please refer to https://www.nextflow.io/docs/latest/executor.html

Section 4: running GeneTEFlow

Optional 1: running GeneTEFlow by interacting with docker containers

Single-end reads:

$ ~/nextflow run ~/GeneTEflow_pipelines/pipeline.SE.nf -c ~/GeneTEflow_pipelines/geneTEflow.SE.docker.config -with-dag flowchart.html -with-report nf.report.html -with-timeline nf.timeline.html

Paired-end reads:

$ ~/nextflow run ~/GeneTEflow_pipelines/pipeline.PE.nf -c ~/GeneTEflow_pipelines/geneTEflow.PE.docker.config -with-dag flowchart.html -with-report nf.report.html -with-timeline nf.timeline.html

Optional 2: running GeneTEFlow by interacting with Singularity containers

Single-end reads:

$ ~/nextflow run ~/GeneTEflow_pipelines/pipeline.SE.nf -c ~/GeneTEflow_pipelines/geneTEflow.SE.Singularity.config -with-dag flowchart.html -with-report nf.report.html -with-timeline nf.timeline.html

Paired-end reads:

$ ~/nextflow run ~/GeneTEflow_pipelines/pipeline.PE.nf -c ~/GeneTEflow_pipelines/geneTEflow.PE.Singularity.config -with-dag flowchart.html -with-report nf.report.html -with-timeline nf.timeline.html

Section 5: Results generated by GeneTEFlow

Here human RNA sequencing data (GSE30352) were used as one example.

Significantly regulated genes identified by GeneTEFlow:

Significantly regulated transposable elements identified by GeneTEFlow:

Section 6: Log files generated by GeneTEFlow

GeneTEFlow generates three major log files: nf.report.html, nf.timeline.html, and flowchart.html.

One example is shown here from nf.report.html:

4. Q & A Section

1. Can I use a different human genome version (e.g. hg19) or a different species (e.g. mouse) in the GeneTEFlow pipeline? If yes, how do I do that, and where can I download the genome and gene annotation files?

Yes, you can choose your own species and genome version.

Here we use Mus musculus (mouse) mm10 as an example. The mouse reference genome UCSC mm10 and its gene annotation (.gtf) were downloaded from the Illumina iGenomes collection: https://support.illumina.com/sequencing/sequencing_software/igenome.html

$ wget http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz

$ tar xzvf Mus_musculus_UCSC_mm10.tar.gz

$ cp Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa .

$ cp Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf .

You also need to change the parameter “params.squireFetch.genome” (please see the "Parameters Configuration" section):

params.squireFetch.genome = "mm10"

This instructs SQuIRE to download the mm10 reference genome and TE annotations.

2. Can I run the GeneTEFlow pipeline in a step-by-step mode, and how?

Yes, the pipeline can be run flexibly in a step-by-step mode.

(1) Example 1

For example, if you already have BAM files, you can quantify gene expression with RSEM directly and skip the alignment step with STAR.
The command line is below:

$ docker run -v /your_working_directory:/mnt -w /mnt rnaseq_pipeline.app rsem-calculate-expression --bam --no-bam-output -p 8 --paired-end /mnt/your.bam /mnt/RSEMIndex_hg38_UCSC/hg38_UCSC /mnt/RSEM_Output

(2) Example 2

A more flexible way to run in step-by-step mode is to run QC first, remove any low-quality samples, and then continue with the downstream analysis.
The command line for the QC step is below:

$ ~/nextflow run ~/GeneTEflow_pipelines/pipeline.SE.QC.nf -c ~/GeneTEflow_pipelines/geneTEflow.SE.docker.QC.config -with-dag flowchart.html -with-report nf.report.html -with-timeline nf.timeline.html

After checking the QC results, you can remove low-quality samples and then continue with the downstream analysis.
The command line is below:

$ ~/nextflow run ~/GeneTEflow_pipelines/pipeline.SE.afterQC.nf -c ~/GeneTEflow_pipelines/geneTEflow.SE.docker.afterQC.config -with-dag flowchart.html -with-report nf.report.html -with-timeline nf.timeline.html
