-
Notifications
You must be signed in to change notification settings - Fork 3
Home
- To get familiarised with R scripting
- Understanding the public data available through an article
- Find differentially expressed genes in PrimaryColon Vs Normal and Metastasis Vs PrimaryColon
-
Please refer the github page complete analysis pipeline steps and the corresponding scripts uploaded in the link https://github.com/IBEXCluster/B322.
-
Here is the simple way to download all the scripts for your testing.
- Please login into Ibex and execute the below command (git clone).
git clone --recursive https://github.com/IBEXCluster/B322.git
(This will download all the scripts & dataset for R available in the GitHub repository)
- Change the directory to B322
cd B322
- To execute anyone of the script, use sh <script name>.
For example: sh ./01_download.sh
This above script will ask the SRRxxx number to download the dataset. This script will create a data/ directory and the job will be submitted to queue. The SRRxxx files (Read1 and Read2) will be downloaded into data/ directory.
Scripts summary for your hands-on session
-
01_download.sh: Script is to download Tutorial data from the GEO accession GSE50760
-
02_quality_control.sh: Script is to run quality control tool for high throughput sequence data using FastQC
-
03_trimming.sh: Script is to Trim the Fastq file using TRIMMOMATIC
-
04_quality_control_after_trimming.sh: Script is to run quality control tool for high throughput sequence data using FastQC AFTER trimming the reads
-
05_multiQC.sh: Script is to provide Aggregate results from multiple FastQC files generted from above step 04.
-
06a_mapping_raw_reads.sh:(Optional) Script is to run mapper for RNA-Seq reads (Raw reads from 01 step) using Homo_sapiens GRCh38 reference
-
06b_mapping_trimmed_reads.sh: Script is to run mapper for RNA-Seq reads (from the step 04, paired fastq.gz files using Homo_sapiens GRCh38 reference
-
07_HTSeq.sh: Script is to Analyse the high-throughput sequencing data
Differential expression hands-on session
- The above link has a folder “counts” where the files needed for differential expression task is available.
- make sure you create a directory for this hands on ibex either on your home directory or under your scratch folder.
- copy the meta data and counts files for GSE50760 in your working directory.
- Go through edgeR user manual for differential expression analysis.
- Refer IntroductionToR slides for exploratory data analysis.
For this exercise, you need to login to ibex and load the following modules on command line
Loading modules
module load R/3.6.0/gnu-6.4.0
module load RStudio_Desktop/1.1.383
To invoke RStudio to work with R
type the below word, on the command line
rstudio &
##Load the following R libraries in your script
library(limma)
library(edgeR)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(reshape2)
library(GGally)
library(EnhancedVolcano)
ENSEMBL Id to Gene Symbol Code
####Code to replace ensembl ids in the count table, with gene symbol
ensg <- sub("\..*", "", rownames(cts)) # remove version number
sym <- mapIds(org.Hs.eg.db, keys=ensg, column="SYMBOL", keytype="ENSEMBL")
gene <- data.frame(ENSGID=ensg, SYMBOL=sym, stringsAsFactors=F) #use the above vector for you raw count rownames
Hands on Questions
Explore the data 1a. remove genes with no expression for all samples 1b. boxplot of library size for Tissue group and Subjects 1c. filter genes by expression before normalization 1d. Look at the difference in cpm of raw expression and filtered expression values by density plots
Data normalisation by TMM method
Run multidimensionality reduction like PCA or MDS Find out the outlier or not properly grouped samples and remove them for downstream analysis
Differential expression for two contrasts a. PrimaryColon Vs Normal, b. MetastasisColon Vs PrimaryColon
Find differentially expressed genes at p value < 0.01 NS lfc=log2(4)
Find the functional analysis of top 10 differentially expressed genes in case a and case b (Use either DAVID or gprofiler for this)
Handson Results (for submission)
Write a 3 page report on Population transcriptomic analysis explaining the workflow starting from input data to differential expression analysis. Explain in detail what does each part of the pipe line does,understand why you have to do that, tools used, output files, etc For differential expression analysis, make sure you could observe the DE genes, same as the authors of our reference paper.
We are looking for a report from a team of two students. If you can't make a group with your fellow student, we expect a report per student. Reports not following these rules will not be considered for evaluation. Any queries related this lab should be addressed via B322 slack channel https://kaust-ibex.slack.com/
Students are requested to send the report to [email protected], mentioning student(s) name and id
Deadline for submission: March 28th 5.00pm