
Single cell RNA seq analysis

Nhi Hin edited this page Feb 8, 2021 · 9 revisions

Background knowledge

Pre-Processing

We use several steps from the 10x Genomics Cell Ranger pipeline, which is described in more detail here. The basic steps are:

  1. cellranger mkfastq: demultiplexes raw base call (BCL) files generated by Illumina sequencers into FASTQ files. It is a wrapper around Illumina's bcl2fastq, with additional useful features that are specific to 10x libraries and a simplified sample sheet format.

  2. cellranger count: takes FASTQ files from cellranger mkfastq and performs alignment, filtering, barcode counting, and UMI counting. It uses the Chromium cellular barcodes to generate feature-barcode matrices, determine clusters, and perform gene expression analysis. The count pipeline can take input from multiple sequencing runs on the same GEM well. cellranger count also processes Feature Barcode data alongside Gene Expression reads.

I don't use the cellranger aggr (aggregate) step because, in my experiments, Seurat's integration functionality worked better for comparing relative expression between different samples (datasets).

Note: If you wish to quantify a marker gene (e.g. GFP), then it is necessary to add it to the reference before running Cellranger. Instructions can be found here.

The results of cellranger count can be imported into R using Seurat. The rest of the scRNA-seq analysis will use R.
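As a minimal sketch of the import step, the function below reads the filtered feature-barcode matrix written by cellranger count into a Seurat object. The helper name `load_cellranger_sample`, the example paths, and the `min.cells` / `min.features` cut-offs are illustrative assumptions rather than settings from this project; the `outs/filtered_feature_bc_matrix` folder is the layout written by Cell Ranger 3 and later.

```r
library(Seurat)

# Hypothetical helper: load one cellranger count run into a Seurat object.
# `run_dir` is the folder created by cellranger count (named after its --id).
load_cellranger_sample <- function(run_dir, sample_name) {
  matrix_dir <- file.path(run_dir, "outs", "filtered_feature_bc_matrix")
  counts <- Read10X(data.dir = matrix_dir)

  # When Feature Barcode data were processed as well, Read10X returns a list
  # with one matrix per feature type; keep the gene expression matrix here.
  if (is.list(counts)) {
    counts <- counts[["Gene Expression"]]
  }

  # The filtering thresholds below are placeholder values, not recommendations.
  CreateSeuratObject(
    counts = counts,
    project = sample_name,
    min.cells = 3,
    min.features = 200
  )
}
```

It could be called once per sample, for example `sample_a <- load_cellranger_sample("cellranger_out/sample_a", "sample_a")`, where the path is made up for illustration.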

Relevant R Packages

The following R packages should be installed:

  • Seurat: This is where the bulk of the scRNA-seq analysis happens. Seurat contains all the functions to do processing of the data, including quality checking (QC), removing low-quality/unwanted cells, normalisation, standardisation/scaling, and feature selection. It can also do all the basic analyses such as unsupervised clustering of cells and identifying marker genes for clusters/biological conditions.

  • Monocle 3: This is great for performing trajectory analysis. The aim of this kind of analysis is to order the cells along "pseudotime", which lets us see how cells change or differentiate into different cell types. This analysis is useful if we are trying to identify novel cell types in the data.

  • scPred (paper here): This uses a support vector machine (SVM) with a radial kernel to build a model that classifies cells. It works really well with Seurat objects and allows us to train a model on an existing, labelled training dataset that is similar to the one being analysed. The model can then be applied to our own dataset to identify cell types based on the training dataset. A great feature is that it supports an "unassigned" group for cells that can't be categorised. The weakness is that you need a suitable training dataset, but it saves a lot of time in identifying the main cell types in the data and reduces the need to check a large number of marker genes.

  • DoRothEA: This is a newer approach to analysing scRNA-seq data that summarises gene expression into transcription factor activity. DoRothEA is a collection of "regulons", which describe the human/mouse genes regulated by each transcription factor. It uses VIPER to estimate transcription factor activity based on the regulons. In my experience it works very robustly and can make the data easier to interpret, especially since transcription factor activity can be quite characteristic of certain cell types (those with well-defined transcription factors). It works with Seurat objects, which makes it very convenient to rerun a Seurat analysis using the estimated transcription factor activity instead of gene expression. Background reading: paper that compares DoRothEA/VIPER to other approaches.

Data Visualisation Resources

  • Code snippet for 3D plotting of Seurat object UMAP

  • schex bins cells into hexagons so that overlapping layers of cells are "flattened" when visualising PCAs/UMAPs/t-SNEs, which makes it easy to see whether cells are expressing particular marker genes or not.

  • Seurat vignettes describe many visualisation methods for Seurat objects. The interactive plotting methods are definitely worth taking a look at.

General Analysis Workflow

  1. Import each sample (dataset) into R and analyse it separately with Seurat. See Seurat guided tutorial for the analysis steps. Note that clustering of individual datasets is not required if they are to be integrated later.

  2. Integrate the separate samples using Seurat's integration anchor functionality. This is described in the "Standard Workflow" tab of this page in the Seurat documentation.
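The two steps above can be sketched roughly as follows, assuming the anchor-based workflow from Seurat v3/v4 (newer Seurat versions also offer a layer-based integration interface, which is not shown). The function names `prepare_sample` and `integrate_samples`, and the choices of 2000 variable features and 30 dimensions, are illustrative assumptions. QC filtering of each sample is assumed to have been done beforehand.

```r
library(Seurat)

# Per-sample preprocessing before integration. Clustering is skipped here
# because it is not needed for samples that will be integrated.
prepare_sample <- function(obj) {
  obj <- NormalizeData(obj)
  FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
}

# Integrate a list of Seurat objects (one per sample) using anchors.
integrate_samples <- function(sample_list, dims = 1:30) {
  sample_list <- lapply(sample_list, prepare_sample)

  anchors <- FindIntegrationAnchors(object.list = sample_list, dims = dims)
  integrated <- IntegrateData(anchorset = anchors, dims = dims)

  # Downstream dimension reduction uses the batch-corrected "integrated" assay.
  DefaultAssay(integrated) <- "integrated"
  integrated <- ScaleData(integrated)
  integrated <- RunPCA(integrated, npcs = max(dims))
  RunUMAP(integrated, reduction = "pca", dims = dims)
}
```

A call might look like `integrated <- integrate_samples(list(sample_a, sample_b))`, where `sample_a` and `sample_b` are placeholder names for the per-sample Seurat objects.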

Once the datasets have been integrated into a single Seurat object, the following analyses can be done depending on the aims of the project:

Check expression of marker genes / biological genes of interest

  • This might be for a transgene such as GFP or for other genes known to be of biological interest (e.g. marker genes of different cell types).

  • The FeaturePlot function of Seurat can be used to visualise expression of genes across a dataset.
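A small sketch of this check is shown below. The wrapper name `plot_marker_expression` is hypothetical, and it assumes the object has a UMAP reduction and that the transgene is named "GFP" in the reference used by Cell Ranger. Switching to the "RNA" assay is a common choice for visualising expression in an integrated object, but it is an assumption here rather than something this page prescribes.

```r
library(Seurat)

# Hypothetical wrapper: plot expression of genes of interest on the UMAP.
plot_marker_expression <- function(obj, genes = c("GFP")) {
  # Visualise normalised expression rather than the integrated values.
  DefaultAssay(obj) <- "RNA"

  # Only plot genes that are actually present in the object.
  present <- intersect(genes, rownames(obj))
  if (length(present) == 0) {
    stop("None of the requested genes were found in the object.")
  }

  FeaturePlot(obj, features = present, reduction = "umap")
}
```

If a transgene is missing from the plot, it is worth confirming that it was added to the reference before running Cell Ranger, as noted in the Pre-Processing section.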

Cell Classification (using scPred)

  • A tutorial for using scPred can be found here. Note: This requires a training dataset which has its cell types labelled. I recommend Tabula Muris for mouse datasets as they have many different tissue types and actually provide the Seurat objects for each tissue (as R objects). The R objects can be downloaded here.

  • Subclustering: If you are happy with the scPred classifications, they can be further refined using Seurat. Use Seurat to subset the integrated Seurat object to the cells of interest (e.g. cells classified as leukocytes). The subsetted Seurat object then needs to be scaled/standardised and subjected to feature selection again; see this page in the Seurat documentation: "note that if you wish to perform additional rounds of clustering after subsetting we recommend re-running FindVariableFeatures() and ScaleData()".
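The classification and subsetting steps might look roughly like the sketch below. It assumes the Seurat-based scPred interface (`getFeatureSpace`, `trainModel`, `scPredict`), a reference object that has already been normalised, scaled and run through PCA, and a metadata column holding the reference labels whose name (`"cell_type"` here) is a placeholder. The function names `classify_with_scpred` and `subset_for_subclustering` are hypothetical, and the scPred tutorial should be treated as the authoritative description of the steps.

```r
library(Seurat)
library(scPred)

# Train on a labelled reference and classify the query cells.
# `label_col` is the metadata column with the reference cell type labels.
classify_with_scpred <- function(reference, query, label_col = "cell_type") {
  reference <- getFeatureSpace(reference, label_col)
  reference <- trainModel(reference)
  # Predictions are stored in the query metadata as "scpred_prediction".
  scPredict(query, reference)
}

# Subset to the classified cell types of interest, then redo feature
# selection and scaling as recommended in the Seurat documentation.
subset_for_subclustering <- function(obj, cell_types, npcs = 30) {
  keep <- colnames(obj)[obj$scpred_prediction %in% cell_types]
  sub <- subset(obj, cells = keep)

  # These steps run on the object's current default assay; which assay to
  # use after integration is left to the analyst.
  sub <- FindVariableFeatures(sub)
  sub <- ScaleData(sub)
  RunPCA(sub, npcs = npcs)
}
```

The label passed in `cell_types` (for example `"leukocyte"`) has to match the labels used in the training dataset exactly.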

Marker gene identification / "Differential gene expression"

  • The FindMarkers or FindAllMarkers functions in Seurat can be used either to identify markers of a particular cell cluster or to identify DE genes between different conditions/groups.
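Both uses can be sketched as below. The wrapper names and the metadata column passed as `condition_col` are hypothetical, and the `min.pct` and `logfc.threshold` values are illustrative rather than tuned. Testing on the "RNA" assay instead of the integrated assay is an assumption that follows common Seurat practice.

```r
library(Seurat)

# Markers for every cluster (uses the object's current identities).
find_cluster_markers <- function(obj) {
  DefaultAssay(obj) <- "RNA"
  FindAllMarkers(obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
}

# DE genes between two groups defined by a metadata column,
# e.g. a hypothetical "condition" column with values "treated" / "control".
find_condition_de <- function(obj, condition_col, group_1, group_2) {
  DefaultAssay(obj) <- "RNA"
  FindMarkers(
    obj,
    ident.1 = group_1,
    ident.2 = group_2,
    group.by = condition_col
  )
}
```

For a comparison within one cell type, the object can be subsetted to that cell type first and then passed to `find_condition_de`.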

Unsupervised Clustering

  • I would definitely go for cell classification using scPred first as the results are very interpretable, but if a suitable training dataset is not available, then clustering would be the best approach to try and define the cell types present in the dataset. Clustering can be performed using the graph-based method implemented in Seurat.
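A minimal sketch of Seurat's graph-based clustering is given below. It assumes PCA has already been run on the object; the function name `cluster_cells_graph` and the default `dims` and `resolution` values are illustrative assumptions that usually need adjusting per dataset.

```r
library(Seurat)

# Graph-based clustering: build a nearest-neighbour graph on the PCA
# embedding, then partition it into clusters.
cluster_cells_graph <- function(obj, dims = 1:30, resolution = 0.5) {
  obj <- FindNeighbors(obj, reduction = "pca", dims = dims)
  # Higher resolution values generally give more, smaller clusters.
  FindClusters(obj, resolution = resolution)
}
```

The resulting clusters can be inspected with `DimPlot(obj, reduction = "umap")` and annotated by checking marker genes for each cluster.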

Trajectory Analysis

See the following links:

  • This page goes through several methods that can be used for trajectory analysis including example R code.

  • This book chapter goes through a lot of the fundamentals of trajectory analysis (and scRNA-seq analysis in general), and includes instructions for running Slingshot.

  • Dynverse is a collection of R packages that use Docker to run trajectory analysis methods. It is very robust and easy to use, and has a Shiny interface for choosing the appropriate trajectory method. Lecture slides can be found here.

  • Monocle3 is also commonly used for trajectory analysis. A wrapper function is available from SeuratWrappers that converts a Seurat object into the CDS object accepted by Monocle3.

  • scVelo is a very popular method based on RNA velocity (estimated from the relative abundance of unspliced and spliced transcripts), which is an alternative to the trajectory methods described above. Spliced and unspliced counts first need to be calculated using a program like Velocyto, which runs on the command line. scVelo is a Python package, and a guide has been written that allows Seurat information to be transferred to this analysis.

  • I haven't got this to work yet, but apparently scVelo can be run with Seurat objects directly exported as h5ad objects, see this guide.
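For the Monocle3 route mentioned in the list above, a rough sketch of the conversion and pseudotime ordering is shown below. It assumes the Seurat object already has a UMAP reduction and that a sensible set of starting cells is known; the function name `run_monocle3_trajectory` and the `root_cells` argument value are placeholders to be supplied by the analyst.

```r
library(Seurat)
library(SeuratWrappers)
library(monocle3)

# Convert a Seurat object to a Monocle3 cell_data_set and order the cells
# in pseudotime. `root_cells` is a vector of cell names to start from.
run_monocle3_trajectory <- function(obj, root_cells) {
  cds <- as.cell_data_set(obj)

  # Monocle3 needs its own clusters/partitions before learning the graph.
  cds <- cluster_cells(cds, reduction_method = "UMAP")
  cds <- learn_graph(cds, use_partition = TRUE)

  order_cells(cds, reduction_method = "UMAP", root_cells = root_cells)
}
```

The result can be visualised with `plot_cells(cds, color_cells_by = "pseudotime")`. The choice of root cells strongly affects the ordering, so it should be based on prior biological knowledge.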

R Workflow

  • We are using workflowr to organise analyses. My notes for setting up workflowr can be found here.