This repository contains the code for the scRNA-seq analysis conducted as part of my MSc research project. The analysis is divided into two main pipelines:
- Upstream Analysis Pipeline
- Downstream Analysis Pipeline
These pipelines are designed to ensure efficient and reproducible workflows for processing and analysing single-cell RNA sequencing (scRNA-seq) data.
The upstream analysis pipeline focuses on the initial processing and preparation of the raw scRNA-seq data. The steps included in this pipeline are:
- Raw Data (Count Table): The pipeline begins with the raw count table obtained from sequencing.
- Data Cleaning: The raw data are filtered and cleaned to remove irrelevant or low-quality entries.
- Quality Control: Cells and genes that do not meet specific quality criteria are removed.
- Doublet Removal: Potential doublets, where two cells are captured together as one, are identified and removed so that they do not skew the analysis.
- Normalisation: Each cell's counts are normalised to the median total count, then log1p-transformed.
- Batch Correction: The data are batch-corrected with scVI to account for sample-level differences.
- PCA (Principal Component Analysis): PCA reduces the dimensionality of the data and captures the major axes of variation.
- kNN/UMAP: A k-nearest-neighbour (kNN) graph is built, and UMAP is applied to visualise the data in lower dimensions.
- Clustering: Cells are clustered by similarity to identify groups with similar expression profiles.
- Cell Annotation: Finally, the clusters are annotated to assign biological meaning, identifying different cell types or states.
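
As a rough illustration of the quality control and normalisation steps, the NumPy sketch below filters a small cells x genes count matrix, scales each cell to the median total count, and applies log1p. The function name and the `min_genes` and `min_cells` thresholds are hypothetical and are not taken from the notebooks. If the notebooks use Scanpy for this step, the equivalent calls would likely be `sc.pp.filter_cells`, `sc.pp.filter_genes`, `sc.pp.normalize_total` and `sc.pp.log1p`, but that is an assumption.

```python
import numpy as np

def filter_and_normalise(counts, min_genes=2, min_cells=1):
    """Filter a cells x genes count matrix, scale to median depth, then log1p.

    The thresholds are illustrative placeholders; min_genes is assumed to be
    at least 1 so that every retained cell has a non-zero total count.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep cells that express at least `min_genes` genes.
    keep_cells = (counts > 0).sum(axis=1) >= min_genes
    counts = counts[keep_cells]
    # Keep genes detected in at least `min_cells` of the remaining cells.
    keep_genes = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, keep_genes]
    # Scale every cell so that its total equals the median total across cells.
    totals = counts.sum(axis=1, keepdims=True)
    target = np.median(totals)
    scaled = counts / totals * target
    # log1p computes log(1 + x), so zero counts stay at zero.
    return np.log1p(scaled), keep_cells, keep_genes
```

After this transformation every retained cell has the same total count on the un-logged scale, which removes sequencing depth as a source of variation between cells.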
The output of this pipeline is an annotated data table that serves as the input for downstream analysis.
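
The PCA and kNN steps can be sketched in plain NumPy as follows. This is a minimal illustration of the two techniques, not the repository's implementation: the function names are hypothetical, the distance computation is brute force, and a real dataset would normally go through a library routine instead (for example `sc.pp.pca` and `sc.pp.neighbors` if Scanpy is used, which is an assumption).

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ Vt[:n_components].T

def knn_indices(scores, k=2):
    """Return, for each cell, the indices of its k nearest other cells."""
    # Pairwise Euclidean distances between all cells (O(n^2) memory).
    diff = scores[:, None, :] - scores[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]
```

The neighbour indices define the kNN graph that graph-based clustering and UMAP then operate on.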
The downstream analysis pipeline focuses on deriving biological insights from the annotated data table produced by the upstream pipeline. The steps include:
- DEG (Differentially Expressed Gene) Analysis: Identifies genes that are differentially expressed between conditions or clusters. Scanpy, DESeq2 and limma-voom methods are all provided.
- Gene Ontology (GO) Analysis: Identifies the biological processes, cellular components, and molecular functions that are enriched in the differentially expressed genes.
  - GO Enrichment Map: Visualisation of the GO terms enriched in the dataset.
- KEGG Enrichment Analysis: Identifies the pathways that are enriched in the differentially expressed genes.
  - Manual KEGG Pathway Analysis: Further manual curation and interpretation of the KEGG pathways to understand the underlying biological processes.
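
To make "enriched" concrete: one common way to test a GO term or KEGG pathway is over-representation analysis with a one-sided hypergeometric test. Whether the scripts in this repository use exactly this test is an assumption, and the function name and arguments below are hypothetical. The sketch computes the probability of seeing at least the observed overlap between the DEG list and a term's gene set by chance.

```python
from math import comb

def enrichment_p_value(n_background, n_in_term, n_deg, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_background: number of genes in the background (universe)
    n_in_term:    background genes annotated to the term or pathway
    n_deg:        number of differentially expressed genes
    n_overlap:    DEGs that are annotated to the term or pathway
    """
    upper = min(n_in_term, n_deg)
    # Number of ways to draw n_deg genes from the background.
    total = comb(n_background, n_deg)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_deg - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

In practice the p-values for all tested terms would also be corrected for multiple testing before any term is reported as enriched.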
To run the analysis, follow the instructions in the respective Jupyter notebooks and R scripts available in the repository.