This repository was archived by the owner on Apr 8, 2021. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03_seurat_clustering.Rmd
356 lines (259 loc) · 15.9 KB
/
03_seurat_clustering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
---
title: "Seurat Clustering"
author: "`r getOption('author')`"
date: "`r Sys.Date()`"
bibliography: bibliography.bib
params:
bcbFile: "data/bcb_pool_filtered.rda"
seuratName: "seurat"
pcCompute: 20
pcUse: FALSE
varsToRegress: !r c("nUMI", "S.Score", "G2M.Score")
resolution: 0.6
outputDir: "."
---
```{r setup, cache=FALSE, message=FALSE, warning=FALSE}
library(bcbioSingleCell)
# Shared RMarkdown settings
prepareSingleCellTemplate()
if (file.exists("setup.R")) {
source("setup.R")
}
# Directory paths
dataDir <- file.path(params$outputDir, "data")
markersDir <- file.path(params$outputDir, "results", "markers")
dir.create(markersDir, recursive = TRUE, showWarnings = FALSE)
# Load bcbioSingleCell object
bcbName <- load(params$bcbFile)
bcb <- get(bcbName, inherits = FALSE)
# Vector to use for plot looping
groupBy <- unique(c(
"ident", "sampleName", "Phase", interestingGroups(bcb)
))
# knitr arguments (for `rmarkdown::render()` looping)
opts_chunk$set(
cache.path = paste(
params$seuratName,
"clustering",
"cache/",
sep = "_"),
fig.path = paste(
params$seuratName,
"clustering",
"files/",
sep = "_")
)
```
```{r header, child="_header.Rmd", eval=file.exists("_header.Rmd")}
```
```{r sample_metadata}
sampleMetadata(bcb)
```
This workflow is adapted from the following sources:
- Satija Lab: [Seurat v2 Guided Clustering Tutorial](http://satijalab.org/seurat/pbmc3k_tutorial.html)
- Paul Hoffman: [Cell-Cycle Scoring and Regression](http://satijalab.org/seurat/cell_cycle_vignette.html)
* * *
# Initialize Seurat (`r bcbName`)
First, let's coerce our `bcbioSingleCell` object into a `seurat` object, without applying any additional filtering. Next, global-scaling normalization is applied to the raw counts with `NormalizeData()`, which (1) normalizes the gene expression measurements for each cell by the total expression, (2) multiplies this by a scale factor (10,000 by default), and (3) log-transforms the result. `FindVariableGenes()` is then called, which calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression. Finally, the genes are scaled and centered using the `ScaleData()` function.
```{r seurat, results="hide"}
# S4 object coercion method using `setAs()`, documented in `setAs.R` file. This
# handles gene to symbol conversion and stashes metadata in the @misc slot.
# getMethod("coerce", signature(from = "bcbioSingleCell", to = "seurat"))
seurat <- as(bcb, "seurat") %>%
NormalizeData(
object = .,
normalization.method = "LogNormalize",
scale.factor = 10000) %>%
FindVariableGenes(
object = .,
mean.function = ExpMean,
dispersion.function = LogVMR,
do.plot = FALSE) %>%
ScaleData(
object = .,
model.use = "linear")
```
# Plot variable genes
Here we are looking at the dispersion of the variable genes in the dataset. `VariableGenePlot()` plots dispersion (a normalized measure of to cell-to-cell variation) as a function of average expression for each gene. Our goal is to identify a set of high-variance genes, and we recommend setting the cutoff parameters in this function by visually evaluating this plot to define outliers. However, particularly when using UMI data, we often obtain similar results including much larger gene sets, including the entire transcriptome.
```{r variable_gene_plot}
VariableGenePlot(seurat)
```
# Regress out unwanted sources of variation
Your single-cell dataset likely contains "uninteresting" sources of variation. This can include technical noise, batch effects, and/or uncontrolled biological variation (e.g. cell cycle). Regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering [@Buettner2015-ur]. To mitigate the effect of these signals, [Seurat][] constructs linear models to predict gene expression based on user-defined variables. The scaled z-scored residuals of these models are stored in the `[email protected]` slot, and are used for dimensionality reduction and clustering.
## Cell-cycle scoring
First, we assign each cell a score, based on its expression of G2/M and S phase markers. These marker sets should be anticorrelated in their expression levels, and cells expressing neither are likely not cycling and in G1 phase. We assign scores in the `CellCycleScoring()` function, which stores S and G2/M scores in `[email protected]`, along with the predicted classification of each cell in either G2M, S or G1 phase.
```{r cell_cycle_markers}
ccm <- metadata(bcb)$organism %>%
str_match("^([A-Z])[a-z]+ ([a-z]+)$") %>%
.[, 2:3] %>%
as.character() %>%
paste0(collapse = "") %>%
tolower() %>%
cellCycleMarkers[[.]]
sGenes <- ccm %>%
dplyr::filter(phase == "S") %>%
pull("symbol")
g2mGenes <- ccm %>%
dplyr::filter(phase == "G2/M") %>%
pull("symbol")
```
```{r cell_cycle_scoring}
seurat <- CellCycleScoring(
seurat,
g2m.genes = g2mGenes,
s.genes = sGenes)
# Cell-cycle `Phase` column should now be added to `[email protected]`
assignAndSaveData(
name = paste(params$seuratName, "preregress", sep = "_"),
object = seurat,
dir = dataDir)
```
Here we are checking to see if the cells are grouping by cell cycle. If we don't see clear grouping of the cells into `G1`, `G2M`, and `S` clusters on the PCA plot, then it is recommended that we don't regress out cell-cycle variation. When this is the case, remove `S.Score` and `G2M.Score` from the variables to regress (`varsToRegress`) in the RMarkdown YAML parameters.
```{r cell_cycle_pca_preregress}
# colnames([email protected])
RunPCA(
seurat,
pc.genes = c(sGenes, g2mGenes),
do.print = FALSE) %>%
plotPCA(interestingGroups = "Phase", label = FALSE)
```
## Apply regression variables
Here we are regressing out variables of uninteresting variation, using the `vars.to.regress` argument in the `ScaleData()` function. When variables are defined in the `vars.to.regress` argument, [Seurat][] regresses them individually against each gene, then rescales and centers the resulting residuals. We generally recommend minimizing the effects of variable read count depth (`nUMI`) and mitochondrial gene expression (`mitoRatio`) as a standard first-pass approach. If the differences in mitochondrial gene expression represent a biological phenomenon that may help to distinguish cell clusters, then we advise not passing in `mitoRatio` here. When regressing out the effects of cell-cycle variation, include `S.Score` and `G2M.Score` in the `vars.to.regress` argument. Cell-cycle regression is generally recommended but should be avoided for samples containing cells undergoing differentiation.
```{r scale_data, results="hide"}
seurat <- ScaleData(seurat, vars.to.regress = params$varsToRegress)
```
Now that regression has been applied, let's recheck to see if the cells are no longer clustering by cycle. We should now see the phase clusters superimpose.
```{r cell_cycle_pca_postregress}
# colnames([email protected])
RunPCA(
seurat,
pc.genes = c(sGenes, g2mGenes),
do.print = FALSE) %>%
plotPCA(interestingGroups = "Phase", label = FALSE)
```
# Linear dimensionality reduction {.tabset}
Next, we perform principal component analysis (PCA) on the scaled data with `RunPCA()`. By default, the genes in `[email protected]` are used as input, but can be defined using the `pc.genes` argument. We have typically found that running dimensionality reduction on highly variable genes can improve performance. However, with UMI data — particularly after regressing out technical variables — we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome.
`ProjectPCA()` scores each gene in the dataset (including genes not included in the PCA) based on their correlation with the calculated components. Though we don't use this further here, it can be used to identify markers that are strongly correlated with cellular heterogeneity, but may not have passed through variable gene selection. The results of the projected PCA can be explored by setting `use.full = TRUE` for `PrintPCA()`.
```{r run_pca}
seurat <- seurat %>%
RunPCA(do.print = FALSE) %>%
ProjectPCA(do.print = FALSE)
```
In particular, `PCHeatmap()` allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and genes are ordered according to their PCA scores. Setting `cells.use` to a number plots the "extreme" cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets.
## `PCHeatmap()`
```{r pc_heatmap, fig.height=30, fig.width=10}
PCHeatmap(
seurat,
col.use = CustomPalette(
low = viridis(3)[[1]],
mid = viridis(3)[[2]],
high = viridis(3)[[3]]),
do.balanced = TRUE,
label.columns = FALSE,
pc.use = 1:params$pcCompute,
remove.key = TRUE)
```
## `VizPCA()`
```{r viz_pca, fig.height=40, fig.width=8}
VizPCA(
seurat,
pcs.use = 1:params$pcCompute,
do.balanced = TRUE,
nCol = 2)
```
## `PrintPCA()`
```{r print_pca}
PrintPCA(
seurat,
pcs.print = 1:params$pcCompute)
```
# Determine statistically significant principal components
To overcome the extensive technical noise in any single gene for scRNA-seq data, [Seurat][] clusters cells based on their PCA scores, with each PC essentially representing a "metagene" that combines information across a correlated gene set. Determining how many PCs to include downstream is therefore an important step. To accomplish this, we plot the standard deviation of each PC as an elbow plot with `PCElbowPlot()`.
PC selection — identifying the true dimensionality of a dataset — is an important step for [Seurat][], but can be challenging/uncertain. We therefore suggest these three approaches to consider:
1. Supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example.
2. Implement a statistical test based on a random null model. This can be time-consuming for large datasets, and may not return a clear PC cutoff.
3. **Heuristic approach**, using a metric that can be calculated instantly.
We're using a heuristic approach here, by calculating where the principal components start to elbow. The plots below show where we have defined the principal compoment cutoff used downstream for dimensionality reduction. This is calculated automatically as the larger value of:
1. The point where the principal components only contribute 5% of standard deviation (bottom left).
2. The point where the principal components cumulatively contribute 90% of the standard deviation (bottom right).
This methodology is also commonly used for PC covariate analysis on bulk RNA-seq samples.
```{r pc_use}
pcUse <- params$pcUse
if (!is.numeric(params$pcUse)) {
pcUse <- plotPCElbow(seurat)
}
```
We are using `r length(pcUse)` principal components for dimensionality reduction calculations.
# Cluster the cells
Seurat now includes an graph-based clustering approach. Importantly, the *distance metric* which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar gene expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’. As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard distance). To cluster the cells, we apply modularity optimization techniques [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function.
The `FindClusters()` function implements the procedure, and contains a `resolution` argument that sets the "granularity" of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between `0.6`-`1.2` typically returns good results for single cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters are saved in the `seurat@ident` slot.
Regarding the value of the `resolution` argument, use a value `< 1` if you want to obtain fewer clusters.
```{r find_clusters, results="hide"}
seurat <- FindClusters(
seurat,
dims.use = pcUse,
force.recalc = TRUE,
print.output = TRUE,
resolution = params$resolution,
save.SNN = TRUE)
```
A useful feature in [Seurat][] v2.0 is the ability to recall the parameters that were used in the latest function calls for commonly used functions. For FindClusters, we provide the function PrintFindClustersParams to print a nicely formatted formatted summary of the parameters that were chosen.
```{r print_find_clusters_params}
PrintFindClustersParams(seurat)
```
# Run non-linear dimensional reduction (tSNE) {.tabset}
[Seurat][] continues to use tSNE as a powerful tool to visualize and explore these datasets. While we no longer advise clustering directly on tSNE components, cells within the graph-based clusters determined above should co-localize on the tSNE plot. This is because the tSNE aims to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space. As input to the tSNE, we suggest using the same PCs as input to the clustering analysis, although computing the tSNE based on scaled gene expression is also supported using the `genes.use` argument.
```{r run_tsne}
seurat <- RunTSNE(
seurat,
dims.use = pcUse,
do.fast = TRUE)
assignAndSaveData(
name = params$seuratName,
object = seurat,
dir = dataDir)
```
```{r print_tsne_params}
PrintTSNEParams(seurat)
```
```{r dim_reduction_plots, results="asis"}
lapply(seq_along(groupBy), function(a) {
mdHeader(groupBy[[a]], level = 2, asis = TRUE, tabset = TRUE)
mdHeader("tSNE", level = 3, asis = TRUE)
plotTSNE(
seurat,
interestingGroups = groupBy[[a]],
label = TRUE) %>%
show()
mdHeader("PCA", level = 3, asis = TRUE)
plotPCA(
seurat,
interestingGroups = groupBy[[a]],
label = TRUE) %>%
show()
}) %>%
invisible()
```
Note that tSNE is not PCA! The measurement of distance in a tSNE plot is difficult to interpret, and is most helpful for the relationships of close neighbors. To better infer separation distance between the putative clusters, let's reapply PCA.
# Cluster quality control
Let's look at the variance in the number of UMI counts (`nUMI`), gene detection (`nGene`), and the percentage of mitochondrial gene expression (`mitoRatio`), to see if there are any obvious cluster artefacts. We can also assess cell cycle batch effects (`S.Score`, `G2M.Score`) and any principal component bias toward individual clusters.
```{r plot_feature_metrics}
plotFeatureTSNE(
seurat,
features = c("nUMI", "nGene",
"log10GenesPerUMI", "mitoRatio",
"S.Score", "G2M.Score"),
pointSize = 1,
labelSize = 4.5
)
```
```{r plot_feature_pc}
plotFeatureTSNE(
seurat,
features = paste0("PC", pcUse),
pointSize = 0.8,
labelSize = 4
)
```
```{r footer, child="_footer.Rmd", eval=file.exists("_footer.Rmd")}
```