-
Notifications
You must be signed in to change notification settings - Fork 18
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #999 from allyhawkins/allyhawkins/explore-consensu…
…s-labels Notebook for exploring consensus cell type labels across ScPCA samples
- Loading branch information
Showing
6 changed files
with
2,676 additions
and
0 deletions.
There are no files selected for viewing
291 changes: 291 additions & 0 deletions
291
analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,291 @@ | ||
--- | ||
title: "Explore consensus cell types" | ||
author: Ally Hawkins | ||
date: "`r Sys.Date()`" | ||
output: | ||
html_notebook: | ||
toc: true | ||
toc_depth: 3 | ||
code_folding: show | ||
--- | ||
|
||
This notebook summarizes the findings from assigning consensus cell type labels to all ScPCA samples. | ||
All results from the `cell-type-consensus` module in `OpenScPCA-nf` must be saved to `results` prior to rendering this notebook. | ||
|
||
```{r packages} | ||
suppressPackageStartupMessages({ | ||
# load required packages | ||
library(ggplot2) | ||
}) | ||
# Set default ggplot theme | ||
theme_set( | ||
theme_classic() | ||
) | ||
``` | ||
|
||
## Functions | ||
|
||
```{r} | ||
# function to read in project data frames with all cells in a project | ||
# output is a summarized table with total cells per sample, total cells per annotation, and number of cell types | ||
summarize_celltypes <- function(file, id){ | ||
# read in data | ||
df <- readr::read_tsv(file) | ||
# get total cell count and number of assigned cell types per library | ||
total_cells_df <- df |> | ||
dplyr::group_by(library_id) |> | ||
dplyr::summarize( | ||
total_cells_per_library = length(library_id), | ||
num_celltypes = length(unique(consensus_annotation)) | ||
) | ||
summary_df <- df |> | ||
dplyr::group_by(library_id, consensus_annotation, consensus_ontology) |> | ||
dplyr::summarize(total_cells_per_annotation = length(consensus_annotation)) |> | ||
dplyr::left_join(total_cells_df, by = "library_id") |> | ||
dplyr::mutate( | ||
# add percentage | ||
percent_cells_annotation = round((total_cells_per_annotation / total_cells_per_library) * 100, 2) | ||
) |> | ||
dplyr::ungroup() | ||
return(summary_df) | ||
} | ||
``` | ||
|
||
## Data setup | ||
|
||
|
||
```{r base paths} | ||
# The base path for the OpenScPCA repository, found by its (hidden) .git directory | ||
repository_base <- rprojroot::find_root(rprojroot::is_git_root) | ||
module_base <- file.path(repository_base, "analyses", "cell-type-consensus") | ||
# results directory with cell-type-consensus | ||
results_dir <- file.path(module_base, "results", "cell-type-consensus") | ||
# diagnoses table used for labeling plots | ||
diagnoses_file <- file.path(module_base, "sample-info", "project-diagnoses.tsv") | ||
``` | ||
|
||
```{r} | ||
# list all results files | ||
results_files <- list.files(results_dir, pattern = "_consensus-cell-types\\.tsv.\\gz$", full.names = TRUE) | ||
# get project ids from file list | ||
project_ids <- stringr::str_remove(basename(results_files), "_consensus-cell-types.tsv.gz") | ||
names(results_files) <- project_ids | ||
# remove cell line projects from file list | ||
cell_line_projects <- c("SCPCP000020", "SCPCP000024") | ||
project_ids <- setdiff(project_ids, cell_line_projects) # remove cell line projects | ||
results_files <- results_files[project_ids] | ||
``` | ||
|
||
|
||
```{r, message=FALSE} | ||
# read in diagnoses | ||
diagnoses_df <- readr::read_tsv(diagnoses_file) | ||
# read in results and prep data frame for plotting | ||
all_results_df <- results_files |> | ||
purrr::imap(summarize_celltypes) |> | ||
dplyr::bind_rows(.id = "project_id") |> | ||
# add in diagnoses | ||
dplyr::left_join(diagnoses_df, by = "project_id") |> | ||
dplyr::mutate( | ||
# create a label for plotting | ||
project_label = glue::glue("{project_id}:{diagnosis}") | ||
) | ||
``` | ||
|
||
## Is it all just Unknown? | ||
|
||
The first thing we will look at is how many of the cells in each sample are categorized as "Unknown", which means no consensus between `SingleR` and `CellAssign` was identified. | ||
|
||
```{r, fig.height=7} | ||
unknown_only <- all_results_df |> | ||
dplyr::filter(consensus_annotation == "Unknown") | ||
ggplot(unknown_only, aes(x = project_label, y = percent_cells_annotation)) + | ||
ggforce::geom_sina(size = 0.1) + | ||
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), | ||
plot.margin = margin(10,10,10,10)) + | ||
labs( | ||
x = "", | ||
y = "Percent of cells annotated as Unknown" | ||
) | ||
``` | ||
|
||
It looks like we do have some samples that aren't just all "Unknown"! | ||
It definitely varies by project, but for most projects we at least see some proportion of samples with assigned cell types. | ||
|
||
Let's look at how many samples actually have some cells outside of unknown identified. | ||
To do this, we will identify all libraries that only have cells called as "Unknown". | ||
|
||
```{r} | ||
high_tumor_df <- unknown_only |> | ||
dplyr::mutate(no_cells_identified = percent_cells_annotation == 100) |> | ||
dplyr::group_by(project_label) |> | ||
dplyr::summarize(all_unknown = sum(no_cells_identified), | ||
classified_cells = sum(!no_cells_identified), | ||
percentage_unknown = round(all_unknown/(all_unknown + classified_cells)*100, 2), | ||
# add number of libraries for plotting | ||
total_libraries = length(library_id)) |> | ||
# set order for plots | ||
dplyr::mutate(project_label = forcats::fct_reorder(project_label, total_libraries, .desc = TRUE)) | ||
``` | ||
|
||
|
||
Which projects have the highest proportion of samples with all "Unknown"? | ||
|
||
```{r} | ||
# table with percentage of samples | ||
high_tumor_df |> | ||
dplyr::select(project_label, percentage_unknown) |> | ||
dplyr::arrange(desc(percentage_unknown)) | ||
``` | ||
|
||
It looks like all projects do have cell types identified that are not "Unknown". | ||
However, `SCPCP000011` (retinoblastoma), has a fairly high percentage of samples without any consensus labels. | ||
|
||
## Number of cell types observed | ||
|
||
Below we look at the number of cell types observed in each project for all samples. | ||
This does not include cells labeled as "Unknown". | ||
|
||
```{r, fig.height=10} | ||
num_celltypes_df <- all_results_df |> | ||
# add a new line for facet labels | ||
dplyr::mutate(facet_label = glue::glue("{project_id}\n{diagnosis}")) |> | ||
# remove unknown as a cell type | ||
dplyr::filter(consensus_annotation != "Unknown") |> | ||
dplyr::select(facet_label, library_id, num_celltypes) |> | ||
unique() | ||
ggplot(num_celltypes_df, aes(x = num_celltypes)) + | ||
geom_histogram(binwidth = 1, center = 0) + | ||
facet_wrap(vars(facet_label), | ||
ncol = 3) + | ||
labs( | ||
x = "Number of cell types" | ||
) + | ||
theme_bw() | ||
``` | ||
|
||
## Distribution of consensus cell types | ||
|
||
Now we look at the distribution of the cell types in each sample. | ||
For these plots, we will pull out the top 9 cell types for each project. | ||
All other cells will be labeled with "All remaining cell types". | ||
|
||
The top cell types are determined by counting how many libraries each cell type is found in within a project and taking the most frequent types. | ||
|
||
```{r} | ||
plot_df <- all_results_df |> | ||
dplyr::group_by(project_id) |> | ||
dplyr::mutate( | ||
# get most frequently observed cell types across libraries in that project | ||
top_celltypes = forcats::fct_lump_n(consensus_annotation, 9, other_level = "All remaining cell types", ties.method = "first") |> | ||
# sort by frequency | ||
forcats::fct_infreq() |> | ||
# make sure all remaining and unknown are last, use this to assign colors in specific order | ||
forcats::fct_relevel("All remaining cell types", "Unknown", after = Inf) | ||
) | ||
# get all unique cell types ordered by frequency | ||
unique_celltypes <- plot_df |> | ||
dplyr::filter(!top_celltypes %in% c("All remaining cell types", "Unknown")) |> | ||
dplyr::pull(top_celltypes) |> | ||
unique() |> | ||
sort() |> | ||
as.character() | ||
# get color palette | ||
colors <- c( | ||
palette.colors(palette = "alphabet"), | ||
"black", # 1 extra since alphabet is 26 and we have 27, this will be plasma cell which shows up once | ||
"grey60", | ||
"grey95" | ||
) | ||
names(colors) <- c(unique_celltypes, "All remaining cell types", "Unknown") | ||
``` | ||
|
||
|
||
```{r, fig.height=60, fig.width=10} | ||
project_labels <- unique(all_results_df$project_label) | ||
# stacked bar chart showing the distribution of the top 9 cell types for each project, including Unknown | ||
project_labels |> | ||
purrr::map(\(label){ | ||
project_df <- plot_df |> | ||
dplyr::filter(project_label == label) |> | ||
dplyr::mutate( | ||
# relevel factors for specific project | ||
top_celltypes = forcats::fct_infreq(top_celltypes) |> | ||
forcats::fct_relevel("All remaining cell types", "Unknown", after = Inf) | ||
) | ||
# make a stacked bar chart with top cell types | ||
ggplot(project_df) + | ||
aes( | ||
x = library_id, | ||
y = percent_cells_annotation, | ||
fill = top_celltypes | ||
) + | ||
geom_col() + | ||
scale_y_continuous(expand = c(0,0)) + | ||
scale_fill_manual(values = colors, name = "cell type") + | ||
ggtitle(label) + | ||
theme(axis.text.x = element_blank()) | ||
}) |> | ||
patchwork::wrap_plots(ncol = 1) | ||
``` | ||
|
||
|
||
This looks really promising! | ||
A few observations: | ||
|
||
- Cell types identified tend to line up with expectations for the type of tumor. | ||
For example, leukemia libraries have T and B cells, brain tumors have macrophages, and solid tumors have fibroblasts and muscle cells. | ||
- Projects that I would expect to be more difficult to classify (sarcomas, wilms, RB) have fewer cells classified then things like brain and leukemia. | ||
Notably many of the solid tumor projects (4, 5, 12-16, and 23) have a handful of PDX samples where I would expect to see fewer normal cells. | ||
|
||
## Most frequently observed cell types | ||
|
||
The last thing we will do is look at the most frequently observed cell types across all samples. | ||
The below table is ordered by the number of libraries the cell type is observed. | ||
|
||
```{r} | ||
all_results_df |> | ||
dplyr::filter(consensus_annotation != "Unknown") |> | ||
dplyr::group_by(consensus_annotation) |> | ||
dplyr::summarize( | ||
total_libraries = dplyr::n(), | ||
min_percentage = min(percent_cells_annotation), | ||
mean_percentage = round(mean(percent_cells_annotation), 2), | ||
median_percentage = median(percent_cells_annotation), | ||
max_percentage = max(percent_cells_annotation) | ||
) |> | ||
dplyr::arrange(desc(total_libraries)) | ||
``` | ||
|
||
|
||
## Session info | ||
|
||
```{r session info} | ||
# record the versions of the packages used in this analysis and other environment information | ||
sessionInfo() | ||
``` | ||
|
2,247 changes: 2,247 additions & 0 deletions
2,247
analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.nb.html
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Exploratory notebooks | ||
|
||
This folder contains exploratory notebooks for this module. | ||
|
||
1. `01-reference-exploration.Rmd`: This notebook was used to explore possible consensus label assignments between cell types in the `PanglaoDB` and `BlueprintEncodeData` references. | ||
Observations made in this notebook were used to define the set of possible consensus labels to be included in [`references/consensus-cell-type-reference.tsv`](../references/consensus-cell-type-reference.tsv). | ||
|
||
2. `01-explore-consensus-results.Rmd`: This notebook summarizes the consensus labels assigned to all ScPCA samples. | ||
Prior to rendering this notebook results from the `cell-type-consensus` module in `OpenScPCA-nf` using the `2024-11-25` were downloaded. |
Oops, something went wrong.