MapQuery for many cells and batches returns bad cluster prediction results #9467

martibonomi · 2024-11-09T04:04:44Z

Hi,

I am analysing scRNA-seq data from three different datasets comprising only T cells.

The datasets comprise 170 batches and ~435k cells in total. I therefore wanted to build a reference from 30 batches (70k cells in total) and then project the remaining 140 batches (365k cells in total) onto it, so that each remaining cell is assigned a cluster based on the reference UMAP.

To do this, I integrated the reference with the RPCA algorithm, since CCA did not give a good integration, using the following code:

# Integrating the reference using RPCA
ref = IntegrateLayers(
  object = ref, method = RPCAIntegration, assay = "RNA",
  orig.reduction = "pca", new.reduction = "integrated.rpca.rna",
  verbose = TRUE
)

# Find clusters and plot UMAP
DefaultAssay(ref) = "RNA"
ref = FindNeighbors(ref, reduction = "integrated.rpca.rna", dims = 1:50)
ref = RunUMAP(ref, reduction = 'integrated.rpca.rna', dims = 1:50, assay = 'RNA', reduction.name = 'rna.rpca.umap', reduction.key = 'RPCAUMAP_', return.model = TRUE)
ref = FindClusters(ref, resolution = 0.5, cluster.name = "ref_rpca_clusters")

Afterwards, I mapped each of the remaining 140 query batches onto the reference independently, using the following code:

# Project each single query batch on reference
query_meta = c()
for(batch_i in batches_query){
  
  # Select current query batch from query list
  query_i = query_list[[batch_i]]
  
  # Process query i 
  query_i = NormalizeData(query_i, assay = "RNA", normalization.method = "LogNormalize")
  query_i = FindVariableFeatures(query_i, assay = "RNA", nfeatures = 2000)
  query_i = ScaleData(query_i, assay = "RNA")
  query_i = RunPCA(query_i, assay = "RNA", reduction.name = 'pca', npcs = 50)
  
  # Map query i on reference
  anchors = FindTransferAnchors(reference = ref, query = query_i, dims = 1:50, reference.reduction = "integrated.rpca.rna")
  query_i = MapQuery(anchorset = anchors, reference = ref, query = query_i, refdata = list(celltype = "ref_rpca_clusters"), reference.reduction = "integrated.rpca.rna", reduction.model = "rna.rpca.umap")
  query_i = AddMetaData(query_i, as.data.frame(query_i@reductions[["ref.umap"]]@cell.embeddings))
  
  query_meta = rbind(query_meta, query_i@meta.data)

}

However, when I plot the projection results for the query batches, the predicted clusters are not confined to the same regions of the UMAP as in the reference; instead, they are spread across the whole UMAP. In addition, most cells have a very low cluster prediction score, as shown in the following figure (left: clusters from the reference; right: predicted clusters from the query batches projected onto the reference, coloured by cluster prediction score):

What can I do to improve these results? Is there any parameter I can change to improve the projection and the predictions? Is this approach valid in the first place?
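To make the "very low score" observation concrete, one option is to tabulate the fraction of low-scoring cells per predicted cluster. The following is a minimal base-R sketch on toy data, not part of my actual pipeline. It assumes the metadata columns are named `predicted.celltype` and `predicted.celltype.score` (which is what `refdata = list(celltype = ...)` should produce); `query_meta_toy` and `score_cutoff` are hypothetical names, and the 0.5 cutoff is arbitrary.

```r
# Toy stand-in for the query_meta data frame built in the loop above
# (hypothetical values; the real scores come from MapQuery).
set.seed(1)
query_meta_toy = data.frame(
  predicted.celltype = sample(c("0", "1", "2"), 300, replace = TRUE),
  predicted.celltype.score = runif(300)
)

# Arbitrary cutoff chosen for illustration only
score_cutoff = 0.5

# Flag low-confidence cells, then compute their fraction per predicted cluster
low = query_meta_toy$predicted.celltype.score < score_cutoff
frac_low = tapply(low, query_meta_toy$predicted.celltype, mean)

print(round(frac_low, 2))
```

On the real data, replacing `query_meta_toy` with `query_meta` would show whether the low scores are concentrated in a few clusters or spread across all of them.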

Which is more likely: that the cells are projected to the correct position on the UMAP and should be assigned the cluster found at that location, or that they truly belong to the predicted cluster but have been projected to the wrong position?
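One way to measure how often the two answers disagree is to compare each query cell's predicted cluster with the cluster of its nearest reference cell in UMAP space. The sketch below uses toy coordinates and plain base R. All object names are hypothetical, and the 1-nearest-neighbour rule is only an illustrative diagnostic, not something Seurat does internally. On the real objects, the inputs could presumably be taken from `Embeddings(ref, "rna.rpca.umap")` and `Embeddings(query_i, "ref.umap")`.

```r
set.seed(42)

# Toy reference: two well-separated clusters in 2-D UMAP space (hypothetical)
ref_umap = rbind(
  matrix(rnorm(100, mean = -3), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2)
)
ref_cluster = rep(c("0", "1"), each = 50)

# Toy query: projected coordinates plus the label-transfer prediction
query_umap = rbind(
  matrix(rnorm(20, mean = -3), ncol = 2),
  matrix(rnorm(20, mean = 3), ncol = 2)
)
query_predicted = rep(c("0", "1"), each = 10)

# For each query cell, take the cluster of the closest reference cell
nearest_ref_cluster = apply(query_umap, 1, function(xy) {
  d2 = (ref_umap[, 1] - xy[1])^2 + (ref_umap[, 2] - xy[2])^2
  ref_cluster[which.min(d2)]
})

# Fraction of query cells where position-based and predicted labels agree
agreement = mean(nearest_ref_cluster == query_predicted)

# Cross-tabulate the two labelings to see where they diverge
print(table(predicted = query_predicted, by_position = nearest_ref_cluster))
print(agreement)
```

A low agreement would not by itself say which of the two labels is wrong, only how large the discrepancy is and which clusters it affects.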

Thank you very much for your help!
