2 Proposed SNP Demultiplexing Benchmarking Analysis Plan

Main Goals

Identify the SNP demultiplexing software that most accurately identifies singlets and doublets
Identify the combinations of SNP demultiplexing software that most accurately identify singlets and doublets
Design and provide software for running the final selected method(s)

SNP-based Demultiplexing and Doublet Detection Software

The table below lists the five SNP-based demultiplexing and doublet detection software packages currently available, along with the similarities and differences in their inputs at the three major steps shared by all of them:

Pre-processing
SNV Identification and Allele Counting
Demultiplex and Doublet Identification

Software	Pre-Processing	SNV Identification and Allele Counting	Demultiplex & Doublet Identification
demuxlet	Requires VCF of reference SNPs; requires VCF of SNP genotypes for pooled individuals	Allele count files at known SNP genotype locations in each droplet	Calculates multiple likelihoods for each droplet (probability of a doublet; probability of a singlet from each individual) and assigns the droplet type with the highest likelihood
freemuxlet	Requires VCF of reference SNPs	Allele count files at predicted SNP genotype locations in each droplet	Cluster by SNP genotypes; identify doublets
scSplit	Remap with minimap; requires VCF of reference SNPs	Use freebayes to identify SNV locations; use scSplit script to count alleles at each SNV location for each droplet	Cluster cells by genotypes; identify doublets as another cluster
souporcell	Requires VCF of reference SNPs	Use freebayes to identify SNV locations; use vartrix (10x software) to count alleles at each SNV location for each droplet	Cluster cells by genotypes; identify doublets
vireo	Requires VCF of reference SNPs; optional VCF of SNP genotypes for pooled individuals	Use cellSNP or vartrix (10x) to produce allele counts	Cluster cells by genotypes; identify doublets

General Steps

The figure below shows the general steps, which are explained underneath it.

[Figure: Demultiplex_ONEK1K_Plan]

The 78 pools of PBMCs from >1000 individuals from the ONEK1K project will be used as a basis for testing SNP-based demultiplexing software
The pools will be processed and run per the instructions provided for each software
The droplets identified as singlets (and assigned to the same individual) by every software will be selected
The common singlets will be used to simulate new pools of singlets and doublets
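
The selection of common singlets described above can be sketched roughly as follows. This is a minimal illustration, not part of the planned pipeline: it assumes each software's output has already been reduced to a mapping from droplet barcode to either an individual ID or the label "doublet", and that the individual labels are comparable across software (the tools that cluster without reference genotypes would first need their clusters matched to individuals). The function name, barcodes and labels are all hypothetical.

```python
from typing import Dict

# Hypothetical per-software calls: barcode -> individual ID, or "doublet".
Calls = Dict[str, str]


def consensus_singlets(calls_by_software: Dict[str, Calls]) -> Dict[str, str]:
    """Return barcode -> individual for droplets that every software
    calls a singlet from the same individual."""
    all_calls = list(calls_by_software.values())
    if not all_calls:
        return {}
    # Only barcodes reported by every software can be in the consensus.
    shared = set(all_calls[0])
    for calls in all_calls[1:]:
        shared &= set(calls)
    consensus = {}
    for barcode in sorted(shared):
        assignments = {calls[barcode] for calls in all_calls}
        # Keep the droplet only when all software agree on one individual.
        if len(assignments) == 1 and "doublet" not in assignments:
            consensus[barcode] = assignments.pop()
    return consensus


# Toy example with made-up barcodes and labels.
example = {
    "demuxlet": {"AAAC": "ind1", "AAAG": "doublet", "AAAT": "ind2"},
    "vireo": {"AAAC": "ind1", "AAAG": "ind3", "AAAT": "ind1"},
}
print(consensus_singlets(example))  # {'AAAC': 'ind1'}
```

Droplets on which the software disagree, or that any software calls a doublet, are simply dropped here; other labels such as "unassigned" would need the same treatment in practice.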

Detailed Steps

The figure below shows the detailed steps, which are explained underneath it.

[Figure: Demultiplex_ONEK1K_Plan_Detailed]

1. Produce Singularity images for the demultiplexing software that will be used (enables consistency)
2. Preprocess the ONEK1K pools per the recommendations of each software (hg38 ONLY)
3. Compare the SNPs identified by each of the different software preprocessing steps
- Identify the proportion of SNPs that overlap across the software
- Identify the genotype call correlations for the common SNPs across the software
4. Use just the SNPs identified by every software for identifying singlets and doublets
5. Run all 78 pools with each SNP demultiplexing software using the SNPs identified in steps 2-4
6. Select just the droplets that were consistently identified as singlet or doublet and, if singlet, assigned to the same individual
- Consistent doublets will be used to build a null distribution of UMIs observed per doublet for simulating doublets
- Droplets consistently identified as singlets by every software will be used for simulations
7. Use the droplets consistently identified as singlets (from the same individual) to simulate new pools of individuals
- Aim for 100 simulations (to start; can expand as needed)
  - Ranging from 8 to 16 individuals per pool
    - 50% to be 12 individuals
    - 35% to be 8 individuals
    - 15% to be 16 individuals
- Simulate doublets by combining reads from two singlets at expected proportions
  - Assume a 50:50 distribution of reads from each individual for simplicity (long-term goal: investigate the impact of different ratios)
8. Run the preprocessing for each software
- Compare the SNPs identified and the SNP genotype calls across the software
9. Run each software on each of the simulations (including scrublet for identification of doublets based on transcriptomes)
- Use the recommended preprocessed input
- Also run with just the SNPs that were identified by every software
- Long-term goal: convert the allele count pileup files into file formats that can be used by the other software
10. Identify the accuracy of each software by itself at identifying true singlets and doublets
11. Identify the accuracy of combinations of software at identifying true singlets and doublets
12. Build a software tool that runs each software and outputs both the individual results and a combined summary indicating the singlets, their individual assignments and the doublets
- By default, the tool will run the software selected based on the results from steps 9-11
- Long-term goal: include options to run selected software of interest and output the results from each along with a summary
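
To make the accuracy steps more concrete, here is a small sketch of how single-software and combined accuracy could be scored against the known truth of a simulated pool. Several things in it are assumptions rather than part of the plan: the plan does not say how calls from several software are combined, so a simple strict-majority vote is used for illustration; accuracy is taken as the fraction of droplets whose call exactly matches the truth; and the software names, barcodes and calls are toy data, not benchmark results.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable

Calls = Dict[str, str]  # barcode -> individual ID or "doublet"


def accuracy(calls: Calls, truth: Calls) -> float:
    """Fraction of simulated droplets whose call matches the known truth."""
    if not truth:
        return 0.0
    correct = sum(calls.get(bc) == label for bc, label in truth.items())
    return correct / len(truth)


def majority_vote(calls_by_software: Dict[str, Calls],
                  barcodes: Iterable[str]) -> Calls:
    """Combine software: keep a label only if more than half of them agree
    (an illustrative rule, assumed here)."""
    n_software = len(calls_by_software)
    combined = {}
    for bc in barcodes:
        votes = Counter(c[bc] for c in calls_by_software.values() if bc in c)
        label, count = votes.most_common(1)[0] if votes else ("unassigned", 0)
        combined[bc] = label if count > n_software / 2 else "unassigned"
    return combined


# Toy truth and calls (hypothetical software names and values).
truth = {"bc1": "ind1", "bc2": "ind2", "bc3": "doublet", "bc4": "ind1"}
calls = {
    "software_a": {"bc1": "ind1", "bc2": "ind2", "bc3": "ind1", "bc4": "ind1"},
    "software_b": {"bc1": "ind1", "bc2": "ind1", "bc3": "doublet", "bc4": "ind1"},
    "software_c": {"bc1": "ind2", "bc2": "ind2", "bc3": "doublet", "bc4": "ind1"},
}

# Score every software alone, then every combination of two or more.
results = {}
for name, software_calls in calls.items():
    results[(name,)] = accuracy(software_calls, truth)
for size in range(2, len(calls) + 1):
    for combo in combinations(sorted(calls), size):
        subset = {name: calls[name] for name in combo}
        results[combo] = accuracy(majority_vote(subset, truth), truth)

for combo, acc in results.items():
    print("+".join(combo), acc)
```

In this toy data each software alone scores 0.75, each pair scores 0.5 because a strict majority of two requires full agreement, and all three together score 1.0. Separate singlet and doublet accuracies could be obtained by restricting the truth to each droplet type before scoring.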

Future Analysis Goals - Not Needed to Select Software for Consortium Work

These are additional situations and variables that we would like to investigate, but they will not be required for deciding on the most appropriate software for identifying singlets and doublets and assigning cells to individuals.

Investigate the impact of the following situations on the ability of the software to effectively identify doublets and singlets and assign cells to individuals:

Different ratios of RNA from two individuals in doublets
Unequal proportions of cells from each individual (e.g., only 100 cells from one individual but ~2000 from each of the other individuals in the pool)
Convert the allele matrices that are used as input for each software to the formats used by the other software in order to assess the impact of the preprocessing steps
Assess potential reasons for failed singlet/doublet identification
- % Mt
- % Rb
- UMIs
- Gene Counts
Assess impact of relatedness of individuals on ability to demultiplex
Maximum number of individuals that can be reliably demultiplexed with 20,000 (and more) cells sequenced
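
For the droplet-level metrics listed above (% Mt, % Rb, UMIs, gene counts), a per-droplet calculation might look like the sketch below. It assumes human gene symbols, where mitochondrial genes start with "MT-" and ribosomal protein genes start with "RPS" or "RPL", and it assumes the UMI counts for one droplet are available as a gene-to-count mapping. The function name and example counts are hypothetical.

```python
from typing import Dict


def qc_metrics(gene_counts: Dict[str, int]) -> Dict[str, float]:
    """Compute simple QC metrics for one droplet from its UMI counts per gene."""
    total = sum(gene_counts.values())
    # Assumption: human gene symbols ("MT-" mitochondrial, "RPS"/"RPL" ribosomal).
    mt = sum(n for gene, n in gene_counts.items() if gene.startswith("MT-"))
    rb = sum(n for gene, n in gene_counts.items()
             if gene.startswith(("RPS", "RPL")))
    return {
        "pct_mt": 100.0 * mt / total if total else 0.0,
        "pct_rb": 100.0 * rb / total if total else 0.0,
        "n_umis": total,
        # Number of genes with at least one UMI detected.
        "n_genes": sum(1 for n in gene_counts.values() if n > 0),
    }


# Made-up counts for a single droplet.
droplet = {"MT-CO1": 5, "RPS6": 10, "RPL13": 5, "CD3E": 30, "MS4A1": 0}
print(qc_metrics(droplet))
# {'pct_mt': 10.0, 'pct_rb': 30.0, 'n_umis': 50, 'n_genes': 4}
```

Comparing these values between correctly and incorrectly classified droplets would be one way to look for the reasons behind failed singlet/doublet identification.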

Software and Dependency Versions

Singularity bucket built with

Bootstrap: docker
From: conda/miniconda3

popscle (demuxlet and freemuxlet)

htslib v1.10.2
popscle v1 (demuxlet v2 and freemuxlet v1)

scSplit

htslib v1.10.2
samtools v1.10
bcftools v1.10.2
freebayes v1.3.1
vcftools v0.1.16
python v3.6.8
- pandas v1.0.3
- cython v0.29.16
- numpy v1.18.2
- pysam v0.15.4
- PyVCF v0.6.8
- scikit-learn v0.22.2.post1
- scipy v1.4.1
- matplotlib v3.2.1
- scSplit v1.0.4

souporcell

minimap2 v2.7
bedtools v2.29.2
htslib v1.10.2
samtools v1.10
bcftools v1.10.2
rustc v1.35.0
python v3.6.8
- pysam v0.15.4
- PyVCF v0.6.8
- numpy v1.18.2
- scipy v1.4.1
- pystan v2.17.1.0
- pyfaidx v0.5.8
- cython v0.29.16
freebayes v1.3.1
vartrix v1.1.3
souporcell v0.1.7

vireo

python v3.6.8
- pysam v0.15.4
- scipy v1.4.1
- matplotlib v3.2.1
- cython v0.29.16
cellSNP v0.1.7
vireoSNP v0.3.1

scrublet

python v3.6.8
- cython v0.29.16
- numpy v1.18.2
- scipy v1.4.1
- scikit-learn v0.22.2.post1
- scikit-image v0.16.2
- matplotlib v3.2.1
- numba v0.48.0
- pandas v1.0.3
- annoy v1.16.3
- umap-learn v0.3.10
- scrublet v1 (? installed on 29 March 2020)
