
2 Proposed SNP Demultiplexing Benchmarking Analysis Plan

drneavin edited this page Mar 31, 2020 · 14 revisions

Main Goals

  1. Identify the SNP demultiplexing software(s) that are the most accurate at identifying singlets and doublets
  2. Identify the combinations of SNP demultiplexing software that most accurately identify singlets and doublets
  3. Design and provide software for running the final selected method

SNP-based Demultiplexing and Doublet Detection Software

The table below lists the 5 SNP-based demultiplexing and doublet detection software packages currently available, along with how their inputs compare at the three major steps shared by all of them:

  • Pre-processing
  • SNV Identification and Allele Counting
  • Demultiplex and Doublet Identification

| Software | Pre-Processing | SNV Identification and Allele Counting | Demultiplex & Doublet Identification |
| --- | --- | --- | --- |
| demuxlet | Requires vcf of reference SNPs; requires vcf of SNP genotypes for pooled individuals | Allele counts files at known SNP genotype locations in each droplet | Calculates multiple likelihoods for each droplet (probability doublet; probability singlet from each individual) and takes the higher negative log likelihood to assign droplet type |
| freemuxlet | Requires vcf of reference SNPs | Allele counts files at predicted SNP genotype locations in each droplet | Cluster by SNP genotypes; identify doublets |
| scSplit | Remap with minimap; requires vcf of reference SNPs | 1. Use freebayes to identify SNV locations; 2. Use scSplit script to count alleles at each SNV location for each droplet | Cluster cells by genotypes; identify doublets as another cluster |
| souporcell | Requires vcf of reference SNPs | 1. Use freebayes to identify SNV locations; 2. Use Vartrix (10x software) to count alleles at each SNV location for each droplet | Cluster cells by genotypes; identify doublets |
| vireo | Requires vcf of reference SNPs; optional vcf of SNP genotypes for pooled individuals | Use cellSNP or vartrix (10x) to produce allele counts | Cluster cells by genotypes; identify doublets |

General Steps

The figure below summarizes the general steps, which are explained after it.

Figure: Demultiplex_ONEK1K_Plan

  1. The 78 pools of PBMCs from >1000 individuals from the ONEK1K project will be used as a basis for testing SNP-based demultiplexing software
  2. The pools will be processed and run per the instructions provided for each software
  3. The droplets that every software identifies as a singlet from the same individual will be selected
  4. The common singlets will be used to simulate new pools of singlets and doublets
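
Step 3 above can be sketched as a simple set operation. This is a minimal illustration, not the project's implementation: the barcodes, tool outputs and the `consensus_singlets` helper are hypothetical, and it assumes each tool's output has already been converted to a common labelling (an individual ID, or "doublet"). Reference-free tools report anonymous clusters, so their clusters would first need to be matched to individuals.

```python
# Hypothetical per-tool calls: droplet barcode -> individual ID, or "doublet".
# Real tool outputs would first need converting to this common labelling.
calls = {
    "demuxlet":   {"AAAC": "ind1", "AAAG": "ind2", "AAAT": "doublet", "AACC": "ind3"},
    "souporcell": {"AAAC": "ind1", "AAAG": "ind2", "AAAT": "doublet", "AACC": "ind1"},
    "vireo":      {"AAAC": "ind1", "AAAG": "doublet", "AAAT": "doublet", "AACC": "ind3"},
}


def consensus_singlets(calls):
    """Return {barcode: individual} for droplets that every tool calls
    a singlet from the same individual."""
    per_tool = list(calls.values())
    # Only droplets reported by every tool can be consistent across tools.
    shared = set(per_tool[0]).intersection(*per_tool[1:])
    consensus = {}
    for barcode in sorted(shared):
        labels = {tool_calls[barcode] for tool_calls in per_tool}
        # Exactly one label, and that label is an individual rather than "doublet".
        if len(labels) == 1 and "doublet" not in labels:
            consensus[barcode] = labels.pop()
    return consensus


common = consensus_singlets(calls)
print(common)  # {'AAAC': 'ind1'}
```

Requiring unanimous agreement is deliberately strict: it trades the number of retained droplets for confidence that the singlets used in simulations are true singlets.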

Detailed Steps

The figure below shows the detailed steps, which are explained after it.

Figure: Demultiplex_ONEK1K_Plan_Detailed

  1. Produce Singularity images for the demultiplexing software that will be used, to keep the software environments consistent
  2. Preprocess the ONEK1K pools per the recommendations of each software (hg38 only)
  3. Compare the SNPs identified by each of the different software preprocessing steps
    • Identify the proportion of SNPs that overlap across the software tools
    • Identify the genotype call correlations for the common SNPs across the software tools
  4. Use just the SNPs identified by all softwares for identifying singlets and doublets
  5. Run all 78 pools with each SNP demultiplexing software using the SNPs identified in steps 2-4
  6. Select just the droplets that every software assigned to the same droplet type (singlet or doublet) and, for singlets, to the same individual
    • Consistent doublets will be used to build a null distribution of UMIs observed per doublet for simulating doublets
    • Droplets consistently identified as singlets by each of the softwares will be used for simulations
  7. Use the droplets consistently identified as singlets (from the same individual) to simulate new pools of individuals
    • Aim for 100 simulations (to start and can expand as needed)
      • Ranging from 8 to 16 individuals per pool
        • 50% to be 12 individuals
        • 35% to be 8 individuals
        • 15% to be 16 individuals
    • Simulate doublets by combining reads from two singlets at expected proportions
      • Assume a 50:50 distribution of reads from each individual for simplicity (long-term goal to investigate the impact of different ratios)
  8. Run the preprocessing for each of the softwares
    • Compare the SNPs identified, and SNP genotype calls across the softwares
  9. Run each of the softwares on each of the simulations (including scrublet for identification of doublets based on transcriptomes)
    • Use the recommended preprocessed input
    • Also run with just the SNPs that were identified by all softwares
    • Long-term goal: convert the allele count pileup files into file formats that can be used by the other softwares
  10. Identify the accuracy of each software by itself at identifying true singlets and doublets
  11. Identify the accuracy of combinations of softwares at identifying true singlets and doublets
  12. Build a software tool that runs each software and outputs both the individual results and a summary across the tools run, indicating the singlets, their individual assignments and the doublets
    • By default, the tool will run the software selected based on the results from steps 9-11
    • Long-term goal: include options to run any software of interest and output the results from each plus a summary across the tools run
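
Steps 10 and 11 could be evaluated along the following lines once the simulated pools provide a known truth for every droplet. The plan does not fix a metric or a way of combining tools, so both choices here are assumptions for illustration: accuracy is the fraction of droplets labelled correctly, and a combination is scored by strict majority vote (droplets without a majority are marked "unassigned" and count as wrong). Tool names, barcodes and helper functions are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical simulated truth and per-tool calls (barcode -> individual or "doublet").
truth = {"bc1": "ind1", "bc2": "ind2", "bc3": "doublet", "bc4": "ind1"}
calls = {
    "toolA": {"bc1": "ind1", "bc2": "ind2", "bc3": "ind1", "bc4": "ind1"},
    "toolB": {"bc1": "ind1", "bc2": "doublet", "bc3": "doublet", "bc4": "ind1"},
    "toolC": {"bc1": "ind2", "bc2": "ind2", "bc3": "doublet", "bc4": "ind1"},
}


def accuracy(pred, truth):
    """Fraction of truth droplets whose predicted label matches exactly."""
    correct = sum(pred.get(barcode) == label for barcode, label in truth.items())
    return correct / len(truth)


def majority_vote(calls, tools):
    """Combine the given tools: keep a label only if more than half of them agree."""
    barcodes = set().union(*(calls[tool] for tool in tools))
    combined = {}
    for barcode in barcodes:
        votes = Counter(calls[tool][barcode] for tool in tools if barcode in calls[tool])
        label, count = votes.most_common(1)[0]
        combined[barcode] = label if count > len(tools) / 2 else "unassigned"
    return combined


# Step 10: each tool on its own.
for tool, pred in calls.items():
    print(tool, accuracy(pred, truth))

# Step 11: every combination of two or more tools.
for size in range(2, len(calls) + 1):
    for combo in combinations(sorted(calls), size):
        print(combo, accuracy(majority_vote(calls, combo), truth))
```

In this toy data each tool makes one different mistake, so the three-tool vote recovers every droplet while any pair leaves ties unassigned. Real pools may behave differently, which is what the benchmarking is meant to establish.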

Future Analysis Goals - Not Needed to Select Softwares for Consortium Work

These are additional situations and variables that we would like to investigate, but they are not required for deciding which software(s) are most appropriate for identifying singlets and doublets and assigning cells to individuals.

Investigate the impact of the following situations on the ability of the softwares to effectively identify doublets, singlets and assign cells to individuals:

  • Different ratios of RNA from two individuals in doublets
  • Unequal proportions of cells from each individual (e.g. only 100 cells from one individual but ~2000 from each of the other individuals in the pool)
  • Convert the allele matrices that are used as input for each of the softwares to formats that are used by the other softwares in order to assess the impact of the preprocessing steps
  • Assess potential reasons for failed singlet/doublet identification
    • % Mt
    • % Rb
    • UMIs
    • Gene Counts
  • Assess impact of relatedness of individuals on ability to demultiplex
  • Maximum number of individuals that can be reliably demultiplexed with 20,000 (and more) cells sequenced
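
For the first item above, a doublet with an arbitrary RNA ratio could be simulated by sampling reads from two singlets in the chosen proportion. The sketch below is only an illustration under stated assumptions: reads are represented as plain ID strings rather than BAM records, and `simulate_doublet`, `total_reads` and `frac_a` are hypothetical names. The default `frac_a=0.5` corresponds to the 50:50 assumption used in the main simulations. In practice `total_reads` could be drawn from the distribution of UMIs observed in consistently identified doublets.

```python
import random


def simulate_doublet(reads_a, reads_b, total_reads, frac_a=0.5, seed=0):
    """Sample reads from two singlets so that a fraction frac_a of the
    simulated doublet's reads come from droplet A and the rest from droplet B."""
    if not 0.0 <= frac_a <= 1.0:
        raise ValueError("frac_a must be between 0 and 1")
    n_a = round(total_reads * frac_a)
    n_b = total_reads - n_a
    if n_a > len(reads_a) or n_b > len(reads_b):
        raise ValueError("a singlet has too few reads for the requested mix")
    rng = random.Random(seed)  # seeded so that a simulation can be reproduced
    # Sample without replacement from each singlet, then pool the reads.
    return rng.sample(reads_a, n_a) + rng.sample(reads_b, n_b)


# Hypothetical read IDs for two singlets from different individuals.
reads_a = [f"A_read{i}" for i in range(20)]
reads_b = [f"B_read{i}" for i in range(20)]

even = simulate_doublet(reads_a, reads_b, total_reads=10)                # 50:50 mix
skewed = simulate_doublet(reads_a, reads_b, total_reads=10, frac_a=0.8)  # 80:20 mix
print(sum(r.startswith("A_") for r in even), sum(r.startswith("A_") for r in skewed))
```

Sampling without replacement keeps every simulated read traceable to a real read from one of the two source droplets, which makes the true composition of each simulated doublet known exactly.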

Software and Dependency Versions

Singularity bucket built with

Bootstrap: docker
From: conda/miniconda3

popscle (demuxlet and freemuxlet)

  • htslib v1.10.2
  • popscle v1 (demuxlet v2 and freemuxlet v1)

scSplit

  • htslib v1.10.2
  • samtools v1.10
  • bcftools v1.10.2
  • freebayes v1.3.1
  • vcftools v0.1.16
  • python v3.6.8
    • pandas v1.0.3
    • cython v0.29.16
    • numpy v1.18.2
    • pysam v0.15.4
    • PyVCF v0.6.8
    • scikit-learn v0.22.2.post1
    • scipy v1.4.1
    • matplotlib v3.2.1
    • scSplit v1.0.4

souporcell

  • minimap2 v2.7
  • bedtools v2.29.2
  • htslib v1.10.2
  • samtools v1.10
  • bcftools v1.10.2
  • rustc v1.35.0
  • python v3.6.8
    • pysam v0.15.4
    • PyVCF v0.6.8
    • numpy v1.18.2
    • scipy v1.4.1
    • pystan v2.17.1.0
    • pyfaidx v0.5.8
    • cython v0.29.16
  • freebayes v1.3.1
  • vartrix v1.1.3
  • souporcell v0.1.7

vireo

  • python v3.6.8
    • pysam v0.15.4
    • scipy v1.4.1
    • matplotlib v3.2.1
    • cython v0.29.16
  • cellSNP v0.1.7
  • vireoSNP v0.3.1

scrublet

  • python v3.6.8
    • cython v0.29.16
    • numpy v1.18.2
    • scipy v1.4.1
    • scikit-learn v0.22.2.post1
    • scikit-image v0.16.2
    • matplotlib v3.2.1
    • numba v0.48.0
    • pandas v1.0.3
    • annoy v1.16.3
    • umap-learn v0.3.10
    • scrublet v1 (? installed on 29 March, 2020)