
Poor SNP calling performance #937

Closed
valeandri opened this issue Feb 18, 2025 · 3 comments
@valeandri
Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.8/docs/FAQ.md:
YES

Describe the issue:
I used the example data described in https://github.com/google/deepvariant/blob/r1.8/docs/deepvariant-pacbio-model-case-study.md and got a very low SNP/INDEL calling performance.

Setup

  • Operating system: Ubuntu 22.04.5 LTS
  • DeepVariant version: 1.8
  • Installation method (Docker, built from source, etc.): Docker images built from the Dockerfile, for both CPU and GPU
  • Type of data: HG003 chr20 bam file, reference used GRCh38_no_alt_analysis_set.fasta

Steps to reproduce:

  • Command on cpu docker image:
time seq 0 {cpus_parallel} | parallel                            \
      --tmpdir $tmpDir/temp -q --halt 2 --line-buffer              \
     /opt/deepvariant/bin/make_examples                                       \
      --sample_name {sample_name}                                  \
      --mode calling                                               \
      --ref $reference_basename                                    \
      --reads $in_basename                                         \
      --regions $target_basename                                   \
      --examples $tmpDir/output/{sample_name}.examples@{cpus}.gz   \
      --checkpoint /opt/models/pacbio                                 \
      --alt_aligned_pileup "diff_channels"                         \
      --call_small_model_examples                                  \
      --max_reads_per_partition "600"                              \
      --min_mapping_quality "1"                                    \
      --parse_sam_aux_fields                                       \
      --partition_size "25000"                                     \
      --phase_reads                                                \
      --pileup_image_width "147"                                   \
      --norealign_reads                                            \
      --small_model_indel_gq_threshold "30"                        \
      --small_model_snp_gq_threshold "25"                          \
      --small_model_vaf_context_window_size "51"                   \
      --sort_by_haplotypes                                         \
      --trained_small_model_path "/opt/smallmodels/pacbio"         \
      --track_ref_reads                                            \
      --trim_reads_for_pileup                                      \
      --vsc_min_fraction_indels "0.12" --task {{}}

  • Command on gpu docker image:
time /opt/deepvariant/bin/make_examples                                                     \
        --outfile $tmpDir/output/call_variants_output.{sample_name}.tfrecord.gz \
        --examples $in_basename                                                 \
        --checkpoint /opt/models/pacbio 

time /opt/deepvariant/bin/postprocess_variants                                                  \
        --ref $reference_basename                                              \
        --infile $tmpDir/output/call_variants_output.{sample_name}.tfrecord.gz \
        --outfile $tmpDir/output/$out_basename
  • Error trace: (if applicable)
    The resulting hap.py benchmark:

    | Type  | Filter | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
    |-------|--------|---------------|------------------|----------------|-----------------|
    | INDEL | ALL    | 0.29347       | 0.9764           | 0.694141       | 0.451297        |
    | INDEL | PASS   | 0.29347       | 0.9764           | 0.694141       | 0.451297        |
    | SNP   | ALL    | 0.01267       | 0.939577         | 0.955238       | 0.025003        |
    | SNP   | PASS   | 0.01267       | 0.939577         | 0.955238       | 0.025003        |
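(Note that the F1 scores above are internally consistent: hap.py's F1 is the harmonic mean of precision and recall, so with precision near 1 the very low F1 is driven almost entirely by the low recall. A quick check with the numbers from the table:)

```python
# Sanity-check the hap.py F1 scores: F1 is the harmonic mean of
# precision (P) and recall (R): F1 = 2*P*R / (P + R).
def f1(recall, precision):
    return 2 * precision * recall / (precision + recall)

# Numbers from the hap.py table above.
print(round(f1(0.29347, 0.9764), 6))    # INDEL -> 0.451297
print(round(f1(0.01267, 0.939577), 6))  # SNP   -> 0.025003
```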

What am I missing here?

Valentina

@pichuan
Collaborator

pichuan commented Feb 20, 2025

Hi @valeandri ,

Under your "Command on gpu docker image:" section, your first line was:

time /opt/deepvariant/bin/make_examples                                                     \

I believe that's probably just a typo. It should be call_variants here.

Other than that, I went through your command and it looks reasonable to me.

Given your Recall is surprisingly low, can you share your hap.py command?


And, as another check, can you make sure you run through https://github.com/google/deepvariant/blob/r1.8/docs/deepvariant-pacbio-model-case-study.md and confirm that worked for you? Thanks!
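For reference, with the typo fixed, the first GPU step would read as follows (keeping the placeholders and container paths exactly as they appear in the original report; these are not verified here):

```shell
# Corrected: run call_variants (not make_examples) on the GPU image,
# using the same placeholders as in the original command.
time /opt/deepvariant/bin/call_variants \
    --outfile $tmpDir/output/call_variants_output.{sample_name}.tfrecord.gz \
    --examples $in_basename \
    --checkpoint /opt/models/pacbio
```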

@pichuan pichuan self-assigned this Feb 20, 2025
@valeandri
Author

valeandri commented Feb 20, 2025

Hi @pichuan,

Thanks for looking into this, and sorry for the typo. I did use call_variants, as you mentioned.

My hap.py command is the following:

FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
INPUT_DIR=/my_home/test_pipeline_hifi/test_workflow/HG003_bench/
OUTPUT_DIR=/my_home/test_pipeline_hifi/test_workflow/HG003_PacBio_chr20_out/
OUTPUT_VCF=HG003_pacbio_chr20_UUID.deepvariant.vcf.gz # result after postprocess_variants
REGION="chr20"
REF_DIR=/my_home/Reference/GRCh38
REF=GRCh38_no_alt_analysis_set.fasta

curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi

TRUTH_VCF="HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
TRUTH_BED="HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed"


sudo docker run \
  -v "${INPUT_DIR}":"${INPUT_DIR}" \
  -v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
  -v "${REF_DIR}:/ref" \
  jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
  "${INPUT_DIR}/${TRUTH_VCF}" \
  "${OUTPUT_DIR}/${OUTPUT_VCF}" \
  -f "${INPUT_DIR}/${TRUTH_BED}" \
  -r "/ref/${REF}" \
  -o "${OUTPUT_DIR}/hg003.pacbio.chr20.happy.output" \
  --engine=vcfeval \
  --pass-only \
  -l "${REGION}"

I used the input files from https://github.com/google/deepvariant/blob/r1.8/docs/deepvariant-pacbio-model-case-study.md (HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam), which I realigned with pbmm2 because some contig names differed from those in my FASTA:

index=/my_home/Reference/GRCh38/hg38_no_alt.mmi
tmpDir=/my_home/test_pipeline_hifi/test_workflow/HG003_PacBio_chr20

time pbmm2 align                         \
        --preset HIFI                          \
        --bam-index BAI                        \
        $index                        \
        $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam          \
        $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.realigned.bam

samtools sort $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.realigned.bam  > $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.sort.bam
samtools index $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.sort.bam

I hope this gives you enough information to identify the issue.

Valentina

@akolesnikov
Collaborator

Hi @valeandri,

There is one issue I noticed in your postprocessing command: you did not specify --small_model_cvo_records.

DeepVariant uses two models to call variants: a CNN model for the most difficult variants and a "small model" for the less difficult ones. The small model outputs predictions into a separate file. Later, when running postprocessing, you need to specify this file using the --small_model_cvo_records argument.

You can check scripts/run_deepvariant.py to see how all command-line arguments are set up.

Since the small model calls most of the variants, omitting it during postprocessing could explain the poor metrics you observed. Hope that helps!
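A sketch of the corrected postprocessing step. The `--small_model_cvo_records` flag is the one named above; the sharded CVO path shown here is hypothetical, so check scripts/run_deepvariant.py for the exact file naming your version uses:

```shell
# Sketch only: the small-model CVO path below is a placeholder,
# not the actual file name produced by make_examples.
time /opt/deepvariant/bin/postprocess_variants \
    --ref $reference_basename \
    --infile $tmpDir/output/call_variants_output.{sample_name}.tfrecord.gz \
    --small_model_cvo_records $tmpDir/output/{sample_name}.small_model_cvo@{cpus}.tfrecord.gz \
    --outfile $tmpDir/output/$out_basename
```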
