
Poor SNP calling performance #937

Closed
valeandri opened this issue Feb 18, 2025 · 3 comments
@valeandri
Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.8/docs/FAQ.md:
YES

Describe the issue:
I used the example data described in https://github.com/google/deepvariant/blob/r1.8/docs/deepvariant-pacbio-model-case-study.md and got a very low SNP/INDEL calling performance.

Setup

  • Operating system: Ubuntu 22.04.5 LTS
  • DeepVariant version: 1.8
  • Installation method (Docker, built from source, etc.): Docker images built from the Dockerfile, for both CPU and GPU
  • Type of data: HG003 chr20 bam file, reference used GRCh38_no_alt_analysis_set.fasta

Steps to reproduce:

  • Command on cpu docker image:
time seq 0 {cpus_parallel} | parallel                            \
      --tmpdir $tmpDir/temp -q --halt 2 --line-buffer              \
     /opt/deepvariant/bin/make_examples                                       \
      --sample_name {sample_name}                                  \
      --mode calling                                               \
      --ref $reference_basename                                    \
      --reads $in_basename                                         \
      --regions $target_basename                                   \
      --examples $tmpDir/output/{sample_name}.examples@{cpus}.gz   \
      --checkpoint /opt/models/pacbio                                 \
      --alt_aligned_pileup "diff_channels"                         \
      --call_small_model_examples                                  \
      --max_reads_per_partition "600"                              \
      --min_mapping_quality "1"                                    \
      --parse_sam_aux_fields                                       \
      --partition_size "25000"                                     \
      --phase_reads                                                \
      --pileup_image_width "147"                                   \
      --norealign_reads                                            \
      --small_model_indel_gq_threshold "30"                        \
      --small_model_snp_gq_threshold "25"                          \
      --small_model_vaf_context_window_size "51"                   \
      --sort_by_haplotypes                                         \
      --trained_small_model_path "/opt/smallmodels/pacbio"         \
      --track_ref_reads                                            \
      --trim_reads_for_pileup                                      \
      --vsc_min_fraction_indels "0.12" --task {{}}

  • Command on gpu docker image:
time /opt/deepvariant/bin/make_examples                                                     \
        --outfile $tmpDir/output/call_variants_output.{sample_name}.tfrecord.gz \
        --examples $in_basename                                                 \
        --checkpoint /opt/models/pacbio 

time /opt/deepvariant/bin/postprocess_variants                                                  \
        --ref $reference_basename                                              \
        --infile $tmpDir/output/call_variants_output.{sample_name}.tfrecord.gz \
        --outfile $tmpDir/output/$out_basename
  • Error trace: (if applicable)
    The resulting hap.py benchmark:

    | Type  | Filter | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
    |-------|--------|---------------|------------------|----------------|-----------------|
    | INDEL | ALL    | 0.29347       | 0.9764           | 0.694141       | 0.451297        |
    | INDEL | PASS   | 0.29347       | 0.9764           | 0.694141       | 0.451297        |
    | SNP   | ALL    | 0.01267       | 0.939577         | 0.955238       | 0.025003        |
    | SNP   | PASS   | 0.01267       | 0.939577         | 0.955238       | 0.025003        |
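(Note that the F1 scores above are internally consistent: hap.py's F1 is the harmonic mean of precision and recall, so with precision near 1 the very low F1 is driven almost entirely by the low recall. A quick check with the numbers from the table:)

```python
# Sanity-check the hap.py F1 scores: F1 is the harmonic mean of
# precision (P) and recall (R): F1 = 2*P*R / (P + R).
def f1(recall, precision):
    return 2 * precision * recall / (precision + recall)

# Numbers from the hap.py table above.
print(round(f1(0.29347, 0.9764), 6))    # INDEL -> 0.451297
print(round(f1(0.01267, 0.939577), 6))  # SNP   -> 0.025003
```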

What am I missing here?

Valentina

@pichuan
Collaborator

pichuan commented Feb 20, 2025

Hi @valeandri ,

Under your "Command on gpu docker image:" section, your first line was:

time /opt/deepvariant/bin/make_examples                                                     \

I believe that's probably just a typo. It should be call_variants here.

Other than that, I went through your command and it looks reasonable to me.

Given your Recall is surprisingly low, can you share your hap.py command?


And, as another check, can you make sure you run through https://github.com/google/deepvariant/blob/r1.8/docs/deepvariant-pacbio-model-case-study.md and confirm that worked for you? Thanks!
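For reference, with the typo fixed, the first GPU step would read as follows (keeping the placeholders and container paths exactly as they appear in the original report; these are not verified here):

```shell
# Corrected: run call_variants (not make_examples) on the GPU image,
# using the same placeholders as in the original command.
time /opt/deepvariant/bin/call_variants \
    --outfile $tmpDir/output/call_variants_output.{sample_name}.tfrecord.gz \
    --examples $in_basename \
    --checkpoint /opt/models/pacbio
```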

@pichuan pichuan self-assigned this Feb 20, 2025
@valeandri
Author

valeandri commented Feb 20, 2025

Hi @pichuan,

Thanks for looking into this, and sorry for the typo. I did use call_variants, as you mentioned.

My hap.py command is the following:

FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
INPUT_DIR=/my_home/test_pipeline_hifi/test_workflow/HG003_bench/
OUTPUT_DIR=/my_home/test_pipeline_hifi/test_workflow/HG003_PacBio_chr20_out/
OUTPUT_VCF=HG003_pacbio_chr20_UUID.deepvariant.vcf.gz # result after postprocess_variants
REGION="chr20"
REF_DIR=/my_home/Reference/GRCh38
REF=GRCh38_no_alt_analysis_set.fasta

curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi

TRUTH_VCF="HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
TRUTH_BED="HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed"


sudo docker run \
  -v "${INPUT_DIR}":"${INPUT_DIR}" \
  -v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
  -v "${REF_DIR}:/ref" \
  jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
  "${INPUT_DIR}/${TRUTH_VCF}" \
  "${OUTPUT_DIR}/${OUTPUT_VCF}" \
  -f "${INPUT_DIR}/${TRUTH_BED}" \
  -r "/ref/${REF}" \
  -o "${OUTPUT_DIR}/hg003.pacbio.chr20.happy.output" \
  --engine=vcfeval \
  --pass-only \
  -l "${REGION}"

I used the input files from https://github.com/google/deepvariant/blob/r1.8/docs/deepvariant-pacbio-model-case-study.md (HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam), which I realigned with pbmm2 because some contig names differed from those in my FASTA:

index=/my_home/Reference/GRCh38/hg38_no_alt.mmi
tmpDir=/my_home/test_pipeline_hifi/test_workflow/HG003_PacBio_chr20

time pbmm2 align                         \
        --preset HIFI                          \
        --bam-index BAI                        \
        $index                        \
        $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam          \
        $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.realigned.bam

samtools sort $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.realigned.bam  > $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.sort.bam
samtools index $tmpDir/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.sort.bam

I hope this gives you enough information to identify the issue.

Valentina

@akolesnikov
Collaborator

Hi @valeandri,

There is one issue I noticed in your postprocessing command: you did not specify --small_model_cvo_records.

DeepVariant uses two models to call variants: a CNN model for the most difficult variants and a "small model" for the less difficult ones. The small model outputs predictions into a separate file. Later, when running postprocessing, you need to specify this file using the --small_model_cvo_records argument.

You can check scripts/run_deepvariant.py to see how all command-line arguments are set up.

Since the small model calls most of the variants, omitting it during postprocessing could explain the poor metrics you observed. Hope that helps!
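A sketch of the corrected postprocessing step. The `--small_model_cvo_records` flag is the one named above; the sharded CVO path shown here is hypothetical, so check scripts/run_deepvariant.py for the exact file naming your version uses:

```shell
# Sketch only: the small-model CVO path below is a placeholder,
# not the actual file name produced by make_examples.
time /opt/deepvariant/bin/postprocess_variants \
    --ref $reference_basename \
    --infile $tmpDir/output/call_variants_output.{sample_name}.tfrecord.gz \
    --small_model_cvo_records $tmpDir/output/{sample_name}.small_model_cvo@{cpus}.tfrecord.gz \
    --outfile $tmpDir/output/$out_basename
```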
