
v1.10.0-beta

@danielecook released this 09 Oct 17:46
46027eb

This beta release focuses solely on DeepVariant, with no updates for pangenome-aware DeepVariant or DeepTrio. We encourage users to provide feedback, report bugs, and offer suggestions to help us improve.

  • Code is available on the r1.10.0-beta branch.
  • Docker: google/deepvariant:1.10.0-beta
  • Docker (GPU): google/deepvariant:1.10.0-beta-gpu
  • We have updated the metrics page with the latest accuracy / runtime results.
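
To get started quickly, the commands below show one way to pull the beta images and check out the source. This is a sketch rather than an official install recipe: the repository URL is the public DeepVariant GitHub repository, and the tags and branch are the ones listed above.

# Pull the beta Docker images (CPU and GPU):
docker pull google/deepvariant:1.10.0-beta
docker pull google/deepvariant:1.10.0-beta-gpu

# Or fetch the source at the corresponding branch:
git clone --branch r1.10.0-beta https://github.com/google/deepvariant.git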

Key updates are detailed below.

Continuous Phasing

It is now possible for DeepVariant to natively emit a phased VCF for long reads (PacBio and ONT), leveraging the long-range information from these reads to accurately phase variants and assign a haplotype.

To enable this feature, you must set the following flags when running with run_deepvariant:

--make_examples_extra_args="phase_reads=true,output_phase_info=true,output_local_read_phasing=/tmp/read-phasing_debug@${N_SHARDS}.tsv" \
--postprocess_variants_extra_args="phased_reads_input_path=/tmp/read-phasing_debug@${N_SHARDS}.tsv"

Make sure that N_SHARDS matches the number of shards used for the rest of the run (i.e., the value passed to --num_shards).
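
For context, here is a minimal sketch of a full run_deepvariant invocation with these flags wired in. The reference, BAM, output paths, and N_SHARDS value are placeholders, and the surrounding flags (--model_type, --ref, --reads, --output_vcf, --num_shards) are shown as typically used rather than prescribed by this release.

N_SHARDS=16

# Sketch of a phased long-read run; adjust paths and shard count for your data.
run_deepvariant \
  --model_type PACBIO \
  --ref hg38.fa \
  --reads pacbio_input.bam \
  --output_vcf output.vcf.gz \
  --num_shards ${N_SHARDS} \
  --make_examples_extra_args="phase_reads=true,output_phase_info=true,output_local_read_phasing=/tmp/read-phasing_debug@${N_SHARDS}.tsv" \
  --postprocess_variants_extra_args="phased_reads_input_path=/tmp/read-phasing_debug@${N_SHARDS}.tsv"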

model.example_info.json

Models can now be packaged with an extra file called model.example_info.json which carries the flags needed to generate examples (model inputs) when running inference. Here is an example of what this looks like:

{
  "version": "1.10.0-beta",
  "shape": [100, 147, 10],
  "channels": [1, 2, 3, 4, 5, 6, 7, 26, 9, 10],
  "flags_for_calling": {
    "alt_aligned_pileup": "diff_channels",
    "call_small_model_examples": true,
    "keep_supplementary_alignments": true,
    "max_reads_per_partition": 600,
    "min_mapping_quality": 1,
    "parse_sam_aux_fields": true,
    "partition_size": 25000,
    "phase_reads": true,
    "pileup_image_height": 100,
    "pileup_image_width": 147,
    "realign_reads": false,
    "small_model_indel_gq_threshold": 16,
    "small_model_snp_gq_threshold": 15,
    "small_model_vaf_context_window_size": 51,
    "sort_by_haplotypes": true,
    "track_ref_reads": true,
    "trained_small_model_path": "/opt/smallmodels/pacbio",
    "trim_reads_for_pileup": true,
    "vsc_min_fraction_indels": 0.12
  }
}

The flags used to generate examples are specific to each model, and it is important that they are set correctly so that the generated examples match the characteristics of the data the model was trained on.
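
As a quick illustration, you could inspect the packaged flags before building a pipeline around them. This assumes jq is installed and that the file sits alongside the model checkpoint directory used in the example further below; both are assumptions of this sketch.

# Print the version, input shape, and example-generation flags packaged with a model:
jq '.version, .shape, .flags_for_calling' /opt/models/pacbio/model.example_info.json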

How is model.example_info.json useful?

DeepVariant can be run in two ways. The first way is to use the run_deepvariant command, which automatically sets options and runs each stage of DeepVariant.

The second way is to run these stages (make_examples, call_variants, and postprocess_variants) individually. This method can be significantly faster and more efficient because make_examples and call_variants can be parallelized, even across multiple machines. However, this approach previously required that the flags for make_examples be set manually, which made constructing efficient pipelines tricky. With this change, users can provide the make_examples stage with the --checkpoint flag, and the model.example_info.json file will be read and used to set the flags appropriate for the given model.

Using model.example_info.json:

Here is an example illustrating how you could make use of this setup:

make_examples \
  --mode calling \
  --ref hg38.fa \
  --reads pacbio_input.bam \
  --examples "make_examples.tfrecord@${N_SHARDS}.gz" \
  --checkpoint "/opt/models/pacbio" \
  --task=1

The logs should report the flags that are then set using model.example_info.json:

[make_examples_core.py:3794] Flags for calling:
alt_aligned_pileup: diff_channels
call_small_model_examples: True
keep_supplementary_alignments: True
…
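
To round out the picture, the remaining stages might be invoked roughly as follows. The file names are illustrative, and the flags shown (--examples, --checkpoint, --outfile for call_variants; --ref, --infile, --outfile for postprocess_variants) should be checked against the documentation for this release before copying them.

# Run inference on the sharded examples produced above:
call_variants \
  --examples "make_examples.tfrecord@${N_SHARDS}.gz" \
  --checkpoint "/opt/models/pacbio" \
  --outfile "call_variants_output.tfrecord.gz"

# Convert the call_variants output into a final VCF:
postprocess_variants \
  --ref hg38.fa \
  --infile "call_variants_output.tfrecord.gz" \
  --outfile "output.vcf.gz"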

Docker Images are Streamlined

Docker images have been simplified to have fewer layers and to remove unnecessary files. The table below shows the difference in disk size and number of layers.

Version      Size   Number of Layers
1.9          6.1GB  114
1.10.0-beta  4.8GB  23

This reduces the size by ~21% and the number of layers by ~80%.
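
If you want to verify the numbers locally, something along these lines should work; note that Docker's reported size and layer count may be computed slightly differently from the table above.

docker pull google/deepvariant:1.10.0-beta

# Report the locally stored image size:
docker images google/deepvariant:1.10.0-beta

# Count the filesystem layers in the image:
docker inspect --format '{{len .RootFS.Layers}}' google/deepvariant:1.10.0-beta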

Additional Updates

This list is not exhaustive, and smaller bug fixes and improvements may not be listed here.

  • The ONT model now uses a new input channel, READ_SUPPORTS_VARIANT_FUZZY, which indicates support for a variant based on a fuzzy, rather than exact, match.
  • The ONT model now sets alt_aligned_pileup='rows', meaning that alternative alignments are encoded as additional pileup rows in the model input rather than as additional channels.
  • The PacBio model now uses the --keep_supplementary_alignments flag, which leads to a slight improvement in accuracy.
  • TensorFlow has been updated from 2.13.1 to 2.16.1.
  • CUDA has been updated from 11.8 to 12.3 and cuDNN has been updated from 8.6.0 to 8.9.0 in our GPU Docker image.
  • std::stable_sort is now used instead of std::sort for pileup image rows, which ensures consistent pileup image generation.

Note: Some outputs (e.g., the VCF header) may still report v1.9, as we did not update all version references.