Merge branch 'release/v5.0.0'

ACEnglish · Jan 10, 2025 · f90f194 · f90f194
2 parents e6a49e7 + 7380b25
commit f90f194
Show file tree

Hide file tree

Showing 748 changed files with 42,105 additions and 32,998 deletions.
diff --git a/README.md b/README.md
@@ -2,13 +2,14 @@
 [![pylint](imgs/pylint.svg)](https://github.com/acenglish/truvari/actions/workflows/pylint.yml)
 [![FuncTests](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml/badge.svg?branch=develop&event=push)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
 [![coverage](imgs/coverage.svg)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
-[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v4.3.0)](https://github.com/ACEnglish/truvari/compare/v4.3.0...develop)
+[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v4.3.1)](https://github.com/ACEnglish/truvari/compare/v4.3.1...develop)
 [![Downloads](https://static.pepy.tech/badge/truvari)](https://pepy.tech/project/truvari)
 
 ![Logo](https://raw.githubusercontent.com/ACEnglish/truvari/develop/imgs/BoxScale1_DarkBG.png)  
 Toolkit for benchmarking, merging, and annotating Structural Variants
 
-📚 [WIKI page](https://github.com/acenglish/truvari/wiki) has detailed documentation.  
+📚 [WIKI page](https://github.com/acenglish/truvari/wiki) has detailed user documentation.  
+🛠️ [Developer Docs](https://truvari.readthedocs.io/en/latest/) for the truvari API.  
 📈 See [Updates](https://github.com/acenglish/truvari/wiki/Updates) on new versions.  
 📝 Read our Papers ([#1](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02840-6), [#2](https://rdcu.be/dFQNN)) to learn more.
 

diff --git a/docs/v4.0.0/Citations.md → docs/Citations.md b/docs/v4.0.0/Citations.md → docs/Citations.md
diff --git a/...-on-Regions-with-a-High‐Density-of-SVs.md → ...-on-Regions-with-a-High‐Density-of-SVs.md b/...-on-Regions-with-a-High‐Density-of-SVs.md → ...-on-Regions-with-a-High‐Density-of-SVs.md
@@ -1,14 +1,14 @@
-Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made.
+Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference which can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made.
 
 If you find `truvari collapse` 'freezing' during processing where it is no longer writing variants to the output VCFs, you should investigate if there are regions which should be excluded from analysis. To do this, first run:
 
 ```
 truvari anno chunks input.vcf.gz > counts.bed
 ```
 
-The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 3rd column has the number of SVs inside the chunk while the 4th and 5th have a comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 4th is after separating by svtype and svlen, the 5th is re-chunking the svtype/svlen separation by distance.
+The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 4th column has the number of SVs inside the chunk while the 5th and 6th have a comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 5th is after separating by svtype and svlen, the 6th is re-chunking the svtype/svlen separation by distance.
 
-If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset the regions of interest and then use `bedtools` to create a bed file that will skip them.
+If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset to the regions of interest and then use `bedtools` to create a bed file that will skip them.
 
 ```
 # exclude regions with more than 30k SVs. You should test this threshold.
@@ -18,6 +18,6 @@ bedtools complement -g genome.bed -i to_exclude.bed > to_analyize.bed
 truvari collapse --bed to_analyze.bed -i input.vcf.gz
 ```
 
-When considering what threshold you would like to use, just looking at the 3rd column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There also is no guarantee that a region with high SV density will be slow. For example, if there are 100k SVs in a window, but they all could collapse, it would only take O(N - 1) comparisons to perform the collapse. The problems arise when the 100k SVs have few redundant SVs and therefore requires O(N**2) comparisons.
+When considering what threshold you would like to use, just looking at the 4th column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There also is no guarantee that a region with high SV density will be slow. For example, if all SVs in a chunk could collapse, it would only take O(N - 1) comparisons to complete the collapse. The problems arise when the SVs have few redundant SVs and therefore requires O(N**2) comparisons.
 
-A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded an a collapsing could be repeated until success. 
+A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded and collapsing repeated until success. 
diff --git a/docs/v4.0.0/Development.md → docs/Development.md b/docs/v4.0.0/Development.md → docs/Development.md
diff --git a/docs/v4.3.1/Home.md → docs/Home.md b/docs/v4.3.1/Home.md → docs/Home.md
diff --git a/docs/v4.3.0/Installation.md → docs/Installation.md b/docs/v4.3.0/Installation.md → docs/Installation.md
diff --git a/docs/v4.2.0/MatchIds.md → docs/MatchIds.md b/docs/v4.2.0/MatchIds.md → docs/MatchIds.md
diff --git a/docs/v4.0.0/Multi-allelic-VCFs.md → docs/Multi-allelic-VCFs.md b/docs/v4.0.0/Multi-allelic-VCFs.md → docs/Multi-allelic-VCFs.md
diff --git a/docs/v4.3.1/Updates.md → docs/Updates.md b/docs/v4.3.1/Updates.md → docs/Updates.md
@@ -1,3 +1,18 @@
+# Truvari 5.0
+*in progress*
+
+* Reference context sequence comparison is now deprecated and sequence similarity calculation improved by also checking lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants)
+* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216)
+* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.
+* Symbolic variant sub-types are ignored e.g. `<DUP:TANDEM> == <DUP>`
+* `--sizemax` now default to `-1`, meaning all variant ≥ `--sizemin / --sizefilt` are compared
+* Redundant variants which are collapsed into kept (a.k.a. removed) variants now more clearly labeled (`--removed-output` instead of `--collapsed-output`)
+* Fixed 'Unknown error' caused by unset TMPDIR ([#229](https://github.com/ACEnglish/truvari/issues/229) and [#245](https://github.com/ACEnglish/truvari/issues/245))
+* Fixes to minor record keeping bugs in refine/ga4gh better ensure all variants are counted/preserved
+* BND variants are now compared by bench ([details](https://github.com/ACEnglish/truvari/wiki/bench#bnd-comparison))
+* Cleaner outputs by not writing matching annotations (e.g. `PctSeqSimilarity`) that are `None`
+* Major refactor of Truvari package API for easy reuse of SV comparison functions ([details](https://truvari.readthedocs.io/en/latest/))
+
 # Truvari 4.3.1
 *September 9, 2024*
 * `bench`

diff --git a/docs/v4.2.2/anno.md → docs/anno.md b/docs/v4.2.2/anno.md → docs/anno.md
@@ -74,7 +74,7 @@ Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants
 ### Details
 Given a set of tandem repeat regions and a VCF, this annotate the tandem repeat motif and copy number change of insertions and deletions as expansions/contraction. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via:
 ```bash
-zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz
+zcat adotto_TRregions_v1.2.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz
 tabix anno.trf.bed.gz
 ```
 For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)]. 
@@ -116,6 +116,12 @@ options:
   --debug               Verbose logging
 ```
 
+### Notes about the Adotto catalog
+If using an older version of Truvari (≤4.3.1) or adotto catalog v1.2, you may have incompatibility issues due to the 'repeat' field being named 'motif'. This can be fixed by running:
+```
+zcat adotto_TRregions_v1.2.bed.gz | cut -f1-3,18 | sed 's/"\[/[/g; s/"$//; s/""/"/g; s/"motif":/"repeat":/g' | bgzip > anno.trf.bed.gz
+```
+
 # truvari anno grm
 
 For every SV, we create a kmer over the the upstream and downstream reference and alternate breakpoints. 

diff --git a/docs/api/conf.py b/docs/api/conf.py
@@ -17,7 +17,7 @@
 # -- Project information -----------------------------------------------------
 
 project = 'Truvari'
-copyright = '2023, Adam English'
+copyright = '2025, Adam English'
 author = 'Adam English'
 
 

diff --git a/docs/api/index.rst b/docs/api/index.rst
@@ -17,7 +17,8 @@ https://github.com/ACEnglish/truvari/wiki
    :maxdepth: 2
    :caption: Contents:
 
-   truvari
+   truvari.examples
+   truvari.package
 
 
 Indices and tables

diff --git a/docs/api/truvari.examples.rst b/docs/api/truvari.examples.rst
@@ -0,0 +1,112 @@
+Truvari API QuickStart
+======================
+
+Truvari provides functionality to facilitate the comparison of structural variants (SVs) in VCF variant records. Developers can easily leverage this functionality by replacing calls to `pysam.VariantFile` with `truvari.VariantFile`. The `truvari.VariantFile` retains all `pysam` functionality.
+
+.. code-block:: python
+
+    import truvari
+    vcf = truvari.VariantFile("input.vcf.gz")
+    for entry in vcf:
+        # Access variant's INFO fields using pysam
+        if 'SVTYPE' in entry.info and 'SVLEN' in entry.info:
+            print(entry.info['SVTYPE'], entry.info['SVLEN'])
+
+        # But these INFOs aren't always available.
+        # Access type/size properties of variants using truvari
+        print(entry.var_type(), entry.var_size())
+
+        # Access genotype (GT)
+        # using pysam ->
+        if 'SAMPLE' in entry.samples and 'GT' in entry.samples['SAMPLE']:
+            print(entry.samples['SAMPLE']['GT'])
+
+        # using truvari ->
+        print(entry.gt('SAMPLE'))
+
+        # Calculate variant's allele frequency with truvari
+        print(entry.allele_freq_annos())
+
+Details of all available functions are in :ref:`package documentation <variant_handling>`.
+
+Comparing Variants
+------------------
+
+The `truvari.VariantRecord` simplifies comparing two VCF entries.
+
+.. code-block:: python
+
+    # Given two `truvari.VariantRecords`, entry1 and entry2
+    match = entry1.match(entry2)
+    print("Entries' Sequence Similarity:", match.seqsim)
+    print("Entries' Size Similarity:", match.sizesim)
+    print("Is the match above thresholds:", match.state)
+
+This returns a :ref:`truvari.MatchResult <match_result>`. You can customize matching thresholds by providing :ref:`truvari.VariantParams <variant_params>` to the `truvari.VariantFile`.
+
+.. code-block:: python
+
+    # Disable sequence and size similarity; enable reciprocal overlap
+    p = truvari.VariantParams(pctseq=0, pctsize=0, pctovl=0.5)
+    vcf = truvari.VariantFile("input.vcf.gz", params=p)
+    entry1 = next(vcf)
+    entry2 = next(vcf)
+    match = entry1.match(entry2)
+
+Filtering Variants
+------------------
+
+The `truvari.VariantParams` provides parameters for filtering variants, such as minimum or maximum SV sizes.
+
+.. code-block:: python
+
+    p = truvari.VariantParams(sizemin=200, sizemax=500)
+    vcf = truvari.VariantFile("input.vcf.gz", params=p)
+    # Retrieve all variant records within sizemin and sizemax
+    results = [entry for entry in vcf if not entry.filter_size()]
+
+Additional filters, such as excluding monomorphic reference sites or single-end BNDs, can be applied using `entry.filter_call()`.
+
+Subsetting to Regions
+---------------------
+
+To subset a VCF to regions specified in a BED file, use:
+
+.. code-block:: python
+
+    for entry in vcf.fetch_bed("regions.bed"):
+        print("Entry's variant type:", entry.var_type())
+        print("Entry's variant size:", entry.var_size())
+
+If your regions of interest are stored in an in-memory object instead of a BED file, use the `.fetch_regions` method:
+
+.. code-block:: python
+
+    from collections import defaultdict
+    from pyintervaltree import IntervalTree
+    tree = defaultdict(IntervalTree)
+    tree['chr1'].addi(10, 100)
+    tree['chr2'].addi(2000, 2200)
+    count = 0
+    for entry in vcf.fetch_regions(tree):
+        count += 1
+    print(f"Total of {count} variants")
+
+To iterate over variants that are not within the regions, use `vcf.fetch_regions(tree, inside=False)`. Both of these
+fetch methods use heuristics to choose the more efficient fetching strategy of either seeking through the VCF file or
+streaming the entire file.
+
+Parsing BND Information
+-----------------------
+
+Truvari also simplifies parsing BND information from VCF entries:
+
+.. code-block:: python
+
+    # Example entry:
+    # chr1  23272628  SV_1  G  G]chr5:52747359]  .  PASS  SVTYPE=BND;EVENTTYPE=TRA:UNBALANCED;SUBCLONAL=n;COMPLEX=n;MATEID=SV_171  GT:PSL:PSO  0/1:.:.
+    print(entry.bnd_position())
+    # ('chr5', 52747359)
+    print(entry.bnd_direction_strand())
+    # ('right', 'direct')
+
diff --git a/docs/api/truvari.package.rst b/docs/api/truvari.package.rst
@@ -0,0 +1,164 @@
+Truvari package
+===============
+
+Overview
+--------
+.. automodule:: truvari
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. _variant_handling:
+
+Variant Handling
+----------------
+
+.. autoclass:: VariantFile
+   :members:
+
+.. autoclass:: VariantRecord
+   :members:
+
+.. _variant_params:
+
+.. autoclass:: VariantParams
+   :members:
+
+Objects
+-------
+
+.. _match_result:
+
+.. autoclass:: MatchResult
+   :members:
+
+.. autoclass:: GT
+   :members:
+
+.. autoclass:: SV
+   :members:
+
+.. autoclass:: Bench
+   :members:
+
+.. autoclass:: BenchOutput
+   :members:
+
+.. autoclass:: StatsBox
+   :members:
+
+.. autoclass:: LogFileStderr
+   :members:
+
+Extra Methods
+-------------
+.. autofunction:: bed_ranges
+
+.. autofunction:: benchdir_count_entries
+
+.. autofunction:: best_seqsim
+
+.. autofunction:: build_region_tree
+
+.. autofunction:: read_bed_tree
+
+.. autofunction:: check_vcf_index
+
+.. autofunction:: chunker
+
+.. autofunction:: cmd_exe
+
+.. autofunction:: compress_index_vcf
+
+.. autofunction:: coords_within
+
+.. autofunction:: count_entries
+
+.. autofunction:: extend_region_tree
+
+.. autofunction:: file_zipper
+
+.. autofunction:: help_unknown_cmd
+
+.. autofunction:: get_gt
+
+.. autofunction:: get_scalebin
+
+.. autofunction:: get_sizebin
+
+.. autofunction:: get_svtype
+
+.. autofunction:: make_temp_filename
+
+.. autofunction:: merge_region_tree_overlaps
+
+.. autofunction:: msa2vcf
+
+.. autofunction:: opt_gz_open
+
+.. autofunction:: optimize_df_memory
+
+.. autofunction:: overlap_percent
+
+.. autofunction:: overlaps
+
+.. autofunction:: performance_metrics
+
+.. autofunction:: phab
+
+.. autofunction:: reciprocal_overlap
+
+.. autofunction:: restricted_float
+
+.. autofunction:: restricted_int
+
+.. autofunction:: ref_ranges
+
+.. autofunction:: roll_seqsim
+
+.. autofunction:: seqsim
+
+.. autofunction:: setup_logging
+
+.. autofunction:: sizesim
+
+.. autofunction:: unroll_seqsim
+
+.. autofunction:: vcf_ranges
+
+.. autofunction:: vcf_to_df
+
+Data
+----
+HEADERMAT
+^^^^^^^^^
+regular expression of vcf header INFO/FORMAT fields with groups
+
+.. autodata:: HEADERMAT
+
+
+QUALBINS
+^^^^^^^^
+0-100 quality score bin strings (step size 10)
+
+.. autodata:: QUALBINS
+
+SVTYTYPE
+^^^^^^^^
+:class:`pandas.CategoricalDtype` of :class:`truvari.SV`
+
+SZBINMAX
+^^^^^^^^
+integer list of maximum size for size bins
+
+.. autodata:: SZBINMAX
+
+SZBINS
+^^^^^^
+string list of size bins
+
+.. autodata:: SZBINS
+
+SZBINTYPE
+^^^^^^^^^
+:class:`pandas.CategoricalDtype` of :data:`truvari.SZBINS`