Skip to content

Commit

Permalink
Merge branch 'release/v5.0.0'
Browse files Browse the repository at this point in the history
  • Loading branch information
ACEnglish committed Jan 10, 2025
2 parents e6a49e7 + 7380b25 commit f90f194
Show file tree
Hide file tree
Showing 748 changed files with 42,105 additions and 32,998 deletions.
5 changes: 3 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,14 @@
[![pylint](imgs/pylint.svg)](https://github.com/acenglish/truvari/actions/workflows/pylint.yml)
[![FuncTests](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml/badge.svg?branch=develop&event=push)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
[![coverage](imgs/coverage.svg)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v4.3.0)](https://github.com/ACEnglish/truvari/compare/v4.3.0...develop)
[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v4.3.1)](https://github.com/ACEnglish/truvari/compare/v4.3.1...develop)
[![Downloads](https://static.pepy.tech/badge/truvari)](https://pepy.tech/project/truvari)

![Logo](https://raw.githubusercontent.com/ACEnglish/truvari/develop/imgs/BoxScale1_DarkBG.png)
Toolkit for benchmarking, merging, and annotating Structural Variants

📚 [WIKI page](https://github.com/acenglish/truvari/wiki) has detailed documentation.
📚 [WIKI page](https://github.com/acenglish/truvari/wiki) has detailed user documentation.
🛠️ [Developer Docs](https://truvari.readthedocs.io/en/latest/) for the truvari API.
📈 See [Updates](https://github.com/acenglish/truvari/wiki/Updates) on new versions.
📝 Read our Papers ([#1](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02840-6), [#2](https://rdcu.be/dFQNN)) to learn more.

Expand Down
File renamed without changes.
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made.
Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference which can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made.

If you find `truvari collapse` 'freezing' during processing where it is no longer writing variants to the output VCFs, you should investigate if there are regions which should be excluded from analysis. To do this, first run:

```
truvari anno chunks input.vcf.gz > counts.bed
```

The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 3rd column has the number of SVs inside the chunk while the 4th and 5th have a comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 4th is after separating by svtype and svlen, the 5th is re-chunking the svtype/svlen separation by distance.
The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 4th column has the number of SVs inside the chunk while the 5th and 6th have a comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 5th is after separating by svtype and svlen, the 6th is re-chunking the svtype/svlen separation by distance.

If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset the regions of interest and then use `bedtools` to create a bed file that will skip them.
If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset to the regions of interest and then use `bedtools` to create a bed file that will skip them.

```
# exclude regions with more than 30k SVs. You should test this threshold.
Expand All @@ -18,6 +18,6 @@ bedtools complement -g genome.bed -i to_exclude.bed > to_analyize.bed
truvari collapse --bed to_analyze.bed -i input.vcf.gz
```

When considering what threshold you would like to use, just looking at the 3rd column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There also is no guarantee that a region with high SV density will be slow. For example, if there are 100k SVs in a window, but they all could collapse, it would only take O(N - 1) comparisons to perform the collapse. The problems arise when the 100k SVs have few redundant SVs and therefore requires O(N**2) comparisons.
When considering what threshold you would like to use, just looking at the 4th column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There also is no guarantee that a region with high SV density will be slow. For example, if all SVs in a chunk could collapse, it would only take O(N - 1) comparisons to complete the collapse. The problems arise when the SVs have few redundant SVs and therefore requires O(N**2) comparisons.

A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded an a collapsing could be repeated until success.
A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded and collapsing repeated until success.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
15 changes: 15 additions & 0 deletions docs/v4.3.1/Updates.md → docs/Updates.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,18 @@
# Truvari 5.0
*in progress*

* Reference context sequence comparison is now deprecated and sequence similarity calculation improved by also checking lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants)
* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216)
* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.
* Symbolic variant sub-types are ignored e.g. `<DUP:TANDEM> == <DUP>`
* `--sizemax` now default to `-1`, meaning all variant ≥ `--sizemin / --sizefilt` are compared
* Redundant variants which are collapsed into kept (a.k.a. removed) variants now more clearly labeled (`--removed-output` instead of `--collapsed-output`)
* Fixed 'Unknown error' caused by unset TMPDIR ([#229](https://github.com/ACEnglish/truvari/issues/229) and [#245](https://github.com/ACEnglish/truvari/issues/245))
* Fixes to minor record keeping bugs in refine/ga4gh better ensure all variants are counted/preserved
* BND variants are now compared by bench ([details](https://github.com/ACEnglish/truvari/wiki/bench#bnd-comparison))
* Cleaner outputs by not writing matching annotations (e.g. `PctSeqSimilarity`) that are `None`
* Major refactor of Truvari package API for easy reuse of SV comparison functions ([details](https://truvari.readthedocs.io/en/latest/))

# Truvari 4.3.1
*September 9, 2024*
* `bench`
Expand Down
8 changes: 7 additions & 1 deletion docs/v4.2.2/anno.md → docs/anno.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,7 @@ Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants
### Details
Given a set of tandem repeat regions and a VCF, this annotate the tandem repeat motif and copy number change of insertions and deletions as expansions/contraction. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via:
```bash
zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz
zcat adotto_TRregions_v1.2.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz
tabix anno.trf.bed.gz
```
For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)].
Expand Down Expand Up @@ -116,6 +116,12 @@ options:
--debug Verbose logging
```

### Notes about the Adotto catalog
If using an older version of Truvari (≤4.3.1) or adotto catalog v1.2, you may have incompatibility issues due to the 'repeat' field being named 'motif'. This can be fixed by running:
```
zcat adotto_TRregions_v1.2.bed.gz | cut -f1-3,18 | sed 's/"\[/[/g; s/"$//; s/""/"/g; s/"motif":/"repeat":/g' | bgzip > anno.trf.bed.gz
```

# truvari anno grm

For every SV, we create a kmer over the the upstream and downstream reference and alternate breakpoints.
Expand Down
2 changes: 1 addition & 1 deletion docs/api/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
# -- Project information -----------------------------------------------------

project = 'Truvari'
copyright = '2023, Adam English'
copyright = '2025, Adam English'
author = 'Adam English'


Expand Down
3 changes: 2 additions & 1 deletion docs/api/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,8 @@ https://github.com/ACEnglish/truvari/wiki
:maxdepth: 2
:caption: Contents:

truvari
truvari.examples
truvari.package


Indices and tables
Expand Down
112 changes: 112 additions & 0 deletions docs/api/truvari.examples.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
Truvari API QuickStart
======================

Truvari provides functionality to facilitate the comparison of structural variants (SVs) in VCF variant records. Developers can easily leverage this functionality by replacing calls to `pysam.VariantFile` with `truvari.VariantFile`. The `truvari.VariantFile` retains all `pysam` functionality.

.. code-block:: python
import truvari
vcf = truvari.VariantFile("input.vcf.gz")
for entry in vcf:
# Access variant's INFO fields using pysam
if 'SVTYPE' in entry.info and 'SVLEN' in entry.info:
print(entry.info['SVTYPE'], entry.info['SVLEN'])
# But these INFOs aren't always available.
# Access type/size properties of variants using truvari
print(entry.var_type(), entry.var_size())
# Access genotype (GT)
# using pysam ->
if 'SAMPLE' in entry.samples and 'GT' in entry.samples['SAMPLE']:
print(entry.samples['SAMPLE']['GT'])
# using truvari ->
print(entry.gt('SAMPLE'))
# Calculate variant's allele frequency with truvari
print(entry.allele_freq_annos())
Details of all available functions are in :ref:`package documentation <variant_handling>`.

Comparing Variants
------------------

The `truvari.VariantRecord` simplifies comparing two VCF entries.

.. code-block:: python
# Given two `truvari.VariantRecords`, entry1 and entry2
match = entry1.match(entry2)
print("Entries' Sequence Similarity:", match.seqsim)
print("Entries' Size Similarity:", match.sizesim)
print("Is the match above thresholds:", match.state)
This returns a :ref:`truvari.MatchResult <match_result>`. You can customize matching thresholds by providing :ref:`truvari.VariantParams <variant_params>` to the `truvari.VariantFile`.

.. code-block:: python
# Disable sequence and size similarity; enable reciprocal overlap
p = truvari.VariantParams(pctseq=0, pctsize=0, pctovl=0.5)
vcf = truvari.VariantFile("input.vcf.gz", params=p)
entry1 = next(vcf)
entry2 = next(vcf)
match = entry1.match(entry2)
Filtering Variants
------------------

The `truvari.VariantParams` provides parameters for filtering variants, such as minimum or maximum SV sizes.

.. code-block:: python
p = truvari.VariantParams(sizemin=200, sizemax=500)
vcf = truvari.VariantFile("input.vcf.gz", params=p)
# Retrieve all variant records within sizemin and sizemax
results = [entry for entry in vcf if not entry.filter_size()]
Additional filters, such as excluding monomorphic reference sites or single-end BNDs, can be applied using `entry.filter_call()`.

Subsetting to Regions
---------------------

To subset a VCF to regions specified in a BED file, use:

.. code-block:: python
for entry in vcf.fetch_bed("regions.bed"):
print("Entry's variant type:", entry.var_type())
print("Entry's variant size:", entry.var_size())
If your regions of interest are stored in an in-memory object instead of a BED file, use the `.fetch_regions` method:

.. code-block:: python
from collections import defaultdict
from pyintervaltree import IntervalTree
tree = defaultdict(IntervalTree)
tree['chr1'].addi(10, 100)
tree['chr2'].addi(2000, 2200)
count = 0
for entry in vcf.fetch_regions(tree):
count += 1
print(f"Total of {count} variants")
To iterate over variants that are not within the regions, use `vcf.fetch_regions(tree, inside=False)`. Both of these
fetch methods use heuristics to choose the more efficient fetching strategy of either seeking through the VCF file or
streaming the entire file.

Parsing BND Information
-----------------------

Truvari also simplifies parsing BND information from VCF entries:

.. code-block:: python
# Example entry:
# chr1 23272628 SV_1 G G]chr5:52747359] . PASS SVTYPE=BND;EVENTTYPE=TRA:UNBALANCED;SUBCLONAL=n;COMPLEX=n;MATEID=SV_171 GT:PSL:PSO 0/1:.:.
print(entry.bnd_position())
# ('chr5', 52747359)
print(entry.bnd_direction_strand())
# ('right', 'direct')
164 changes: 164 additions & 0 deletions docs/api/truvari.package.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,164 @@
Truvari package
===============

Overview
--------
.. automodule:: truvari
:members:
:undoc-members:
:show-inheritance:

.. _variant_handling:

Variant Handling
----------------

.. autoclass:: VariantFile
:members:

.. autoclass:: VariantRecord
:members:

.. _variant_params:

.. autoclass:: VariantParams
:members:

Objects
-------

.. _match_result:

.. autoclass:: MatchResult
:members:

.. autoclass:: GT
:members:

.. autoclass:: SV
:members:

.. autoclass:: Bench
:members:

.. autoclass:: BenchOutput
:members:

.. autoclass:: StatsBox
:members:

.. autoclass:: LogFileStderr
:members:

Extra Methods
-------------
.. autofunction:: bed_ranges

.. autofunction:: benchdir_count_entries

.. autofunction:: best_seqsim

.. autofunction:: build_region_tree

.. autofunction:: read_bed_tree

.. autofunction:: check_vcf_index

.. autofunction:: chunker

.. autofunction:: cmd_exe

.. autofunction:: compress_index_vcf

.. autofunction:: coords_within

.. autofunction:: count_entries

.. autofunction:: extend_region_tree

.. autofunction:: file_zipper

.. autofunction:: help_unknown_cmd

.. autofunction:: get_gt

.. autofunction:: get_scalebin

.. autofunction:: get_sizebin

.. autofunction:: get_svtype

.. autofunction:: make_temp_filename

.. autofunction:: merge_region_tree_overlaps

.. autofunction:: msa2vcf

.. autofunction:: opt_gz_open

.. autofunction:: optimize_df_memory

.. autofunction:: overlap_percent

.. autofunction:: overlaps

.. autofunction:: performance_metrics

.. autofunction:: phab

.. autofunction:: reciprocal_overlap

.. autofunction:: restricted_float

.. autofunction:: restricted_int

.. autofunction:: ref_ranges

.. autofunction:: roll_seqsim

.. autofunction:: seqsim

.. autofunction:: setup_logging

.. autofunction:: sizesim

.. autofunction:: unroll_seqsim

.. autofunction:: vcf_ranges

.. autofunction:: vcf_to_df

Data
----
HEADERMAT
^^^^^^^^^
regular expression of vcf header INFO/FORMAT fields with groups

.. autodata:: HEADERMAT


QUALBINS
^^^^^^^^
0-100 quality score bin strings (step size 10)

.. autodata:: QUALBINS

SVTYTYPE
^^^^^^^^
:class:`pandas.CategoricalDtype` of :class:`truvari.SV`

SZBINMAX
^^^^^^^^
integer list of maximum size for size bins

.. autodata:: SZBINMAX

SZBINS
^^^^^^
string list of size bins

.. autodata:: SZBINS

SZBINTYPE
^^^^^^^^^
:class:`pandas.CategoricalDtype` of :data:`truvari.SZBINS`
Loading

0 comments on commit f90f194

Please sign in to comment.