Merge branch 'release/v5.1.1'
ACEnglish committed Feb 5, 2025
2 parents 7a45d98 + 61248ee commit 14fa7b6
Showing 153 changed files with 9,103 additions and 7,463 deletions.
33 changes: 32 additions & 1 deletion docs/Updates.md
@@ -1,6 +1,37 @@
# Truvari 5.0
# Truvari 5.1.1
*in progress*

* `bench`
* new automatic hook into the refine step via `truvari bench --refine`
* `refine`
* completely reworked UI in favor of easier whole-genome SV refinement. See wiki for details
* Now writes a consolidated `refine.base.vcf.gz` and `refine.comp.vcf.gz` for easier tracking of variants' final states.
* Default behavior counts original variant representations instead of the `phab` variant representations
* `collapse`
* Add `--dup-to-ins`
* Fixed bug where regions with >100 variants would sometimes not have all variants compared
* `--chain` functionality now capped to do only 1 transitive match, preventing uncontrolled over-merging
* `ga4gh`
* New/renamed parameters as part of general improvement work
* Output suffixes are now `.base.vcf.gz` and `.comp.vcf.gz` for consistency.
* `stratify`
* `--complement` now outputs a single line of total variant counts outside of the regions instead of arbitrarily assigning variants to their nearest region

* misc
* Fix BND bugs
* `pysam.VariantFile.allele_variant_types` falsely identified some BNDs as INDELs, causing incorrect filtering by Truvari
* Strandedness of SVs decomposed to BNDs flipped to be more representative of the original SV
* unroll seqsim checks all directions
* Match sorting breaks seq/size ties with start/end distance
* Long SV roll limit speeds up comparison: for SVs ≥500bp, rolling is turned off
* `truvari.VariantRecord.within` edge case fix




# Truvari 5.0
*January 9, 2025*

* Reference context sequence comparison is now deprecated and sequence similarity calculation improved by also checking lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants)
* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216)
* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.
24 changes: 20 additions & 4 deletions docs/bench.md
@@ -178,17 +178,33 @@ Truvari can replace the symbolic alt of resolved SVs in the output VCF with the

BND Comparison
==============
Breakend (BND) variants are compared by checking a few conditions using a single threshold of `--bnddist` which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist` 'buffered' boundaries (e.g. `overlaps( POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`
Breakend (BND) variants are compared by checking a few conditions using a single threshold of `--bnddist`, which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist` 'buffered' boundaries (e.g. `overlaps(POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`). Additionally, if the CIPOS and CIEND info tags are available in the entry, the position (e.g. POS) is further buffered by `-abs(CIPOS[0])` and `+abs(CIPOS[1])`.
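
As a rough illustration of the buffered-boundary check described above, here is a minimal Python sketch. The function name `bnd_positions_overlap`, its arguments, and the use of inclusive bounds are assumptions made for this example; it is not Truvari's actual implementation.

```python
def bnd_positions_overlap(pos_base, pos_comp, dist,
                          cipos_base=(0, 0), cipos_comp=(0, 0)):
    """Hypothetical helper: do the buffered windows around two positions overlap?"""
    def window(pos, cipos):
        # Buffer by dist on both sides, then widen further by the CIPOS interval
        start = pos - dist - abs(cipos[0])
        end = pos + dist + abs(cipos[1])
        return start, end

    s1, e1 = window(pos_base, cipos_base)
    s2, e2 = window(pos_comp, cipos_comp)
    # Assumption: bounds are treated as inclusive
    return max(s1, s2) <= min(e1, e2)
```

Because both positions are buffered, two breakpoints can be up to `2 * dist` apart and still have overlapping windows in this sketch.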

The baseline and comparison BNDs' POS and their joined position must both be within `--bnddist` to be a match candidate (i.e. no partial matches). Furthermore, the direction and strand of the two BNDs must match, for example `t[p[` (piece extending to the right of p is joined after t) only matches with `t[p[` and won't match to `[p[t` (reverse comp piece extending right of p is joined before t).

BND's are annotated in the truvari output with fields: StartDistance (baseline minus comparison POS); EndDistance (baseline minus comparison join position); TruScore which describes the percent of the allowed distance needed to find this match (`(1 - ((abs(StartDistance) + abs(EndDistance)) / 2) / (bnddist*2)) * 100`). For example, two BNDs 20bp apart with bnddist of 100 makes a score of 90.
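
The TruScore formula above can be written out as a small Python function. This is only a sketch of the arithmetic as stated in the text; the function name `tru_score` is made up for illustration.

```python
def tru_score(start_distance, end_distance, bnddist):
    """Percent of the allowed distance left over after finding the match."""
    # Mean absolute distance of the POS and the joined position
    mean_dist = (abs(start_distance) + abs(end_distance)) / 2
    # The allowed distance is bnddist on each side, hence bnddist * 2
    return (1 - mean_dist / (bnddist * 2)) * 100
```

With `StartDistance` and `EndDistance` both 20 and `bnddist` of 100, the mean distance is 20 out of an allowed 200, which gives the score of 90 from the example in the text.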

Another complication for matching BNDs is that they may represent an event which could be 'resolved' in another VCFs. For example, a tandem duplication between `start-end` could be represented as two BNDs of `start to N[{chrom}:{end}[` and `end to ]{self.chrom}:{start}]N`. Therefore, truvari also attempts to compare symbolic alt SVs (ALT = `<DEL>`, `<INV>`, `<DUP>`) to a BND by decomposing the symbolic alt into its breakpoints. These decomposed BNDs are then each checked against a comparison BND and the highest TruScore match kept.
BND comparison can be turned off by setting `--bnddist -1`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored.

Note that DUPs are always decomposed to DUP:TANDEM breakpoints. Note that with `--pick single`, a decomposed SV will only match to one BND, so `--pick multi` is recommended to ensure all BNDs will match to a single decomposed SV.
Cross-Representation Matching
=============================

BND comparison can be turned off by setting `--bnddist -1`. Symbolic ALT decomposition can be turned off with `--no-decompose`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored.
Truvari considers there to be three possible representation styles of SVs.

1. Resolved: SVs with the full REF and ALT sequences, most frequently representing INS and DEL.
2. Symbolic: SVs without the REF or ALT sequences having an ALT of e.g. `<DEL>, <DUP>`, etc.
3. BNDs: SV breakends represented with the e.g. `t[p[` ALT field.
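
To make the three styles concrete, the sketch below guesses the style from the ALT string alone. The function `representation_style` is hypothetical and deliberately simplistic (single-end BNDs such as `TTT.` are not handled); Truvari's own classification may differ.

```python
def representation_style(alt):
    """Hypothetical classifier of an SV's representation style from its ALT."""
    # Symbolic alleles are wrapped in angle brackets, e.g. <DEL>
    if alt.startswith("<") and alt.endswith(">"):
        return "Symbolic"
    # Paired breakends use square brackets around the mate position, e.g. t[p[
    if "[" in alt or "]" in alt:
        return "BND"
    # Otherwise assume a full ALT sequence (single-end BNDs are not handled here)
    return "Resolved"
```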

Comparing SVs across these representation styles has the following caveats:

1. When comparing Resolved and Symbolic SVs, sequence similarity is turned off for thresholding matches. If a user provides a `--reference`, symbolic SVs shorter than the `--max-resolve` parameter (default 25kbp) can be turned into Resolved SVs ([details in API docs](https://truvari.readthedocs.io/en/latest/truvari.package.html#truvari.VariantRecord.resolve)), and the sequence similarity thresholds are then still enforced.
2. When a BND is compared with a Resolved or Symbolic SV, the SV is 'decomposed' into a set of BNDs and each is compared with the original BND. If any of the decomposed BNDs matches the original BND, the Resolved/Symbolic SV and BND are considered matching. Details of SV decomposition are [in the API docs](https://truvari.readthedocs.io/en/latest/truvari.package.html#truvari.VariantRecord.decompose).

Note that only Deletions (symbolic or resolved), INV (symbolic or resolved), and symbolic DUPs can be decomposed into BNDs. DUPs are always decomposed into DUP:TANDEM breakends.

Because SVs decompose into multiple BNDs (2 for DEL/DUP, 4 for INV), and because `--pick single` is the default, a decomposed SV will only match to one BND and that BND's 'mate' will be a FN. To enable all BNDs to match to a decomposed SV, specify `--pick multi`.

SV decomposition into BNDs can be turned off with `--no-decompose`.

Controlling the number of matches
=================================
92 changes: 33 additions & 59 deletions docs/collapse.md
@@ -8,6 +8,7 @@ To start, we merge multiple VCFs (each with their own sample) and ensure there a
```bash
bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz
```
WARNING! If you have symbolic variants, see [the below section](https://github.com/ACEnglish/truvari/wiki/collapse#symbolic-variants) on using bcftools.

This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt.
For example, consider two vcfs:
@@ -40,6 +41,26 @@ For example, if we collapsed our example merge.vcf by matching any calls within
>> truvari_collapsed.vcf
chr1 7 ... GT ./. 0/1

Symbolic Variants
=================
bcftools may not handle symbolic variants correctly since it doesn't consider their END position. To correct for this, ensure that every input variant has a unique ID and use `bcftools merge -m id`. For example:
```
# A.vcf
chr1 147022730 SV1 N <DEL> . PASS SVLEN=-570334;END=147593064
# B.vcf
chr1 147022730 SV2 N <DEL> . PASS SVLEN=-990414;END=148013144
# bcftools merge -m none A.vcf B.vcf
# Premature collapse
chr1 147022730 SV1;SV2 N <DEL> . PASS SVLEN=-570334;END=147593064
# bcftools merge -m id A.vcf B.vcf
chr1 147022730 SV1 N <DEL> . PASS SVLEN=-570334;END=147593064
chr1 147022730 SV2 N <DEL> . PASS SVLEN=-990414;END=148013144
```

This bug has been replicated with bcftools 1.18 and 1.21.
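
The effect shown above can be mimicked with a toy model in Python. This is not how bcftools is implemented; it only illustrates, under the assumption that records are grouped by a key, why a key of (chrom, pos, ref, alt) cannot distinguish two symbolic deletions with different END values while a key of ID can. The helper `group_records` is hypothetical.

```python
from collections import defaultdict

# (chrom, pos, id, ref, alt, end) taken from the A.vcf / B.vcf example above
records = [
    ("chr1", 147022730, "SV1", "N", "<DEL>", 147593064),
    ("chr1", 147022730, "SV2", "N", "<DEL>", 148013144),
]

def group_records(records, by_id=False):
    """Toy grouping: which record IDs end up on the same output line."""
    groups = defaultdict(list)
    for chrom, pos, vid, ref, alt, end in records:
        # END is never part of the key, so it cannot separate the two deletions
        key = vid if by_id else (chrom, pos, ref, alt)
        groups[key].append(vid)
    return list(groups.values())
```

Grouping by position and alleles puts `SV1` and `SV2` on one line, while grouping by ID keeps them apart, which is why unique IDs are a prerequisite for the workaround.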

--choose behavior
=================
When collapsing, the default `--choose` behavior is to take the `first` variant by position from a cluster to
@@ -89,18 +110,22 @@ will become:
Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means that each variant only needs to match at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one.
For example, if we have

chr1 5 ..
chr1 1 ..
chr1 4 ..
chr1 7 ..
chr1 9 ..
chr1 10 ..

When we collapse anything within 2bp of each other, without `--chain`, we output:
We take the `chr1 1` variant and find all its matches. When we collapse anything within 5bp of each other, without `--chain`, we output:

chr1 5 ..
chr1 9 ..
chr1 1 ..
chr1 7 ..

With `--chain`, we would allow one level of transitive matching. This means that after finding the `chr1 1 -> chr1 4` match, we check `chr1 4` against all the remaining variants and would output:

With `--chain`, we would collapse `chr1 9` as well, producing
chr1 1 ..
chr1 10 ..

chr1 5 ..
Note that this leaves `chr1 10` because we don't do multiple levels of transitive matching, meaning we never compare `chr1 7` to `chr1 10`. This is preferred because otherwise variants which have a continuous range of similarity could all be collapsed into a single variant. For example, if the positions in this example were sizes, we wouldn't want the 1bp variant to be the kept representation for all the variants.
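
The example above can be reproduced with a small Python sketch that collapses bare positions. It assumes a greedy left-to-right pass and a distance-only match, and the function `collapse_positions` is hypothetical; real collapsing compares full variants against several thresholds.

```python
def collapse_positions(positions, max_dist, chain=False):
    """Toy collapse of positions; returns the kept representatives."""
    remaining = sorted(positions)
    kept = []
    while remaining:
        keep = remaining.pop(0)
        kept.append(keep)
        # Direct matches: everything within max_dist of the kept variant
        direct = [p for p in remaining if abs(p - keep) <= max_dist]
        remaining = [p for p in remaining if p not in direct]
        if chain:
            # One level of transitive matching: compare the direct matches
            # (but not their matches) against whatever is left
            transitive = [p for p in remaining
                          if any(abs(p - d) <= max_dist for d in direct)]
            remaining = [p for p in remaining if p not in transitive]
    return kept
```

Called on positions 1, 4, 7 and 10 with a 5bp distance, the sketch keeps 1 and 7 without chaining and 1 and 10 with chaining, matching the outputs listed above.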

Annotations
===========
@@ -111,55 +136,4 @@ The output file has only two annotations added to the `INFO`.
- `NumCollapsed` - Number of variants collapsed into this variant
- `NumConsolidated` - Number of samples' genotypes consolidated into this call's genotypes

The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.

```
usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug]
[-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain]
[--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly]
Structural variant collapser
Will collapse all variants within sizemin/max that match over thresholds
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Comparison set of calls
-o OUTPUT, --output OUTPUT
Output vcf (stdout)
-c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT
Where collapsed variants are written (collapsed.vcf)
-f REFERENCE, --reference REFERENCE
Indexed fasta used to call variants
-k {first,maxqual,common}, --keep {first,maxqual,common}
When collapsing calls, which one to keep (first)
--debug Verbose logging
--hap Collapsing a single individual's haplotype resolved calls (False)
--chain Chain comparisons to extend possible collapsing (False)
--no-consolidate Skip consolidation of sample genotype fields (True)
--null-consolidate NULL_CONSOLIDATE
Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null
from all neighbors (None)
Comparison Threshold Arguments:
-r REFDIST, --refdist REFDIST
Max reference location distance (500)
-p PCTSIM, --pctsim PCTSIM
Min percent allele sequence similarity. Set to 0 to ignore. (0.95)
-B MINHAPLEN, --minhaplen MINHAPLEN
Minimum haplotype sequence length to create (50)
-P PCTSIZE, --pctsize PCTSIZE
Min pct allele size similarity (minvarsize/maxvarsize) (0.95)
-O PCTOVL, --pctovl PCTOVL
Min pct reciprocal overlap (0.0) for DEL events
-t, --typeignore Variant types don't need to match to compare (False)
--use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False)
Filtering Arguments:
-s SIZEMIN, --sizemin SIZEMIN
Minimum variant size to consider for comparison (50)
-S SIZEMAX, --sizemax SIZEMAX
Maximum variant size to consider for comparison (50000)
--passonly Only consider calls with FILTER == PASS
```
The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.