VCF4.4 SVLEN requirement across different variant representations #769
Comments
Yes. There are indeed two definitions: For <DEL>, We could have split it up into SVLEN defining END (thus always SVLEN=1 for <INS>) and had a separate field (e.g.
Almost. Your example is a bit unusual in that you're defining a novel TR insertion. Generally speaking, TR callers report expansion/contraction of existing TRs and the The other issue is that you've defined both a sequence allele and a symbolic <INS> allele for the same variant. Copy number records are in their own category so you can write a
Technically it's not enforced - the wording is "should" not "must". If an implementation-defined field has a meaningful implementation-defined interpretation of SVLEN then the specs will allow it. If you've got a file that has SVLEN defined then it's not an invalid VCF, it's just not following the recommendations. (There are draft 'SAM/VCF strict' specs designed as a set of validation rules to highlight issues with technically-compliant SAM/VCF files, but they're still sitting as a PR as nobody's writing a validator that would use them.) ^ Up to VCFv4.3, SVLEN was defined as the length difference between REF and ALT, so it was meaningless for <INV>, and in practice every caller that reported <INV> used the VCFv4.4 redefinition as the length of the SV anyway.
@d-cameron Can you please point to the issue/pull request that discusses deprecation of END? I completely missed it. I see two problems with it:
The backward incompatibility is what worries me more. What is the advantage of deprecating it? I understand the desire to remove a certain amount of redundancy; however, existing programs will stop functioning with such files.
@davmlaw Sure. But that's a commit, not a discussion. An important decision like this should be discussed publicly.
There were lots of discussions on this that went on for months, plus long GA4GH file formats committee discussions over Zoom. Also see #758. I would urge you to take part in the discussions and track the VCF PRs here if you wish to be kept in the loop on upcoming changes.
As for why it was done, the fact is END was pretty broken when there was more than one sample, as every sample could have its own end. It's never really worked well, and this isn't just an SV issue (although it gets worse there). If I recall, the policy was that END would be for indexing only, representing the largest size, with the expectation that tools will have to post-filter if they wish to do sample-specific queries within a region, as the format simply doesn't allow that to be done correctly when using INFO/END. I'm not that familiar with all the ins and outs though, so I mostly left the discussions to people better informed. (Hopefully they will correct me if my recollection is wrong.)
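The indexing-versus-post-filter point above can be illustrated with a small sketch. It is not taken from the specification or from any existing tool: the function names and the `sample_ends` mapping (sample name to that sample's own end position) are hypothetical, and it assumes 1-based inclusive coordinates.

```python
from typing import Dict, List


def record_end_for_indexing(sample_ends: Dict[str, int]) -> int:
    """Record-level END used only for indexing: the largest end over all samples."""
    return max(sample_ends.values())


def overlaps(start: int, end: int, region_start: int, region_end: int) -> bool:
    """True if the closed intervals [start, end] and [region_start, region_end] overlap."""
    return start <= region_end and end >= region_start


def samples_overlapping(pos: int, sample_ends: Dict[str, int],
                        region_start: int, region_end: int) -> List[str]:
    """Post-filter step: keep only samples whose own interval overlaps the region."""
    return [name for name, end in sample_ends.items()
            if overlaps(pos, end, region_start, region_end)]


# Hypothetical record at POS=1000 with two samples that end at different positions.
pos = 1000
sample_ends = {"S1": 1500, "S2": 5000}
index_end = record_end_for_indexing(sample_ends)

# A region query on 3000-4000 hits the record through the index...
record_hit = overlaps(pos, index_end, 3000, 4000)
# ...but only one sample actually overlaps that region.
sample_hits = samples_overlapping(pos, sample_ends, 3000, 4000)
```

With these made-up values the index reports a hit for the record, while the post-filter keeps only `S2`, which is the mismatch described in the comment.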
I would like some confirmation on the definition of SVLEN when used with different representations of specific variants. I used Tandem Repeats as an example but that could be applicable to others.
SVLEN should be empty. (Section 3 - SVLEN: "The missing value . should be used for all other ALT alleles, including ALT alleles using breakend notation")
<CNV:TR>, in which case it is an SV and SVLEN represents the length of the reference allele, or 1 if novel. (Section 5.7: "The SVLEN of the <CNV:TR> is the length of the reference allele. It is not the length of the <CNV:TR> allele")
<INS> or <DEL>, in which case SVLEN is the length of the actual inserted or deleted bases. (Section 3 - SVLEN: "SVLEN is defined for INS, DUP, INV, and DEL symbolic alleles as the number of the inserted, duplicated, inverted, and deleted bases respectively.")
Example below
Did I interpret the specs correctly?
I see two definitions of SVLEN: length of the structural variant or length of the reference allele, and I find this confusing (I expect I won't be the only one). Was this intended?
Is it necessary to enforce the absence of an SVLEN value for non-symbolic alleles?
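For readers who want to check their own files against the wording quoted above, here is a minimal sketch of the rule as read in this issue. It is not the author's example and not a validator: `svlen_expected` and `check_svlen` are hypothetical helpers, the ALT alleles are passed as plain strings, and a missing value (`.`) is represented as `None`. Treating every `<CNV...>` allele like the quoted `<CNV:TR>` case is an assumption of the sketch.

```python
from typing import List, Optional, Tuple

# Symbolic SV types named in the quoted Section 3 text.
SV_TYPES_WITH_SVLEN = {"INS", "DUP", "INV", "DEL"}


def svlen_expected(alt: str) -> bool:
    """True if an SVLEN value is expected for this ALT allele, False if '.' is expected."""
    # Sequence alleles and breakend notation are not wrapped in <...>.
    if not (alt.startswith("<") and alt.endswith(">")):
        return False
    top_level = alt[1:-1].split(":")[0]  # <DUP:TANDEM> -> DUP, <CNV:TR> -> CNV
    # Assumption: all <CNV...> alleles are handled like the quoted <CNV:TR> case.
    return top_level in SV_TYPES_WITH_SVLEN or top_level == "CNV"


def check_svlen(alts: List[str],
                svlen_values: List[Optional[int]]) -> List[Tuple[str, str]]:
    """Return (alt, message) pairs where SVLEN departs from the recommendation."""
    notes = []
    for alt, value in zip(alts, svlen_values):
        if svlen_expected(alt) and value is None:
            notes.append((alt, "SVLEN is missing for a symbolic SV allele"))
        elif not svlen_expected(alt) and value is not None:
            notes.append((alt, "SVLEN should be '.' for this allele"))
    return notes


# One symbolic deletion and one sequence allele, both given an SVLEN value.
notes = check_svlen(["<DEL>", "ACGTACGT"], [50, 4])
```

Because the specification says "should" rather than "must", the returned messages are best treated as warnings, not as evidence that the file is invalid.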