VCF specs - Clarification of <DUP> vs <DUP:TANDEM> #811

Open
davmlaw opened this issue Feb 5, 2025 · 6 comments

davmlaw commented Feb 5, 2025

Hi, thanks for all your hard work maintaining the specs

I recently processed a Manta VCF that had alt=<DUP:TANDEM>

In the VCFv4.5 spec (also 4.2, and possibly earlier) the symbolic alts are defined as:

DUP Region of elevated copy number relative to the reference, or a tandem duplication breakpoint
DUP:TANDEM Tandem duplication

I am wondering: are there any cases where <DUP> represents a duplication that is not tandem?

If not, is there a reason for having both? A downside of multiple representations for the same molecular event is that tools / users miss that a variant is the same

@davmlaw davmlaw changed the title Clarification of <DUP> vs <DUP:TANDEM> VCF specs - Clarification of <DUP> vs <DUP:TANDEM> Feb 5, 2025
d-cameron (Contributor) commented Feb 6, 2025 via email

d-cameron (Contributor) commented Feb 6, 2025 via email

davmlaw (Author) commented Feb 6, 2025

Thanks. I didn't know about SVCLAIM (though I haven't seen VCF 4.4+ in the wild)

I want to represent VCF variants from many different callers in 1 and only 1 way (database constraint to ensure uniqueness)

It looks like: <DUP> SVCLAIM=J and <DUP:TANDEM> SVCLAIM=J describe the same thing?

So it would be right to normalize the first into the second (being more explicit?)

davmlaw (Author) commented Feb 7, 2025

Actually, thinking about it more, I'm always going to have <= 4.3 VCF records, so there is ambiguity in those <DUP>s. There might be in any SVCLAIM call too (as it's all just best guesses all the way down to the base call)

So if I want to group together records that may be the same, I'd be better off "downcasting" <DUP:TANDEM> to <DUP> to link with plain <DUP>s, as they could be the same event between samples, while keeping the caller's claims (e.g. SVCLAIM or <DUP:TANDEM>) at the level of that sample's call

davmlaw (Author) commented Feb 7, 2025

Imagine you wanted to merge 3 vcf files

A: <DUP>
B: <DUP> SVCLAIM=J
C: <DUP:TANDEM> SVCLAIM=J

I think they should be merged, but you can't do that without losing information as you'd have to throw away SVCLAIM

It seems that SVCLAIM should be a per-sample FORMAT field?
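
One way to picture the "downcast for grouping, keep the claim per call" idea on the database side is sketched below. This is only an illustration under stated assumptions: the `grouping_key` helper, the record layout, and the coordinates are all hypothetical (the A/B/C example gives no positions), and it says nothing about how a VCF merge tool should behave.

```python
from collections import defaultdict


def grouping_key(chrom, pos, end, alt):
    """Collapse a symbolic ALT subtype such as <DUP:TANDEM> to its top-level type <DUP>."""
    if alt.startswith("<") and alt.endswith(">"):
        top_level = alt[1:-1].split(":")[0]
        alt = f"<{top_level}>"
    return (chrom, pos, end, alt)


# Hypothetical calls standing in for files A, B and C; coordinates are invented.
calls = [
    {"source": "A", "chrom": "chr1", "pos": 1000, "end": 2000, "alt": "<DUP>", "svclaim": None},
    {"source": "B", "chrom": "chr1", "pos": 1000, "end": 2000, "alt": "<DUP>", "svclaim": "J"},
    {"source": "C", "chrom": "chr1", "pos": 1000, "end": 2000, "alt": "<DUP:TANDEM>", "svclaim": "J"},
]

groups = defaultdict(list)
for call in calls:
    key = grouping_key(call["chrom"], call["pos"], call["end"], call["alt"])
    # Keep what each caller actually claimed alongside the shared grouping key.
    groups[key].append((call["source"], call["alt"], call["svclaim"]))

for key, members in groups.items():
    print(key, members)
```

All three calls land under one key, while the original ALT and SVCLAIM stay attached to each source, so nothing is thrown away at the grouping step.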

@d-cameron d-cameron added the vcf label Mar 24, 2025
d-cameron (Contributor) commented Mar 24, 2025

> It seems that SVCLAIM should be a per-sample FORMAT field?

It's an INFO field as it's defining the meaning of the ALT field.

> I think they should be merged, but you can't do that without losing information as you'd have to throw away SVCLAIM

If you're merging pre-4.4 records with 4.4 records then the merging code needs to resolve the ambiguity of the <DUP> claim. Generally speaking, that requires knowing the provenance of the VCF (i.e. did it come from an SV caller or a CNV caller?).
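
A minimal sketch of that resolution step, assuming the merging code is told what kind of caller produced each file. The `caller_kind` labels and the `CLAIM_BY_CALLER_KIND` mapping are hypothetical; the `J` (junction) and `D` (depth) values follow my reading of the SVCLAIM definition in VCF 4.4, so check them against the spec before relying on this.

```python
# Assumed mapping from caller kind to claim: junction-based SV callers claim a
# breakpoint junction (J), depth-based CNV callers claim a copy number change (D).
CLAIM_BY_CALLER_KIND = {"sv": "J", "cnv": "D"}


def resolve_svclaim(info, caller_kind):
    """Return the record's SVCLAIM, inferring it from provenance when it is absent."""
    if "SVCLAIM" in info:
        return info["SVCLAIM"]  # the caller stated its claim explicitly
    if caller_kind not in CLAIM_BY_CALLER_KIND:
        raise ValueError(f"cannot resolve SVCLAIM for caller kind {caller_kind!r}")
    return CLAIM_BY_CALLER_KIND[caller_kind]


# A pre-4.4 <DUP> with no SVCLAIM, from a file known to come from an SV caller.
print(resolve_svclaim({"END": 6}, "sv"))
# A record that already carries its claim is left alone.
print(resolve_svclaim({"END": 6, "SVCLAIM": "DJ"}, "cnv"))
```

Raising on an unknown caller kind is deliberate here: guessing the claim silently would reintroduce the ambiguity the merge is trying to remove.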

The use case of being able to support both ambiguous and unambiguous <DEL>/<DUP> records within a single VCF file is not sufficiently compelling so as to justify a change to the definition of the SVCLAIM field.

> I want to represent VCF variants from many different callers in 1 and only 1 way (database constraint to ensure uniqueness)

This is going to be more difficult than you envisage. Take the following haploid haplotype as an example:

REF: ACGTTTACG
ALT: ACGTTTTTTACG

This 3bp tandem duplication haplotype can be represented in VCF by many variants, including the following (columns are CHROM, POS, ID, REF, ALT, then INFO and any FORMAT/sample fields):

chr 1 full_chr_sequence ACGTTTACG ACGTTTTTTACG

chr 4 vrs_normalised TTT TTTTTT

chr 3 left_aligned G GTTT

chr 4 tr_start T TTTT

chr 5 centre_aligned T TTTT

chr 6 right_aligned T TTTT

chr 6 right_aligned_bases_prepended A TTTA

chr 3 symbolic_dup G <DUP> END=6

chr 3 symbolic_dup_tandem G <DUP:TANDEM> END=6

chr 3 symbolic_dup_subtype G <DUP:anything:you:want> END=6

chr 3 symbolic_ins_pos3 G <INS> SVLEN=3

chr 4 symbolic_ins_pos4 T <INS> SVLEN=3

chr 5 symbolic_ins_pos5 T <INS> SVLEN=3

chr 6 symbolic_ins_pos6 T <INS> SVLEN=3

chr 4 breakend_dup_notation_record_A T ]chr:6]T MATEID=breakend_dup_notation_record_B
chr 6 breakend_dup_notation_record_B T T[chr:4[ MATEID=breakend_dup_notation_record_A

chr 3 breakend_ins_A_pos3 G GTTT[chr:4[ MATEID=breakend_ins_B_pos3
chr 4 breakend_ins_B_pos3 T ]chr:3]TTTT MATEID=breakend_ins_A_pos3 

chr 3 breakend_delins_A G GTTTTTT[chr:7[ MATEID=breakend_delins_B 
chr 7 breakend_delins_B A ]chr:3]TTTTTTA MATEID=breakend_delins_A 

chr 3 cnv G <CNV> END=6;CN=2

chr 3 cnv_tr G <CNV:TR> SVLEN=3 GT:RU:RUS:CN 1:6:T:2


chr 3 decomposed_into_smaller_insertions_A A AT
chr 4 decomposed_into_smaller_insertions_B T TT
chr 6 decomposed_into_smaller_insertions_C T TT
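
To make the equivalence concrete, here is a rough sketch that applies a handful of the sequence-resolved records above to REF and checks that each yields the ALT haplotype, then shows that trimming and left-aligning collapses them to a single form. The helper names are hypothetical, only six of the listed records are covered, `normalise` is a simplified trim-and-left-shift routine rather than any particular tool's algorithm, and `expand_tandem_dup` assumes the symbolic `<DUP>` is read as a tandem copy of POS+1..END.

```python
REF = "ACGTTTACG"
ALT = "ACGTTTTTTACG"


def apply_variant(reference, pos, ref, alt):
    """Replace ref with alt at 1-based pos, checking the REF allele first."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError(f"REF allele {ref!r} does not match position {pos}")
    return reference[:start] + alt + reference[start + len(ref):]


def expand_tandem_dup(reference, pos, end):
    """Read <DUP> POS..END as tandem: copy bases POS+1..END right after END."""
    duplicated = reference[pos:end]  # 0-based slice of 1-based POS+1..END
    return reference[:end] + duplicated + reference[end:]


def normalise(reference, pos, ref, alt):
    """Trim shared bases and left-align, keeping one anchor base."""
    changed = True
    while changed:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]  # drop a shared trailing base
            changed = True
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend past the first base")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt  # re-anchor one base to the left
            changed = True
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]  # drop a shared leading base
        pos += 1
    return pos, ref, alt


# (POS, ID, REF, ALT) copied from the list above.
records = [
    (1, "full_chr_sequence", "ACGTTTACG", "ACGTTTTTTACG"),
    (4, "vrs_normalised", "TTT", "TTTTTT"),
    (3, "left_aligned", "G", "GTTT"),
    (4, "tr_start", "T", "TTTT"),
    (5, "centre_aligned", "T", "TTTT"),
    (6, "right_aligned", "T", "TTTT"),
]

for pos, name, ref, alt in records:
    same_haplotype = apply_variant(REF, pos, ref, alt) == ALT
    print(name, same_haplotype, normalise(REF, pos, ref, alt))

print("symbolic_dup", expand_tandem_dup(REF, 3, 6) == ALT)
```

Normalisation of this kind only unifies the sequence-resolved indel forms; the symbolic, breakend, CNV and decomposed representations each need their own handling before they can be compared.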
