VCF4.4 SVLEN requirement across different variant representations #769

@tcezard

I would like confirmation of how SVLEN is defined across different representations of the same variant. I use tandem repeats as the example, but the question applies to other variant types too.

  • A tandem repeat can be represented with its full sequence (no symbolic allele), in which case it is not considered an SV, so SVLEN should be the missing value. (Section 3 - SVLEN: The missing value . should be used for all other ALT alleles, including ALT alleles using breakend notation)
  • A tandem repeat can be represented as the symbolic allele <CNV:TR>, in which case it is an SV and SVLEN represents the length of the reference allele, or 1 if novel. (Section 5.7: The SVLEN of the <CNV:TR> is the length of the reference allele. It is not the length of the <CNV:TR> allele)
  • A tandem repeat can also be represented as the symbolic allele <INS> or <DEL>, in which case SVLEN is the number of bases actually inserted or deleted. (Section 3 - SVLEN: SVLEN is defined for INS, DUP, INV, and DEL symbolic alleles as the number of the inserted, duplicated, inverted, and deleted bases respectively.)
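To make the three readings above concrete, here is a minimal Python sketch of the rule as I understand it. The function name `expected_svlen` and the `event_length` parameter are hypothetical, not part of the spec or of any library, and the logic encodes my interpretation rather than confirmed behaviour.

```python
from typing import Optional

MISSING = None  # stands in for the VCF missing value "."


def expected_svlen(ref: str, alt: str, event_length: Optional[int] = None) -> Optional[int]:
    """Return the SVLEN implied by the three readings listed above.

    event_length is the number of inserted or deleted bases; it is only
    consulted for <INS> and <DEL>, where the caller has to supply it.
    """
    if not (alt.startswith("<") and alt.endswith(">")):
        # Sequence-resolved allele: not symbolic, so SVLEN is missing.
        return MISSING
    sv_type = alt[1:-1]
    if sv_type == "CNV:TR":
        # Section 5.7 reading: length of the reference allele.
        return len(ref)
    if sv_type.split(":")[0] in {"INS", "DEL"}:
        # Section 3 reading: number of inserted or deleted bases.
        if event_length is None:
            raise ValueError("event_length is required for <INS> and <DEL>")
        return event_length
    raise ValueError(f"symbolic allele not covered by this sketch: {alt}")


# Same 20-copy CAG expansion, three representations.
print(expected_svlen("G", "G" + "CAG" * 20))
print(expected_svlen("G", "<CNV:TR>"))
print(expected_svlen("G", "<INS>", event_length=60))
```

The same event yields missing, 1, and 60 depending only on how ALT is written, which is the inconsistency I am asking about.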

Example below:

##fileformat=VCFv4.4
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=CN,Number=A,Type=Float,Description="Copy number of allele">
##INFO=<ID=RN,Number=A,Type=Integer,Description="Total number of repeat sequences in this allele">
##INFO=<ID=RUS,Number=.,Type=String,Description="Repeat unit sequence of the corresponding repeat sequence">
##INFO=<ID=RUL,Number=.,Type=Integer,Description="Repeat unit length of the corresponding repeat sequence">
##INFO=<ID=RUC,Number=.,Type=Float,Description="Repeat unit count of corresponding repeat sequence">
##INFO=<ID=RB,Number=.,Type=Integer,Description="Total number of bases in the corresponding repeat sequence">
##INFO=<ID=RUB,Number=.,Type=Integer,Description="Number of bases in each individual repeat unit">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set">
##ALT=<ID=CNV:TR,Description="Tandem repeat determined based on DNA abundance">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chr1 130 . G GCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG . . . GT:PS 1|0:100
chr1 130 . G <CNV:TR> . . END=130;SVLEN=1;CN=20;RUS=CAG;RN=1;RB=60 GT:PS 1|0:100
chr1 130 . G <INS> . . END=130;SVLEN=60 GT:PS 1|0:100
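For anyone who wants to compare the three records programmatically, here is a small Python sketch that pulls SVLEN out of each one. It assumes tab-separated records with the eight fixed columns in spec order, and it assumes the sequence-resolved ALT is G followed by 20 copies of CAG (consistent with RB=60 in the example). The helper names `parse_info` and `summarize` are made up for illustration.

```python
def parse_info(info: str) -> dict:
    """Split an INFO column into a key -> value dict ('.' means no INFO)."""
    if info == ".":
        return {}
    pairs = (item.split("=", 1) for item in info.split(";") if item)
    # Flag-style keys without '=' are stored as True.
    return {p[0]: (p[1] if len(p) > 1 else True) for p in pairs}


ALT_SEQ = "G" + "CAG" * 20  # assumption: 20 CAG copies after the anchor base

RECORDS = [
    "\t".join(["chr1", "130", ".", "G", ALT_SEQ, ".", ".", ".", "GT:PS", "1|0:100"]),
    "\t".join(["chr1", "130", ".", "G", "<CNV:TR>", ".", ".",
               "END=130;SVLEN=1;CN=20;RUS=CAG;RN=1;RB=60", "GT:PS", "1|0:100"]),
    "\t".join(["chr1", "130", ".", "G", "<INS>", ".", ".",
               "END=130;SVLEN=60", "GT:PS", "1|0:100"]),
]


def summarize(line: str) -> tuple:
    """Return (allele kind, SVLEN as written, net length change if sequence-resolved)."""
    _chrom, _pos, _id, ref, alt, _qual, _filter, info = line.split("\t")[:8]
    svlen = parse_info(info).get("SVLEN", ".")
    symbolic = alt.startswith("<")
    net_change = None if symbolic else len(alt) - len(ref)
    return (alt if symbolic else "sequence", svlen, net_change)


for record in RECORDS:
    print(summarize(record))
```

The sequence-resolved record carries no SVLEN even though its ALT is 60 bases longer than REF, while the two symbolic records report 1 and 60 for the same event.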

Did I interpret the spec correctly?

I see two definitions of SVLEN: the length of the structural variant, or the length of the reference allele. I find this confusing, and I expect I won't be the only one. Was this intended?

Is it necessary to enforce the absence of an SVLEN value for non-symbolic alleles?
