Support for structural variants in vcf_prepper #24

nakib103 · 2025-02-14T10:28:44Z

Support for structural variants

Running in structural variant mode

To run this script in structural variant mode, pass a 4th positional parameter. If it is set, the binary runs in structural variant mode and applies the SV-specific logic.

./vcf_to_bed sv_example.vcf.gz sv_example.bed variation_consequnce_rank.json 1

A pipeline parameter, params.structural_variant, has been added to turn this on at the pipeline level.
Running in structural_variant mode affects the generate_vep_config and vcf_to_bed processes:

`vcf_to_bed`

variant group

The SVs are grouped into 5 types depending on their variant class. The details are in the doc -
https://docs.google.com/spreadsheets/d/1TfvsMBFJFfHZrRIrVkFVfQbAbfRm7LQmTujQPY6BPJw/edit?gid=0#gid=0

A new hashmap has been added for this grouping; it is used in structural variant mode.

This PR addresses 2 more issues related to SVs in vcf_to_bed

If we receive sequence_alteration as the variant class for an SV, we should not try to calculate the variant class (the calculation is based on short variants) but return the class as is.
Calculate the end from INFO/SVLEN and, if that is not available or SVLEN=0, from INFO/END. For insertion and breakend types, the position is updated to paint them as single-point variants, like SNV insertions. (In future, we might need to put the SVLEN information somewhere for them in case the EV design needs it.)
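The end-position rule above can be sketched in Python as follows. This is only an illustration, not the code in vcf_to_bed: the function name `sv_end`, the class names in `POINT_CLASSES`, the `pos + abs(svlen)` arithmetic and the final fallback to the start position are all assumptions made for the example.

```python
from typing import Optional

# Classes painted as a single point; the exact class strings are an assumption.
POINT_CLASSES = {"insertion", "breakend"}


def sv_end(pos: int, variant_class: str,
           svlen: Optional[int] = None, end: Optional[int] = None) -> int:
    """Return the end coordinate to paint for a structural variant."""
    if variant_class in POINT_CLASSES:
        # Insertions and breakends are drawn as single-point variants.
        return pos
    if svlen:
        # SVLEN can be negative (e.g. for deletions), so use its magnitude.
        # A missing SVLEN (None) or SVLEN=0 falls through to INFO/END.
        return pos + abs(svlen)
    if end is not None:
        return end
    # Assumed fallback when neither SVLEN nor END is usable.
    return pos
```

With this sketch, a deletion at position 100 with SVLEN=-50 ends at 150, while an insertion at position 100 keeps 100 as its end regardless of SVLEN.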

`generate_vep_config`

When creating the vep_config INI file for VEP in structural_variant mode, we:

do not add any plugin or frequency for now
reduce --buffer_size to 50 to avoid the cross-contamination error
increase --max_sv_size to 1 billion bp
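As a rough sketch of the three rules above, the config could be assembled like this. The function name `build_vep_config`, its arguments and the whitespace-separated `key value` line layout are assumptions for illustration; only the `buffer_size` and `max_sv_size` values come from this PR.

```python
def build_vep_config(base_options, plugin_lines, frequency_lines,
                     structural_variant=False):
    """Return the text of a VEP config file as 'key value' lines."""
    options = dict(base_options)  # copy so the caller's dict is not modified
    if structural_variant:
        # Smaller buffer to avoid the cross-contamination error.
        options["buffer_size"] = 50
        # Allow SVs of up to 1 billion bp.
        options["max_sv_size"] = 1_000_000_000
    lines = [f"{key} {value}" for key, value in options.items()]
    if not structural_variant:
        # Plugins and frequencies are only added for short variants for now.
        lines += list(plugin_lines) + list(frequency_lines)
    return "\n".join(lines) + "\n"
```

Calling it with `structural_variant=True` drops every plugin and frequency line and overrides any `buffer_size` passed in the base options.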

Handling multiple source files

We already have multiple sources of SVs for a genome. We should not create a symlink for the focus track when there are multiple sources, and the VCF files for different sources should have different names.
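A minimal sketch of that decision is shown below. The function name `plan_outputs` and the file naming scheme are hypothetical and chosen only for the example; the real pipeline may name its files differently.

```python
def plan_outputs(genome, sources):
    """Decide per-source VCF names and whether to symlink the focus track."""
    multiple = len(sources) > 1
    vcf_names = {}
    for source in sources:
        # Hypothetical naming: include the source only when it is needed
        # to keep the file names distinct.
        if multiple:
            vcf_names[source] = f"{genome}_{source}.vcf.gz"
        else:
            vcf_names[source] = f"{genome}.vcf.gz"
    # With several sources there is no single track to point the symlink at.
    return {"vcf_names": vcf_names, "symlink_focus_track": not multiple}
```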

Non-SV related update

The get_db_name function can now match multiple databases, following the introduction of the integrated and partial databases. For example, the pattern below matches both of the following databases -

SHOW DATABASES LIKE 'homo_sapiens_core%110%';
+-----------------------------------+
| Database (homo_sapiens_core%110%) |
+-----------------------------------+
| homo_sapiens_core_110_38          |
| homo_sapiens_core_110_38_partials |
+-----------------------------------+

For now, we just return the first match and post a warning message.
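The "first match plus warning" behaviour could look roughly like the sketch below. The helper name `pick_db_name` is hypothetical, the list of matches stands in for the result of the SHOW DATABASES query, and raising an error when nothing matches is an assumption.

```python
import warnings


def pick_db_name(matches, pattern):
    """Return the first matching database name, warning if there are several."""
    if not matches:
        # Assumed behaviour: fail loudly when nothing matches.
        raise ValueError(f"No database matches {pattern!r}")
    if len(matches) > 1:
        warnings.warn(
            f"Multiple databases match {pattern!r}: {', '.join(matches)}; "
            f"using {matches[0]}"
        )
    return matches[0]
```

For the two databases listed above, this returns homo_sapiens_core_110_38 and emits one warning naming both matches.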

The generate_synonym_file module now adds chr-prefixed names to the synonym file by default (e.g. 1 --> chr1). Previously we had to rely on the core database having those synonyms, but I noticed they are missing in some core databases.
The output file name from VEP has the format __VEP.vcf.gz. If either the genome or the source has a . in its name, the file name would get truncated. We now replace the . with _ when providing nextflow-vep with an output file name.
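Both small fixes can be illustrated with the helpers below. The function names are hypothetical, the synonym pairs are returned as tuples rather than written in the real synonym file layout, and skipping names that already start with chr is an assumption.

```python
def chr_synonyms(names):
    """Pair each sequence name with its chr-prefixed synonym (1 --> chr1)."""
    # Assumed: names that already carry the prefix need no extra synonym.
    return [(name, f"chr{name}") for name in names if not name.startswith("chr")]


def safe_name(name):
    """Replace '.' with '_' so the name survives in an output file name."""
    return name.replace(".", "_")
```

Here `safe_name` would be applied to the genome and source names before they are used in the output file name passed to nextflow-vep.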

Test

Example SV VCF files can be obtained from dbVar -
https://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/vcf/

nakib103 marked this pull request as draft February 14, 2025 10:28

nakib103 marked this pull request as ready for review May 12, 2025 17:44

nakib103 requested a review from likhitha-surapaneni May 14, 2025 13:14

nakib103 assigned likhitha-surapaneni May 14, 2025

nakib103 added 11 commits May 30, 2025 11:23

Support for SV

22775d9

Add end position from SVLEN/END

445d0d3

Consider case where SVLEN=0 and multi-allelic

e6f3765

Add new csq in separate PR

82429b1

Consider cases where SVLEN can be negative

d276517

Add support for different variant grouping of sv

26527d7

Add structural variant parameter

55cfc35

Update insertion and breakend positions

9e9d6b4

Only return the first match of multiple db is found

6bc458d

Modifying the warning message

94e28e5

Do not add plugin or frequency for SV

8c60811

nakib103 force-pushed the sv_support branch from f838edb to 8c60811 Compare May 30, 2025 10:28

nakib103 added 11 commits June 3, 2025 15:04

Buffer size and max sv size update for VEP config

68e22fe

Do not error out if no RAF field in VCF

07e813b

Keep the END and SVLEN from source INFO

a8e9452

Do not symlink if there are multiple source

d40909c

For insertion and breakend types the end is same as start

107ae03

For multiple source update the output VCF file name with source

7b17127

Add synonyms for chr prefix chromsome

e5f2dee

Fix for source or genome with . in name

73b372a

Typo

1636950

Allow non-symbolic SVs

6410fbe

Add the binary

af07e7f

nakib103 changed the title ~~Support for SV in vcf_to_bed~~ Support for structural variants in vcf_prepper Jun 9, 2025

Fix symbolic alts logic

4b00468
