Support for structural variants in vcf_prepper #24

Open: wants to merge 23 commits into base: main

nakib103 (Contributor) commented Feb 14, 2025

Support for structural variants

ENSVAR-6684

Running in structural variant mode

To run this script in structural variant mode, pass the new 4th parameter. If it is set, the binary runs in structural variant mode and bases its logic on that.

./vcf_to_bed sv_example.vcf.gz sv_example.bed variation_consequnce_rank.json 1

A pipeline parameter, params.structural_variant, has been added to turn this mode on at the pipeline level.
Running in structural_variant mode affects the generate_vep_config and vcf_to_bed processes:

vcf_to_bed

variant group

The SVs are grouped into 5 types depending on their variant class. The details are in this document:
https://docs.google.com/spreadsheets/d/1TfvsMBFJFfHZrRIrVkFVfQbAbfRm7LQmTujQPY6BPJw/edit?gid=0#gid=0

A new hashmap has been added for this and is used in structural variant mode.

This PR addresses 2 more issues related to SVs in vcf_to_bed:

  • If we receive sequence_alteration as the variant class for an SV, we should not try to calculate the variant class (the calculation is based on short variants) but return the class as is.
  • Calculate the end position from INFO/SVLEN; if SVLEN is not available or SVLEN=0, fall back to INFO/END. For insertion and breakend types, the position is updated so that they are painted as single-point variants, like SNV insertions. (In future, we might need to put the SVLEN information somewhere for them in case the EV design needs it.)
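The end-coordinate rule in the second bullet can be sketched roughly as below. This is only an illustration in Python, not the code in vcf_to_bed: the function name `sv_span`, the dict representation of INFO, the class labels in `POINT_CLASSES`, the `POS + |SVLEN|` arithmetic and the final single-point fallback are all assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical labels for the classes that are painted as a single point.
POINT_CLASSES = {"insertion", "breakend"}


def _to_int(value: Optional[str]) -> Optional[int]:
    """Parse an INFO value as an integer, returning None if missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def sv_span(pos: int, info: Mapping[str, str], variant_class: str) -> Tuple[int, int]:
    """Return (start, end) for one SV record.

    Order of preference: single point for insertion/breakend,
    then INFO/SVLEN (if present and non-zero), then INFO/END.
    """
    if variant_class in POINT_CLASSES:
        return pos, pos

    svlen = _to_int(info.get("SVLEN"))
    if svlen:  # present and non-zero
        # Assumption: end = POS + |SVLEN|; the exact offset is not stated in the PR.
        return pos, pos + abs(svlen)

    end = _to_int(info.get("END"))
    if end is not None:
        return pos, end

    # Assumption: with neither field usable, fall back to a single point.
    return pos, pos
```

For example, `sv_span(100, {"SVLEN": "0", "END": "180"}, "deletion")` ignores the zero SVLEN and uses END instead.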

generate_vep_config

When creating the vep_config INI file for VEP in structural_variant mode, we:

  • do not add any plugin or frequency options for now;
  • reduce --buffer_size to 50 to avoid the cross-contamination error;
  • increase --max_sv_size to 1 billion bp.
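As a rough sketch of how these three adjustments could be applied when building the config lines (the function `build_vep_config`, its arguments and the `flag value` line layout are assumptions for illustration; only `buffer_size` and `max_sv_size` and their values come from the description above):

```python
from typing import Dict, List, Sequence


def build_vep_config(
    base: Dict[str, object],
    plugins: Sequence[str],
    frequency_options: Sequence[str],
    structural_variant: bool,
) -> List[str]:
    """Return config lines of the form 'flag value' for a VEP config file."""
    options = dict(base)  # copy so the caller's dict is not modified

    if structural_variant:
        # Values taken from the PR description.
        options["buffer_size"] = 50
        options["max_sv_size"] = 1_000_000_000

    lines = [f"{flag} {value}" for flag, value in options.items()]

    if not structural_variant:
        # Plugins and frequency options are only added for short variants.
        lines += [f"plugin {plugin}" for plugin in plugins]
        lines += list(frequency_options)

    return lines
```

Calling it with `structural_variant=True` overrides any `buffer_size` in `base` and leaves out every plugin and frequency line.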

Handling multiple source files

  • We already have multiple sources of SVs for a genome. We should not create a symlink for the focus track when there are multiple sources, and the VCF files for different sources should have different names.

Non-SV related update

  • The get_db_name function can now match multiple databases, following the introduction of the integrated and partial databases. For example, it will match both of the following databases:
SHOW DATABASES LIKE 'homo_sapiens_core%110%';
+-----------------------------------+
| Database (homo_sapiens_core%110%) |
+-----------------------------------+
| homo_sapiens_core_110_38          |
| homo_sapiens_core_110_38_partials |
+-----------------------------------+

For now, the function just returns the first one and posts a warning message.
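The "first match plus warning" behaviour could look something like this. It is a minimal sketch rather than the actual get_db_name code; the helper name `pick_db_name` and the use of Python's `warnings` module are assumptions.

```python
import warnings
from typing import Sequence


def pick_db_name(candidates: Sequence[str]) -> str:
    """Return the first matching database name, warning if there were several."""
    if not candidates:
        raise ValueError("no matching database found")
    if len(candidates) > 1:
        warnings.warn(
            f"Multiple databases matched ({', '.join(candidates)}); "
            f"using {candidates[0]}"
        )
    return candidates[0]
```

With the two databases from the query output above, this would return homo_sapiens_core_110_38 and emit one warning.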

  • The generate_synonym_file module now adds chr-prefixed names to the synonym file by default (e.g. 1 --> chr1). Before, we had to rely on the core database to have those synonyms, but I noticed they are missing in some core databases.

  • The output file name from VEP has the format __VEP.vcf.gz. If either the genome or the source has a . in its name, the file name gets truncated. The . is now replaced with _ when providing nextflow-vep with an output file name.
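The last two updates are simple string transformations, sketched here in Python. Both helper names are hypothetical, and skipping names that already start with chr is an assumption, not something the PR states.

```python
from typing import Dict, Iterable, Set


def with_chr_synonyms(
    names: Iterable[str], known: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Add a chr-prefixed synonym for each sequence name (e.g. 1 -> chr1)."""
    merged = {name: set(synonyms) for name, synonyms in known.items()}
    for name in names:
        # Assumption: names that already carry the prefix are left alone.
        if not name.startswith("chr"):
            merged.setdefault(name, set()).add(f"chr{name}")
    return merged


def dots_to_underscores(label: str) -> str:
    """Replace every '.' so the label is safe inside the VEP output file name."""
    return label.replace(".", "_")
```

For instance, `dots_to_underscores("GRCh38.p14")` gives `GRCh38_p14`, which can then be used for the genome or source part of the output file name.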

Test

Example SV VCF files can be obtained from dbVar:
https://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/vcf/

nakib103 marked this pull request as draft February 14, 2025 10:28
nakib103 marked this pull request as ready for review May 12, 2025 17:44
nakib103 changed the title from "Support for SV in vcf_to_bed" to "Support for structural variants in vcf_prepper" Jun 9, 2025