
Enhancements in vfdb_parser.py for VFDB full dataset support #320

Open
wants to merge 4 commits into
base: master


lknegendorf

Currently, after downloading the full VFDB dataset with the getref vfdb_full (...) command, it is not possible to proceed with ariba prepareref (...) using the resulting reference data without manual changes to both the .fa and the .tsv files. This is because the reference dataset contains several pitfalls that are not yet addressed:

  • Duplicate sequence IDs, which raise errors in ariba prepareref.
  • Sequences with stop codons, which are filtered out. (The metadata.tsv file created by vfdb_parser currently declares every sequence in the dataset as a gene.)
  • Gene symbols containing brackets or blank spaces, so the intended naming fails for some sequences, which complicates the creation of meaningful cluster names.

The modifications proposed here address all of the shortcomings mentioned above.
Furthermore, the xls-derived metadata from VFDB describing the function and mechanism of the respective virulence gene (VFs.xls.gz, see the VFs description file on the VFDB download page) is included in the metadata.tsv produced by vfdb_parser, giving a more comprehensive view of the ariba variant calling results when working with VFDB.

Thank you for considering this for merging in a future release.

The change to the <vfdb_id> group fixes the issue with VFDB gene IDs that carry no GenBank accession.
The change to the <name> group fixes the issue with gene names that contain whitespace characters or brackets (e.g. `cryIA(a)`).
Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-14: VFdbParser._fa_header_to_name_pieces did not return any None values.
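
To illustrate the kind of header parsing described above, here is a minimal sketch. The pattern `HEADER_RE`, the function `header_to_pieces`, the `accession` group and the example headers are all made up for illustration; this is not the regular expression used in vfdb_parser.py, only one way to make the accession optional and to let the name contain brackets or spaces.

```python
import re

# Hypothetical pattern (an assumption, not the one in vfdb_parser.py):
# - the "(gb|...)" accession after the ID is optional
# - <name> is everything between "(" and the first ") ", so inner
#   brackets and spaces are kept
HEADER_RE = re.compile(
    r"^(?P<vfdb_id>\S+?)(?:\((?P<accession>[^()]*)\))?"
    r" \((?P<name>.+?)\) (?P<description>.*) \[(?P<genus_etc>[^\[\]]*)\]$"
)


def header_to_pieces(header):
    """Return the named groups as a dict, or None if the header does not match."""
    match = HEADER_RE.match(header)
    return match.groupdict() if match else None


# Illustrative headers only, written in the general shape of VFDB headers.
with_accession = header_to_pieces(
    "VFG000676(gb|AAD32411) (lef) anthrax toxin lethal factor precursor "
    "[Anthrax toxin (VF0142)] [Bacillus anthracis str. Sterne]"
)
without_accession = header_to_pieces(
    "VFG012345 (cryIA(a)) some toxin [Toxin X (VF9999)] [Bacillus thuringiensis]"
)
```

Returning None for a non-matching header mirrors the validation note above, where the absence of None values is used as the success criterion.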
Newly implemented functions: these extract VFIDs from the <description> part of the seq.id in the VFDB .fa file, download the VFs.xls.gz file from VFDB, and link the VFIDs to it to create a metadata file with more information.
Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-17.
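
The VFID linking could look roughly like the sketch below. It assumes the VFID appears in the description as `(VF` followed by digits and `)`, and it uses a plain dict as a stand-in for the parsed VFs.xls.gz table. The names `extract_vfid` and `annotate` and the keys `function` and `mechanism` are hypothetical and not taken from the PR.

```python
import re

# Assumption: a VFID looks like "VF" plus digits, wrapped in parentheses.
VFID_RE = re.compile(r"\((VF\d+)\)")


def extract_vfid(description):
    """Return the first VFID found in a description, or None."""
    match = VFID_RE.search(description)
    return match.group(1) if match else None


def annotate(description, vf_table):
    """Look up extra metadata for the VFID in a description.

    vf_table is a stand-in for the parsed VFs.xls.gz content:
    {vfid: {"function": ..., "mechanism": ...}} (hypothetical keys).
    """
    vfid = extract_vfid(description)
    extra = vf_table.get(vfid, {})
    return {
        "vfid": vfid,
        "function": extra.get("function", ""),
        "mechanism": extra.get("mechanism", ""),
    }


# Made-up table content for illustration.
table = {"VF0142": {"function": "example function", "mechanism": "example mechanism"}}
row = annotate("anthrax toxin lethal factor precursor [Anthrax toxin (VF0142)]", table)
```

Sequences whose VFID is missing from the table still get a row with empty fields, so the metadata file stays complete.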
Manual changes to the resulting .fa file are still needed, as VFDB contains duplicates.
Included a list-based filter for duplicate sequence IDs in the downloaded VFDB FASTA file. As a consequence, `ariba prepareref` can be run after executing vfdb_parser without manually deleting duplicate entries.
Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-18. `ariba prepareref` no longer raises an error because of duplicate seq.id values (although 1254 sequences are still filtered out because they are not recognized as genes).
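
A duplicate filter of this kind can be sketched as follows. The PR describes a list-based filter; this sketch uses a set for the membership test instead, which is a swapped-in choice and not the PR's code. The function name `drop_duplicate_ids` and the `(seq_id, sequence)` tuple layout are assumptions made for illustration.

```python
def drop_duplicate_ids(records):
    """Keep only the first record for each sequence ID, preserving order.

    records: iterable of (seq_id, sequence) tuples (hypothetical layout).
    """
    seen = set()
    kept = []
    for seq_id, seq in records:
        if seq_id in seen:
            continue  # a later entry with an already-seen ID is skipped
        seen.add(seq_id)
        kept.append((seq_id, seq))
    return kept


# Made-up records for illustration.
unique = drop_duplicate_ids([("a", "ATG"), ("b", "ATGAAA"), ("a", "ATGCCC")])
```

Keeping the first occurrence is one possible policy; whether the first or last duplicate should win depends on how the duplicates in the dataset differ.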
Included a check of whether a sequence can be translated, using methods from pyfastaq. If a sequence cannot be translated, it is declared as non-coding in the resulting metadata file, so `ariba prepareref` can process it without filtering it out.
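
As a rough illustration of such a check, the sketch below classifies a sequence without pyfastaq. It is a simplified stand-in and not the check used in the PR or in ARIBA: it only requires a length that is a multiple of three and no stop codon before the final codon. The name `looks_coding` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_coding(seq):
    """Simplified test: whole codons only and no internal stop codon."""
    seq = seq.upper()
    if len(seq) < 3 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # A stop codon is tolerated only as the very last codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


# Made-up sequences for illustration.
labels = {
    name: "coding" if looks_coding(seq) else "non-coding"
    for name, seq in [("ok", "ATGAAATAA"), ("internal_stop", "ATGTAAAAATAA")]
}
```

Labelling a sequence rather than dropping it matches the intent described above: non-translatable entries stay in the reference set and are simply marked as non-coding.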
Included a function that reports the maximum sequence length, giving advice on the choice of parameters for further processing.
Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-19. `ariba prepareref` does not remove any sequence from the dataset (when run with the parameters advised by vfdb_parser).
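
Reporting the maximum length is straightforward; a minimal sketch is shown below. The function name `max_sequence_length` and the `(seq_id, sequence)` tuple layout are assumptions, and how the reported value maps onto `ariba prepareref` parameters is left open here.

```python
def max_sequence_length(records):
    """Return the length of the longest sequence, or 0 for an empty input.

    records: iterable of (seq_id, sequence) tuples (hypothetical layout).
    """
    return max((len(seq) for _, seq in records), default=0)


# Made-up records for illustration.
longest = max_sequence_length([("a", "ATG"), ("b", "ATGAAATAA")])
```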