Error in ARIBA VFDB Database #19

Open
JFsanchezherrero opened this issue Jul 7, 2022 · 0 comments
Labels
bug Something isn't working

JFsanchezherrero commented Jul 7, 2022

When using the latest version of VFDB via ARIBA getref and prepareref, an error is raised reporting a duplicate entry:

Traceback (most recent call last):
  File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/bin/ariba", line 312, in <module>
    args.func(args)
  File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/tasks/prepareref.py", line 34, in run
    preparer.run(options.outdir)
  File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/ref_preparer.py", line 186, in run
    genetic_code=self.genetic_code,
  File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/reference_data.py", line 34, in __init__
    self.sequences, self.metadata = ReferenceData._load_input_files_and_check_seq_names(fasta_files, metadata_tsv_files)
  File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/reference_data.py", line 136, in _load_input_files_and_check_seq_names
    all_seqs = ReferenceData._load_all_fasta_files(fasta_files)
  File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/reference_data.py", line 128, in _load_all_fasta_files
    ReferenceData._load_fasta_file(filename, seq_dict)
  File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/reference_data.py", line 119, in _load_fasta_file
    raise Error('Duplicate name "' + seq.id + '" found in file ' + filename + '. Cannot continue)')
ariba.reference_data.Error: Duplicate name "stx2B.VFG000838(gb|WP_000738068).Escherichia_coli_O157:H7_str._EDL933" found in file /imppc/labs/lslab/share/data/references/BacterialTyper_database/ARIBA/vfdb_full/vfdb_full.fa. Cannot continue)

I checked manually and the entry is indeed duplicated in the file.
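The manual check can also be done programmatically by counting header names. The following is only a minimal sketch and is not part of ARIBA or BacterialTyper: the function names are hypothetical, and it assumes the whole header line after ">" is the sequence name, as in the names shown in the traceback.

```python
from collections import Counter


def fasta_names(lines):
    """Yield the name of every FASTA record (header text after '>')."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def duplicated_names(lines):
    """Return {name: count} for every name that appears more than once."""
    counts = Counter(fasta_names(lines))
    return {name: n for name, n in counts.items() if n > 1}


# Small in-memory example; a real file can be passed as an open file handle.
example = [">geneA", "ACGT", ">geneB", "AC", ">geneA", "ACGT"]
print(duplicated_names(example))
```

For the real database, something like `with open("vfdb_full.fa") as fh: duplicated_names(fh)` would list every repeated name at once, instead of finding them one error at a time.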

I used mothur to remove the duplicated fasta sequence. Both copies are removed and I will add a single copy back later. However, when calling mothur, many warnings appeared related to the same issue in other entries:

mothur > set.dir(output=./)
Mothur's directories:
outputDir=/imppc/labs/lslab/share/data/references/BacterialTyper_database/ARIBA/vfdb_full/

mothur > remove.seqs(accnos=id2remove.txt, fasta=vfdb_full.fa)

[WARNING]: clbA.VFG049147(gb|WP_001217110).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbD.VFG049150(gb|WP_000982270).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbF.VFG049152(gb|WP_000337350).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbG.VFG049153(gb|WP_000159201).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbL.VFG049158(gb|WP_001297937).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbO.VFG049161(gb|WP_001029878).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbQ.VFG049163(gb|WP_000065646).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: iucB.VFG000616(gb|NP_709455).Shigella_flexneri_2a_str._301 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: sigA.VFG000630(gb|NP_708742).Shigella_flexneri_2a_str._301 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: pic.VFG000635(gb|NP_708747).Shigella_flexneri_2a_str._301 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.

**** Exceeded maximum allowed command warnings, silencing warnings ****
[WARNING]: set1B.VFG000636(gb|AAW31739).Shigella_flexneri_2a_str._301 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: stxB.VFG001829(gb|WP_000752026).Shigella_dysenteriae_Sd197 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: hcp1/tssD1.VFG049898(gb|WP_001284199).Shigella_sonnei_Ss046 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: tssL.VFG049905(gb|WP_000343289).Shigella_sonnei_Ss046 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: ybtU.VFG000361(gb|WP_000982866).Yersinia_pestis_CO92 is in your fasta file more than once.  Mothur requires sequence names to be unique. I will only add it once.
Removed 2 sequences from your fasta file.

As the warnings state, mothur requires sequence names to be unique and only adds each duplicated name once.

The file vfdb_full.tsv also contains duplicated entries that should be discarded (checked with: sort file | uniq | wc).
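A Python equivalent of the sort and uniq step is sketched below. It is an illustration only; `unique_lines` is a hypothetical helper, and unlike `sort | uniq` it keeps the original line order, which is assumed to be harmless for the metadata file.

```python
def unique_lines(lines):
    """Drop exact duplicate lines, keeping the first occurrence of each.

    Trailing newlines are stripped first so that a final line without a
    newline still matches an earlier identical line.
    """
    stripped = (line.rstrip("\n") for line in lines)
    # dict preserves insertion order, so this de-duplicates in place order.
    return list(dict.fromkeys(stripped))


rows = ["geneA\t1\t0\n", "geneB\t1\t0\n", "geneA\t1\t0"]
print(unique_lines(rows))
```

Only lines that are identical in every column are collapsed here; rows that share a name but differ elsewhere are kept and would still need a manual look.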

I additionally had to remove ":" characters. Some strains are named differently, generating inconsistencies between names:

Escherichia_coli_O127_H6_str
Escherichia_coli_O127:H6_str

I manually replaced the ":" characters with "_" using sed. I additionally replaced "|", "(" and ")" with "_" in all cases.
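The same replacement can be sketched in Python with `str.translate`. The helper names below are hypothetical, and the sketch assumes the sequence name is the first tab-separated column of the metadata file, so that the fasta and tsv names stay in sync.

```python
# Map each problematic character to "_" (same set as the sed replacement).
_UNSAFE = str.maketrans({c: "_" for c in ":|()"})


def sanitize_name(name):
    """Replace ':', '|', '(' and ')' with '_' in a sequence name."""
    return name.translate(_UNSAFE)


def sanitize_fasta_line(line):
    """Sanitize header lines only; sequence lines are returned unchanged."""
    if line.startswith(">"):
        return ">" + sanitize_name(line[1:])
    return line


def sanitize_tsv_line(line):
    """Sanitize only the first column (assumed to hold the sequence name)."""
    name, sep, rest = line.partition("\t")
    return sanitize_name(name) + sep + rest


print(sanitize_name("Escherichia_coli_O127:H6_str"))
```

Note that after this step the two spellings shown above become the same name, so any duplicate check should run after sanitizing, not before.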

In the case of BacterialTyper, fixing this issue would require modifying the script ariba_caller.py so that it cleans these characters before calling prepareref:

def ariba_prepareref(fasta, metadata, outfolder, threads):
    ## prepareref
    cmd_prepareref = 'ariba prepareref -f %s -m %s %s --threads %s' %(fasta, metadata, outfolder, threads)
    return(HCGB_sys.system_call(cmd_prepareref))
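One possible shape for that fix is a cleanup step that runs before the command is built. The sketch below is an assumption about how it could look, not the actual BacterialTyper change: `clean_reference` and the output file names `cleaned.fa` and `cleaned.tsv` are hypothetical, and it keeps the first record for each name rather than removing both copies as done with mothur above.

```python
import os

# Characters replaced by "_" in sequence names.
_UNSAFE = str.maketrans({c: "_" for c in ":|()"})


def clean_reference(fasta_in, metadata_in, outdir):
    """Write sanitized, de-duplicated copies of the fasta and metadata files.

    Returns the paths of the two cleaned files (hypothetical names).
    """
    fasta_out = os.path.join(outdir, "cleaned.fa")
    metadata_out = os.path.join(outdir, "cleaned.tsv")

    seen_names, keep = set(), False
    with open(fasta_in) as fin, open(fasta_out, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                name = line[1:].strip().translate(_UNSAFE)
                # Keep only the first record seen for each sanitized name.
                keep = name not in seen_names
                seen_names.add(name)
                line = ">" + name + "\n"
            if keep:
                fout.write(line)

    seen_rows = set()
    with open(metadata_in) as fin, open(metadata_out, "w") as fout:
        for line in fin:
            # Only the first column (assumed to be the name) is sanitized.
            name, sep, rest = line.rstrip("\n").partition("\t")
            row = name.translate(_UNSAFE) + sep + rest
            if row not in seen_rows:
                seen_rows.add(row)
                fout.write(row + "\n")

    return fasta_out, metadata_out
```

With something like this in place, ariba_prepareref could call `clean_reference(fasta, metadata, some_dir)` first and pass the returned paths to the existing command. A later record that shares a name with an earlier one is dropped silently even if its sequence differs, so it may be worth logging those cases.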
