When using the latest version of VFDB via ARIBA getref and prepareref, an error is raised reporting a duplicate entry:
Traceback (most recent call last):
File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/bin/ariba", line 312, in <module>
args.func(args)
File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/tasks/prepareref.py", line 34, in run
preparer.run(options.outdir)
File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/ref_preparer.py", line 186, in run
genetic_code=self.genetic_code,
File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/reference_data.py", line 34, in __init__
self.sequences, self.metadata = ReferenceData._load_input_files_and_check_seq_names(fasta_files, metadata_tsv_files)
File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/reference_data.py", line 136, in _load_input_files_and_check_seq_names
all_seqs = ReferenceData._load_all_fasta_files(fasta_files)
File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/reference_data.py", line 128, in _load_all_fasta_files
ReferenceData._load_fasta_file(filename, seq_dict)
File "/imppc/labs/lslab/jsanchez/conda_package/miniconda3/envs/BacterialTyper_mamba/lib/python3.7/site-packages/ariba/reference_data.py", line 119, in _load_fasta_file
raise Error('Duplicate name "' + seq.id + '" found in file ' + filename + '. Cannot continue)')
ariba.reference_data.Error: Duplicate name "stx2B.VFG000838(gb|WP_000738068).Escherichia_coli_O157:H7_str._EDL933" found in file /imppc/labs/lslab/share/data/references/BacterialTyper_database/ARIBA/vfdb_full/vfdb_full.fa. Cannot continue)
I checked manually and the entry is indeed duplicated.
I used mothur to remove the sequence from the FASTA file. Both copies are removed and I will add a single copy back later. However, when calling mothur, many warnings appeared related to the same issue:
[WARNING]: clbA.VFG049147(gb|WP_001217110).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbD.VFG049150(gb|WP_000982270).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbF.VFG049152(gb|WP_000337350).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbG.VFG049153(gb|WP_000159201).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbL.VFG049158(gb|WP_001297937).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbO.VFG049161(gb|WP_001029878).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: clbQ.VFG049163(gb|WP_000065646).Klebsiella_pneumoniae_subsp._pneumoniae_1084 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: iucB.VFG000616(gb|NP_709455).Shigella_flexneri_2a_str._301 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: sigA.VFG000630(gb|NP_708742).Shigella_flexneri_2a_str._301 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: pic.VFG000635(gb|NP_708747).Shigella_flexneri_2a_str._301 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
**** Exceeded maximum allowed command warnings, silencing warnings ****
[WARNING]: set1B.VFG000636(gb|AAW31739).Shigella_flexneri_2a_str._301 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: stxB.VFG001829(gb|WP_000752026).Shigella_dysenteriae_Sd197 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: hcp1/tssD1.VFG049898(gb|WP_001284199).Shigella_sonnei_Ss046 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: tssL.VFG049905(gb|WP_000343289).Shigella_sonnei_Ss046 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
[WARNING]: ybtU.VFG000361(gb|WP_000982866).Yersinia_pestis_CO92 is in your fasta file more than once. Mothur requires sequence names to be unique. I will only add it once.
Removed 2 sequences from your fasta file.
As the warnings state, mothur requires sequence names to be unique and adds each duplicated name only once.
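As an alternative to mothur, the duplicated records could be dropped with a few lines of Python. This is only a minimal sketch: `dedupe_fasta` is a hypothetical helper, it treats the whole header line (after `>`) as the sequence name, and it assumes duplicated names carry the same sequence, so keeping the first copy is safe. That last assumption should be checked before relying on the output.

```python
def dedupe_fasta(lines):
    """Yield FASTA lines, keeping only the first record seen for each name.

    Assumption: the full header line after '>' is the sequence name, and
    records sharing a name are true duplicates (same sequence).
    """
    seen = set()
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            name = line[1:].strip()
            keep = name not in seen
            seen.add(name)
        if keep:
            yield line


# Example usage (file names are placeholders):
# with open("vfdb_full.fa") as src, open("vfdb_full.dedup.fa", "w") as dst:
#     dst.writelines(dedupe_fasta(src))
```

Unlike the mothur approach described above, this keeps one copy of each duplicated record in a single pass, so nothing has to be added back afterwards.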
The file vfdb_full.tsv also contains duplicated entries that should be discarded: sort file | uniq | wc
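The same clean-up of the TSV can be sketched in Python. `unique_lines` is a hypothetical helper that drops exact repeated lines; unlike `sort | uniq`, it keeps the original line order, which may or may not matter here.

```python
def unique_lines(lines):
    """Return the input lines with exact duplicates removed, order preserved.

    Lines are compared without their trailing newline so that a final line
    lacking one still matches an earlier identical line.
    """
    seen = set()
    result = []
    for line in lines:
        key = line.rstrip("\n")
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result


# Example usage (file names are placeholders):
# with open("vfdb_full.tsv") as src:
#     rows = unique_lines(src)
# with open("vfdb_full.dedup.tsv", "w") as dst:
#     dst.writelines(rows)
```

Comparing the number of lines before and after gives the count of duplicated rows.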
I additionally had to remove the ":" character. Some strains are named inconsistently, which produces mismatched names.
I manually replaced the ":" characters with "_" using sed. I also replaced "|", "(" and ")" with "_" in all cases.
For BacterialTyper, working around this issue would require modifying the script (ariba_caller.py) to include this character fix.
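The character replacement done with sed can also be sketched in Python. `sanitize_name` is a hypothetical helper, not part of ARIBA or BacterialTyper; it replaces each ":", "|", "(" and ")" with "_". The sketch assumes the same function is applied to both the FASTA headers and the name column of the TSV, so that the names in the two files still match.

```python
import re

# Characters reported as problematic in this issue.
_UNSAFE = re.compile(r"[:|()]")


def sanitize_name(name):
    """Replace each ':', '|', '(' and ')' in a sequence name with '_'."""
    return _UNSAFE.sub("_", name)


print(sanitize_name("Escherichia_coli_O127:H6_str"))
# prints: Escherichia_coli_O127_H6_str
```

Note that two names differing only in these characters collapse to the same string after sanitizing, so duplicates should be removed after this step rather than before it.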
An example of the inconsistent strain naming: the same strain appears both as Escherichia_coli_O127_H6_str and as Escherichia_coli_O127:H6_str.
Code location: BacterialTyper/BacterialTyper/scripts/ariba_caller.py, lines 344 to 348 in b617549