To reproduce the results with SpeciesAssignment you need first to convert the BELB gene corpora into the PubTator format.
python -m scripts.convert_belb_to_pubtator --run input --belb_dir ~/data/belb
Download and extract the standalone of GNorm2
Move the files in data/species_assignment/text
into the GNorm2 directory ( e.g. belb_input
).
The to recognize the species mentioned run:
java -Xmx60G -Xms30G -jar GNormPlus.jar belb_input belb_input_SR setup.SR.txt
Move the folder belb_input_SR
into data/species_assignment/text_species
Now we are going to add gene annotations to the files:
python -m scripts.convert_belb_to_pubtator --run append --belb_dir ~/data/belb
Download and extract SpeciesAssignment.
Please follow their instruction on how to setup the environment to run the tool.
Then move the folder data/species_assignment/text_species_gene
into belb_input
in the SpeciesAssignment folder.
Then run:
cd src
python Species_Assignment.py -i ../belb_input/gnormplus_test.PubTator -m ../speass_trained_models/SpeAss-PubmedBERT-SG.h5 -o ../results_gnormplus_test
python Species_Assignment.py -i ../belb_input/nlm_gene_test.PubTator -m ../speass_trained_models/SpeAss-PubmedBERT-SG.h5 -o ../results_nlm_gene_test
And then move the results in data/species_assignment/text_species_gene_assign