Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach
NEW 09/2024 - The paper has been accepted for publication in Computational Linguistics: see the published article.
NEW 21/03/2024 - Test the new BioMistral fine-tuned model.
Code for the article "Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach"
The sparsity of labelled data is an obstacle to the development of Relation Extraction models and the completion of databases in various biomedical areas. While being of high interest in drug discovery, the natural-products literature, reporting the identification of potential bioactive compounds from organisms, is a concrete example of such an overlooked topic. To mark the start of this new task, we created the first curated evaluation dataset and extracted literature items from the LOTUS database to build training sets. To this end, we developed a new sampler inspired by diversity metrics in ecology, named Greedy Maximum Entropy sampler, or GME-sampler. The strategic optimization of both balance and diversity of the selected items in the evaluation set is important given the resource-intensive nature of manual curation. After quantifying the noise in the training set, in the form of discrepancies between the text of the input abstracts and the expected output labels, we explored different strategies accordingly. Framing the task as end-to-end Relation Extraction, we evaluated the performance of standard fine-tuning as a generative task (BioGPT, GPT-2 and Seq2rel) and of few-shot learning with open Large Language Models (LLaMA 7B-65B). Interestingly, the training sets built with the GME-sampler also exhibit a tendency to tip the precision-recall trade-off of trained models in favour of recall. In addition to their evaluation in few-shot settings, we explore the potential of open LLMs (Vicuna-13B) as synthetic data generators and propose a new workflow for this purpose. All evaluated models exhibited substantial improvements when fine-tuned on synthetic abstracts rather than the original noisy data. We provide our best performing (F1-score = 59.0) BioGPT-Large model for end-to-end RE of natural-products relationships along with all the training and evaluation datasets.
Test our BioGPT-Large model fine-tuned on Diversity-synt.
Dataset | Description | Zenodo |
---|---|---|
Synthetic datasets Vicuna-13B-1.3 | Synthetic datasets (training/validation) for end-to-end Relation Extraction of relationships between Organisms and Natural-Products: Diversity-synt, Random-synt, Extended-synt. This dataset is used in the corresponding article | |
(06/12/2023) Synthetic datasets Vicuna-13B-1.5 | A synthetic dataset created from the top-1000 (per biological kingdoms) LOTUS literature references extracted with the GME-sampler. The new dataset was generated using Vicuna-13b-v1.5, derived from LLaMA 2. | |
(21/04/2024) Synthetic datasets Mixtral-8x7B-Instruct | A synthetic dataset created from the top-1000 (per biological kingdoms) LOTUS literature references extracted with the GME-sampler. The new dataset was generated using Mixtral-8x7B-Instruct-v0.1. | |
Evaluation dataset | A curated evaluation dataset for end-to-end Relation Extraction of relationships between organisms and natural products. | |
Model | Description | 🤗 Hub |
---|---|---|
biogpt-Natural-Products-RE-Diversity-synt-v1.0 | The model is derived from microsoft/biogpt and was trained on Diversity-synt. | link |
biogpt-Natural-Products-RE-Extended-synt-v1.0 | The model is derived from microsoft/biogpt and was trained on Extended-synt. | link |
BioGPT-Large-Natural-Products-RE-Diversity-synt-v1.0 | The model is derived from microsoft/BioGPT-Large and was trained on Diversity-synt. | link |
BioGPT-Large-Natural-Products-RE-Extended-synt-v1.0 | The model is derived from microsoft/BioGPT-Large and was trained on Extended-synt. | link |
(06/12/2023) BioGPT-Large-Natural-Products-RE-Diversity-1000-synt-v1.1 | The model is derived from microsoft/BioGPT-Large and was trained on the new synthetic dataset generated with Vicuna-13b-v1.5. | link |
NEW 21/03/2024 BioMistral-7B-Natural-Products-RE-Diversity-1000-synt-v1.2 | The model is derived from BioMistral and was fine-tuned on a new synthetic dataset produced with Mixtral-8x7B-Instruct. | link |
- Table of contents
- Dataset pre-processing, extraction and formatting
- Synthetic data generation
- Fine-tuning
- Seq2rel
- Few-shot learning
- Citation
conda env create -f env/dataset-creator.yml
The original snapshot of the LOTUS database (v.10 - Jan 6, 2023) used in this work is available at .
The LOTUS dataset used is licensed under CC BY 4.0. See the related article and the website for more details.
Secondly, we extracted the available links between DOI and PubMed PMID identifiers using a SPARQL query on Wikidata. The corresponding result file is provided at data/raw/wikidata-doi2pmid.tsv.
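For reference, this kind of mapping can be reproduced with a short script against the Wikidata SPARQL endpoint. The sketch below is only an illustration (it uses the Wikidata properties P356 for DOI and P698 for PubMed ID); the exact query used to produce the provided file may differ.

```python
# Minimal sketch: query Wikidata for DOI <-> PMID pairs
# (illustrative; may differ from the query used to build data/raw/wikidata-doi2pmid.tsv).
import requests

SPARQL = """
SELECT ?doi ?pmid WHERE {
  ?work wdt:P356 ?doi ;   # P356 = DOI
        wdt:P698 ?pmid .  # P698 = PubMed ID
}
LIMIT 100
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "doi2pmid-example/0.1"},
    timeout=60,
)
r.raise_for_status()
for binding in r.json()["results"]["bindings"]:
    print(f'{binding["doi"]["value"]}\t{binding["pmid"]["value"]}')
```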
To preprocess the data, use:
python app/dataset-creation/preprocessing.py \
--lotus-data="/path/to/230106_frozen_metadata.csv" \
--doi2pmid="data/raw/wikidata-doi2pmid.tsv" \
--out-dir="/path/to/output-dir" \
--max-rel-per-ref=20 \
--chemical-name-max-len=60
In this step, we applied the following filters (a short pandas sketch of these filters follows the list):

- Duplicate filtering.
- Only references for which a PMID was retrieved (from wikidata-doi2pmid.tsv) were kept.
- All references reporting more than $k$ relations were excluded (we chose $k=20$).
- All associations involving a chemical with a name longer than 60 characters (likely to be IUPAC-like) were excluded.
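For illustration only, the filters above could be expressed with pandas as in the sketch below; the column names (reference_doi, organism_name, structure_name) are assumptions and may not match the actual columns handled by preprocessing.py.

```python
# Hypothetical sketch of the preprocessing filters (column names are assumed,
# not necessarily those used by preprocessing.py).
import pandas as pd

df = pd.read_csv("230106_frozen_metadata.csv", low_memory=False)
doi2pmid = pd.read_csv("data/raw/wikidata-doi2pmid.tsv", sep="\t")

# 1. Drop duplicated (reference, organism, chemical) associations.
df = df.drop_duplicates(subset=["reference_doi", "organism_name", "structure_name"])

# 2. Keep only references with a mapped PMID.
df = df.merge(doi2pmid, left_on="reference_doi", right_on="doi", how="inner")

# 3. Exclude references reporting more than k = 20 relations.
rel_counts = df.groupby("reference_doi")["structure_name"].transform("size")
df = df[rel_counts <= 20]

# 4. Exclude chemicals whose name is longer than 60 characters (IUPAC-like).
df = df[df["structure_name"].str.len() <= 60]
```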
Then, the GME-sampler can be applied on the pre-processed dataset to extract the most diverse set of literature references.
In the output directory, a sub-directory named entropies is created to store the entropy metrics.
This script can alternatively be used with other sampling strategies (--sampler argument): "random", "topn_rel", "topn_struct", "topn_org", "topn_sum".
python app/dataset-creation/create_lotus_dataset.py \
--lotus-data="/path/to/processed_lotus_data.tsv" \
--out-dir="path/to/output/dir" \
-N=-1 \
--use-freq \
--n-workers=6 \
--sampler="GME_dutopia"
To fetch the abstracts from PubMed and format the data, use:
python app/dataset-creation/get_abstract_and_format.py \
--input="path/to/dataset.tsv" \
--ncbi-apikey="xxx" \
--chunk-size=100 \
--out-dir="path/to/output/dir"
To speed up the fetching process, you may want to provide an NCBI API key.
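A single abstract can also be fetched manually through the NCBI E-utilities; the snippet below is a minimal sketch for checking connectivity and is not what get_abstract_and_format.py does internally (the script fetches in chunks).

```python
# Minimal sketch: fetch one PubMed abstract via NCBI E-utilities (efetch).
import requests

def fetch_abstract(pmid, api_key=None):
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    if api_key:
        params["api_key"] = api_key  # raises the rate limit from 3 to 10 requests/s
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
        params=params,
        timeout=30,
    )
    r.raise_for_status()
    return r.text

print(fetch_abstract("123456"))  # replace with a real PMID
```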
This step is optional, but will provide useful data for building the exclusion list used in the section Instructions generation.
Synonym extraction can be done manually in 3 simple steps:

- Extract the full list of distinct PubChem CID identifiers.
- Go to the PubChem Identifier Exchange Service:
  - Select CIDs as Input ID List.
  - Select Same, Isotopes as Operator Type. "Same, Isotopes" corresponds to "same isotopes and connectivity but different stereochemistry". Since the harmonization step could have failed to map the correct stereoisomers because of incomplete information, it is better to extract all the synonyms of the corresponding CID.
  - Select Synonyms as Output IDs.
- Save the result in a tabular file (e.g. CID2synonyms.txt), which will be used to create the exclusion list (a loading sketch follows).
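The exported file can then be loaded into a simple mapping. The sketch below assumes a two-column, tab-separated CID/synonym layout, which is the default output of the PubChem Identifier Exchange Service; adapt it if your export differs.

```python
# Sketch: load the CID -> synonyms mapping from the exported tabular file
# (assumes two tab-separated columns: CID, synonym).
from collections import defaultdict

cid2synonyms = defaultdict(set)
with open("CID2synonyms.txt") as fh:
    for line in fh:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[1]:
            cid, synonym = parts
            cid2synonyms[cid].add(synonym.lower())

print(len(cid2synonyms), "CIDs with synonyms loaded")
```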
There are two ways of splitting the built dataset, controlled by --split-mode: entropy and std (i.e. standard).

- entropy: the top-n (--top-n) DOIs associated with the maximum entropy (diversity) are selected for the test set. In this case, both --entropy-dir and --top-n must be provided. The proportion in which the remaining set is split between the training set and the validation set should be indicated with --split-prop. For instance, use "90:10:0" to indicate that 90% of the remaining set will be kept for training and 10% for validation.
- std: a classic random split is applied according to the proportions expressed in --split-prop (e.g. 80:10:10 by default).
python app/dataset-creation/split_dataset.py \
--dataset="path/to/dataset.json" \
--out-dir="path/to/output/dir" \
--split-mode="entropy" \
--entropy-dir="Path/to/entropies" \
--top-n=50 \
--split-prop="90:10:0"
For generating synthetic abstracts at scale, we strongly recommend splitting the dataset into several batches. Several utilities are available for this purpose; see split_json and merge_json in app/synthetic-data-generation/general_helpers.py
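If you prefer a standalone snippet over the bundled helpers, batching can be as simple as the sketch below; it assumes the dataset is a JSON list, and the exact signatures of split_json/merge_json in the repository may differ.

```python
# Sketch: split a JSON list into n files of roughly equal size for batched generation
# (assumes the dataset is a JSON list; not the repository's split_json helper).
import json
from pathlib import Path

def split_json_file(path, n_batches, out_dir):
    data = json.loads(Path(path).read_text())
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(n_batches):
        batch = data[i::n_batches]
        (out_dir / f"batch_{i}.json").write_text(json.dumps(batch, indent=2))

split_json_file("path/to/dataset.json", n_batches=4, out_dir="path/to/batches")
```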
- Step 1: Create the base environment with all the external dependencies.
conda env create -f env/llm.yml
Tip: use mamba instead of conda for a faster build!
- Step 2: On this base env, install and compile llama.cpp and install llama-cpp-python. In both cases, we (strongly) recommend using the CuBLAS build for GPU acceleration. However, depending on your environment, you may need to tweak the install settings (particularly CUDA). For a CuBLAS install of llama-cpp-python, use:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Experiments presented in the article were carried out using version 0.1.78 of llama-cpp-python. However, in the meantime, an update of llama.cpp changed the file format of the models from ggml to gguf. Experiments were successfully reproduced with the more recent version 0.2.11.
- Warning: check that the llama.cpp build is compatible with llama-cpp-python, as llama.cpp will be used to convert and quantize the models in ggml/gguf format.
For details about models' quantization, please refer to the documentation of llama.cpp.
For instance, in the llama.cpp directory:
# convert the model to ggml FP16 format by providing the path to the directory containing the weights
python3 convert.py /path/to/model/weights/dir --outfile="models/ggml-model-f16.gguf"
./quantize ./models/ggml-model-f16.gguf ./models/ggml-model-q8_0.gguf Q8_0
We used Q8_0 quantization for smaller models, while we applied Q5_K_M for bigger models (e.g. LLaMA 30B and 65B).
In the paper, we used Vicuna-13b-v1.3. It was quantized using llama.cpp at master-bbca06e, and we used llama-cpp-python v. 0.1.78.
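For a quick sanity check of a quantized model with llama-cpp-python, something along the lines of the sketch below should work; the path, prompt and argument values are purely illustrative (n_gpu_layers controls how many layers are offloaded to the GPU with the CuBLAS build).

```python
# Sketch: load a quantized (ggml/gguf) model with llama-cpp-python and generate.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/ggml-or-gguf/model/file",  # e.g. the quantized Vicuna-13B
    n_ctx=2048,       # context window
    n_gpu_layers=40,  # illustrative value; layers offloaded to the GPU
    n_threads=6,
)
out = llm(
    "List three keyphrases describing natural products chemistry:",
    max_tokens=64,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```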
To more effectively control the expression of NP relationships during the generation step, it is necessary to explicitly formalize the anticipated patterns in upstream instructions. The LLM is first used to extract a list of keyphrases providing a biomedical context (filtered by an exclusion list), and patterns of expression are then sampled.
The following code will generate a handful of instructions as an example.
# set up
conda activate llm
# vars
LAUNCHPATH="./app/synthetic-data-generation"
INPUT="data/examples/example-input.json"
SYNPATH="data/examples/example-CID2synonyms.txt"
CLASSPATH="data/extras/mapping_np_classifier.tsv"
MODEL="/path/ggmlorgguf/model/file"
OUTPATH="output/examples/instructions"
CACHEPATH="output/examples/instructions/instructions-cache"
mkdir -p $OUTPATH
mkdir -p $CACHEPATH
python $LAUNCHPATH/run_abstract_instruction.py --input-file=$INPUT \
--chemical-classes=$CLASSPATH \
--method="llama.cpp" \
--out-dir=$OUTPATH \
--use-pubtator \
--model-path-or-name=$MODEL \
--cache-dir=$CACHEPATH \
--path-to-synonyms=$SYNPATH \
--top-n-keywords=10 \
--m-prompts-per-item=10
In the provided settings, the top 10 keywords/keyphrases (--top-n-keywords) of each original seed abstract are first extracted by prompting the LLM at several default temperatures.
You can also use the arguments --m-threads and --m-n-gpu to set the number of threads and the number of layers offloaded to the GPU. For Vicuna-13B, all layers are offloaded by default.
More details with:
python $LAUNCHPATH/run_abstract_instruction.py --help
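Schematically, the keyword extraction step amounts to querying the model several times at increasing temperatures and pooling the answers, as in the sketch below. The prompt wording, temperature values and output parsing are illustrative, not the exact ones used by run_abstract_instruction.py; the llm object is the one created in the loading sketch above.

```python
# Illustrative multi-temperature keyphrase extraction (not the script's actual prompt).
def extract_keywords(llm, abstract, temperatures=(0.2, 0.5, 0.8), top_n=10):
    prompt = f"Extract the main keyphrases of the following abstract:\n{abstract}\nKeyphrases:"
    keywords = []
    for t in temperatures:
        out = llm(prompt, max_tokens=128, temperature=t)
        for kw in out["choices"][0]["text"].split(","):
            kw = kw.strip().lower()
            if kw and kw not in keywords:
                keywords.append(kw)
    return keywords[:top_n]
```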
In this step, each of the previously generated prompts is fed into the LLM to generate synthetic abstracts. The created instructions have the following form:
Instructions: Given a title, a list of keywords and main findings, create an abstract for a scientific article.
Title: Metabolites with nematicidal and antimicrobial activities from the ascomycete Lachnum papyraceum (Karst.) Karst. III. Production of novel isocoumarin derivatives, isolation, and biological activities.
Keywords: nematicidal activities, cytotoxic activities, secondary metabolism, cabr2, phytotoxic activities, antimicrobial activities, bromide-containing culture media, phytotoxic, cytotoxic, weak antimicrobial.
Main findings: Five Coumarins, two Miscellaneous polyketides and two Sesquiterpenoids were isolated from Lachnum papyraceum.
Only the top-3 best generations are extracted by the selector module.
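At its core, such a selection boils down to checking that each generated abstract actually expresses the expected entities. The sketch below shows one rough, hypothetical way to score and rank generations; the actual selector module may use different or additional criteria.

```python
# Rough, hypothetical sketch of ranking generations by coverage of the expected entities.
def coverage_score(generated_abstract, expected_entities):
    text = generated_abstract.lower()
    found = sum(1 for entity in expected_entities if entity.lower() in text)
    return found / len(expected_entities) if expected_entities else 0.0

def select_top_generations(generations, expected_entities, k=3):
    ranked = sorted(
        generations,
        key=lambda g: coverage_score(g, expected_entities),
        reverse=True,
    )
    return ranked[:k]
```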
With the previously generated instruction examples, the following code will generate synthetic abstracts:
LAUNCHPATH="./app/synthetic-data-generation"
INPUT="output/examples/instructions/prompts_example-input.json"
MODEL="/path/ggmlorgguf/model/file"
OUTPATH="output/examples/generations"
mkdir -p $OUTPATH
python $LAUNCHPATH/run_abstract_generation.py \
--model=$MODEL \
--input-file=$INPUT \
--out-dir=$OUTPATH \
-N=3
- Again, you can also use the arguments --m-threads and --m-n-gpu to set the number of threads and the number of layers offloaded to the GPU.
The following section covers the code for the hyperparameter tuning and fine-tuning of BioGPT and the GPT-2 baseline.
conda env create -f env/qlora.yml
Warning: if the installation of bitsandbytes fails, you may need to install it from source. See for instance here or here, and check the variables BNB_CUDA_VERSION, CUDA, and LD_LIBRARY_PATH.
In the article, we performed the tuning with 80 trials, using the median pruner, on the Diversity-synt dataset (see the corresponding dataset on Zenodo here).
In the following example, we only use the previously generated abstracts and a subset of our validation dataset. Also, only a few trials are run (3 in the example below).
Optuna supports parallelization during hyperparameter tuning. The --tag argument is used to distinguish different runs. You can also visualize the tuning through the optuna-dashboard.
In these examples, the training set is too small and training will only lead to null performance; they are meant for illustration only.
conda activate qlora
LAUNCHPATH="./app/biogpt-lora"
VALID_PATH="data/examples-validation/sub_valid.json"
TRAIN_PATH="output/examples/generations/out_prompts_example-input.json"
MODEL_HF="microsoft/biogpt"
N_TRIAL=3
OUTPUT_DIR="output/examples/fine-tuning/hp-tuning/biogpt"
mkdir -p $OUTPUT_DIR
python $LAUNCHPATH/hyperparameters.py --model-name=$MODEL_HF \
--train=$TRAIN_PATH \
--valid=$VALID_PATH \
--out-dir=$OUTPUT_DIR \
--tag="1" \
--n-trials=$N_TRIAL
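Under the hood, this tuning relies on Optuna. A minimal study with the median pruner, comparable in spirit to the 80-trial setup used in the article, looks like the sketch below; the objective is a placeholder, not the actual training loop of hyperparameters.py (a real objective would report intermediate validation metrics so the pruner can stop unpromising trials).

```python
# Minimal Optuna sketch with a median pruner (placeholder objective, not the real training loop).
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    r_lora = trial.suggest_categorical("r_lora", [4, 8, 16])
    # ... train with these hyperparameters and return the validation metric ...
    return lr * r_lora  # placeholder score

study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=80)
print(study.best_params)
```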
The hyperparameters estimated for BioGPT were reused for GPT-2.
In the article, BioGPT was fine-tuned according to the tuned hyperparameters, i.e. batch size 16, LoRA r = 8, LoRA alpha = 16, and learning rate 1e-4 (see the command below).
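With the peft library, those hyperparameters translate roughly into the configuration sketched below. The target_modules and dropout values are assumptions for illustration; finetune.py may configure the adapters differently.

```python
# Sketch: how the tuned hyperparameters map onto a peft LoRA configuration
# (target_modules and lora_dropout are assumptions; finetune.py may differ).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
lora_config = LoraConfig(
    r=8,                                   # --r_lora
    lora_alpha=16,                         # --alpha_lora
    lora_dropout=0.05,                     # illustrative
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```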
In this example, we used the previously generated abstracts, but the training set is too small and training will only lead to null performance; it is meant for illustration only.
conda activate qlora
LAUNCHPATH="./app/biogpt-lora"
VALID_PATH="data/examples-validation/sub_valid.json"
TRAIN_PATH="output/examples/generations/out_prompts_example-input.json"
MODEL_HF="microsoft/biogpt"
OUTPUT_DIR="output/examples/fine-tuning/biogpt"
mkdir -p $OUTPUT_DIR
python $LAUNCHPATH/finetune.py --model-name=$MODEL_HF \
--train=$TRAIN_PATH \
--valid=$VALID_PATH \
--batch_size=16 \
--r_lora=8 \
--alpha_lora=16 \
--lr=1e-4 \
--out-dir=$OUTPUT_DIR
For the other arguments, see:
python $LAUNCHPATH/finetune.py --help
For training a BioGPT-Large model, simply change the --model-name argument to 'microsoft/BioGPT-Large'.
The same applies to the GPT-2 baseline:
LAUNCHPATH="./app/biogpt-lora"
VALID_PATH="data/examples-validation/sub_valid.json"
TRAIN_PATH="output/examples/generations/out_prompts_example-input.json"
MODEL_HF="gpt2-medium"
OUTPUT_DIR="output/examples/fine-tuning/gpt2"
mkdir -p $OUTPUT_DIR
python $LAUNCHPATH/finetune-gpt2.py --model-name=$MODEL_HF \
--train=$TRAIN_PATH \
--valid=$VALID_PATH \
--batch_size=16 \
--r_lora=8 \
--alpha_lora=16 \
--lr=1e-4 \
--out-dir=$OUTPUT_DIR
For the other arguments, see:
python $LAUNCHPATH/finetune-gpt2.py --help
Here is an example with a BioGPT-Large model whose adapters were fine-tuned on the created Diversity-synt dataset. We use the same evaluation dataset as in the article. The parameters for decoding were also tuned.
You can use either the adapters uploaded on the Hugging Face Hub or simply the path to the directory containing the trained adapters, by removing the --hf argument.
Similarly, use inference_eval_gpt2.py for inference with GPT-2 models. As they only serve as baselines, we did not push them to the HF Hub.
conda activate qlora
LAUNCHPATH="./app/biogpt-lora"
MODEL="microsoft/BioGPT-Large"
ADAPTERS="mdelmas/BioGPT-Large-Natural-Products-RE-Diversity-synt-v1.0"
OUTPUTDIR="output/examples/evals/biogpt/4"
TEST="data/test-set/curated_test_set.json"
mkdir -p $OUTPUTDIR
python $LAUNCHPATH/inference_eval.py \
--source-model=$MODEL \
--lora-adapters=$ADAPTERS \
--hf \
--test=$TEST \
--output-dir=$OUTPUTDIR \
--valid-b-size=2
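If you just want to run the published checkpoint without the evaluation script, a minimal loading sketch with transformers + peft is shown below. The prompt and decoding settings are illustrative only (the exact prompt template and the tuned decoding parameters are handled by inference_eval.py).

```python
# Sketch: run the published LoRA adapters on top of BioGPT-Large
# (prompt and decoding settings are illustrative, not the tuned ones).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "microsoft/BioGPT-Large"
adapters = "mdelmas/BioGPT-Large-Natural-Products-RE-Diversity-synt-v1.0"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapters)
model.eval()

abstract = "Five isocoumarin derivatives were isolated from Lachnum papyraceum ..."
inputs = tokenizer(abstract, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```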
For more details, see the corresponding repository. We are very grateful to the authors for sharing their code.
conda env create -f env/seq2rel.yml
Seq2rel uses a particular data format, and this converter can be used to convert the generated (or raw) data from our json format to seq2rel's format.
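The conversion essentially turns each (abstract, relations) entry into a tab-separated line in which the target is a linearized sequence of entity mentions and special relation tokens. The sketch below is only schematic: the JSON keys and the special tokens used by convert_dataset_to_seq2rel.py may differ.

```python
# Schematic sketch of the json -> seq2rel conversion (keys and special tokens are assumptions).
def to_seq2rel_line(item):
    """item: {"abstract": str, "relations": [[organism, chemical], ...]}"""
    target = " ".join(
        f"{org} @ORGANISM@ {chem} @CHEMICAL@ @NPR@" for org, chem in item["relations"]
    )
    text = item["abstract"].replace("\t", " ").replace("\n", " ")
    return f"{text}\t{target}"

example = {
    "abstract": "Maytenus heterophylla yielded beta-amyrin and maytenfolic acid.",
    "relations": [
        ["Maytenus heterophylla", "beta-amyrin"],
        ["Maytenus heterophylla", "maytenfolic acid"],
    ],
}
print(to_seq2rel_line(example))
```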
conda activate seq2rel
LAUNCHPATH="./app/seq2rel"
OUTPUTDIR="output/examples/generations/seq2rel"
python $LAUNCHPATH/convert_dataset_to_seq2rel.py \
--input-dir="output/examples/generations" \
--output-dir=$OUTPUTDIR
Similarly to what was done for BioGPT, we again used Optuna for hyperparameter tuning, as in the original Seq2rel article.
conda activate seq2rel
TAG="1"
LAUNCHPATH="app/seq2rel"
OUTPUT_DIR="output/examples/fine-tuning/hp-tuning/seq2rel"
OUTPUT_SERIALZE="output/examples/fine-tuning/hp-tuning/seq2rel/$TAG"
CONFIG="data/examples/seq2rel-configs/hp-tuning.jsonnet"
N_TRIAL=3
mkdir -p $OUTPUT_SERIALZE
python $LAUNCHPATH/seq2rel_hp_finetuning.py --out-dir=$OUTPUT_DIR \
--output-serialize=$OUTPUT_SERIALZE \
--config=$CONFIG \
--n-trials=$N_TRIAL
rm -rf $OUTPUT_SERIALZE
For more details and examples, see the corresponding repository.
conda activate seq2rel
LAUNCHPATH="app/seq2rel"
OUTPUT_DIR="output/examples/fine-tuning/seq2rel"
CONFIG="data/examples/seq2rel-configs/seq2rel-finetuning.jsonnet"
allennlp train $CONFIG \
--serialization-dir $OUTPUT_DIR \
--include-package "seq2rel"
Here, we evaluate the performances of a Seq2rel model on the provided curated evaluation dataset.
export TMPDIR="/path/to/tmp/dir"
MODEL="output/examples/fine-tuning/seq2rel/model.tar.gz"
TEST_PATH="data/test-set/seq2rel/curated_test_set.tsv"
OUTPUT_DIR="output/examples/evals/seq2rel"
mkdir -p $OUTPUT_DIR
allennlp evaluate "$MODEL" "$TEST_PATH" \
--output-file "$OUTPUT_DIR/test_metrics.jsonl" \
--predictions-output-file "$OUTPUT_DIR/test_predictions.jsonl" \
--include-package "seq2rel"
Use the same environment (llm) as above.
The following script uses archetypal sentences extracted from abstracts as demonstrative examples in the few-shot learning setting. For instance:
INPUT: The antimicrobially active EtOH extracts of Maytenus heterophylla yielded a new dihydroagarofuran alkaloid, 1beta-acetoxy-9alpha-benzoyloxy-2beta,6alpha-dinicotinoyloxy-beta-dihydroagarofuran, together with the known compounds beta-amyrin, maytenfolic acid, 3alpha-hydroxy-2-oxofriedelane-20alpha-carboxylic acid, lup-20(29)-ene-1beta,3beta-diol, (-)-4'-methylepigallocatechin, and (-)-epicatechin.
OUTPUT: Maytenus heterophylla produces 1beta-acetoxy-9alpha-benzoyloxy-2beta,6alpha-dinicotinoyloxy-beta-dihydroagarofuran. Maytenus heterophylla produces beta-amyrin. Maytenus heterophylla produces maytenfolic acid. Maytenus heterophylla produces 3alpha-hydroxy-2-oxofriedelane-20alpha-carboxylic acid. Maytenus heterophylla produces lup-20(29)-ene-1beta,3beta-diol. Maytenus heterophylla produces (-)-4'-methylepigallocatechin. Maytenus heterophylla produces (-)-epicatechin.
From the 5 examples provided, the model is expected to implicitly perform the same task on the last input. Input prompts are adapted to either purely generative (classic) or instruction-tuned (instruct) models via the --prompt-type argument.
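A few-shot prompt of the "classic" kind is essentially a concatenation of INPUT/OUTPUT pairs followed by the new input, as in the sketch below. It is illustrative only and does not reproduce the exact templates used in call_icl.py.

```python
# Illustrative few-shot prompt construction (not the exact templates of call_icl.py).
def build_few_shot_prompt(demonstrations, query):
    """demonstrations: list of (input_text, output_text) pairs."""
    blocks = [f"INPUT: {inp}\nOUTPUT: {out}" for inp, out in demonstrations]
    blocks.append(f"INPUT: {query}\nOUTPUT:")
    return "\n\n".join(blocks)
```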
conda activate llm
LAUNCHPATH="app/few-shots"
MODEL="/path/to/ggmlorgguf/model"
TEST="data/test-set/curated_test_set.json"
OUTPUT_DIR="output/examples/evals/icl"
mkdir -p $OUTPUT_DIR
python $LAUNCHPATH/call_icl.py --model=$MODEL \
--input-file=$TEST \
--out-dir=$OUTPUT_DIR \
--prompt-type="instruct"
See --help on call_icl.py for additional arguments.
From the output predictions (output_curated_test_set.json), the performance of each model can be evaluated with the following:
LAUNCHPATH="app/few-shots"
PREDS="output/examples/evals/icl/output_curated_test_set.json"
REF="data/test-set/curated_test_set.json"
OUTPUTDIR="output/examples/evals/icl"
python $LAUNCHPATH/compute_metrics.py --input=$PREDS \
--output-file="$OUTPUTDIR/perf.json" \
--test-file=$REF
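Conceptually, the reported scores reduce to micro precision/recall/F1 over the sets of predicted and reference (organism, chemical) pairs, as in the sketch below; compute_metrics.py may additionally normalise entity mentions (e.g. via synonyms) before matching.

```python
# Sketch: micro precision / recall / F1 over (organism, chemical) relation pairs
# (compute_metrics.py may additionally normalise entity mentions).
def micro_prf(predicted, reference):
    """predicted, reference: iterables of (organism, chemical) string pairs."""
    pred = {(o.lower().strip(), c.lower().strip()) for o, c in predicted}
    ref = {(o.lower().strip(), c.lower().strip()) for o, c in reference}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```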
If you found the paper and/or this repository useful, please consider citing our work:
@article{delmas-etal-2024-relation,
title = "Relation Extraction in Underexplored Biomedical Domains: A Diversity-optimized Sampling and Synthetic Data Generation Approach",
author = "Delmas, Maxime and
Wysocka, Magdalena and
Freitas, Andr{\'e}",
journal = "Computational Linguistics",
volume = "50",
number = "3",
month = sep,
year = "2024",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2024.cl-3.4",
doi = "10.1162/coli_a_00520",
pages = "953--1000",
abstract = "The sparsity of labeled data is an obstacle to the development of Relation Extraction (RE) models and the completion of databases in various biomedical areas. While being of high interest in drug-discovery, the literature on natural products, reporting the identification of potential bioactive compounds from organisms, is a concrete example of such an overlooked topic. To mark the start of this new task, we created the first curated evaluation dataset and extracted literature items from the LOTUS database to build training sets. To this end, we developed a new sampler, inspired by diversity metrics in ecology, named Greedy Maximum Entropy sampler (https://github.com/idiap/gme-sampler). The strategic optimization of both balance and diversity of the selected items in the evaluation set is important given the resource-intensive nature of manual curation. After quantifying the noise in the training set, in the form of discrepancies between the text of input abstracts and the expected output labels, we explored different strategies accordingly. Framing the task as an end-to-end Relation Extraction, we evaluated the performance of standard fine-tuning (BioGPT, GPT-2, and Seq2rel) and few-shot learning with open Large Language Models (LLMs) (LLaMA 7B-65B). In addition to their evaluation in few-shot settings, we explore the potential of open LLMs as synthetic data generators and propose a new workflow for this purpose. All evaluated models exhibited substantial improvements when fine-tuned on synthetic abstracts rather than the original noisy data. We provide our best performing (F1-score = 59.0) BioGPT-Large model for end-to-end RE of natural products relationships along with all the training and evaluation datasets. See more details at https://github.com/idiap/abroad-re.",
}