This repository contains the code for recent research advances at Shannon.AI. Please open a GitHub issue or email [email protected] with any questions.
CorefQA: Coreference Resolution as Query-based Span Prediction
Wei Wu, Fei Wang, Arianna Yuan, Fei Wu and Jiwei Li
In ACL 2020. paper
If you find this repo helpful, please cite the following:
@article{wu2019coreference,
title={Coreference Resolution as Query-based Span Prediction},
author={Wu, Wei and Wang, Fei and Yuan, Arianna and Wu, Fei and Li, Jiwei},
journal={arXiv preprint arXiv:1911.01746},
year={2019}
}
- Overview
- Hardware Requirements
- Install Package Dependencies
- Data Preprocess
- Download Pretrained MLM
- Training
- Evaluation and Prediction
- Download the Final CorefQA Model
- Descriptions of Directories
- Acknowledgement
- Useful Materials
- Contact
CorefQA achieves 83.1 F1 on the CoNLL-2012 benchmark, a +3.5 F1 improvement over previous SOTA coreference models. The current codebase is written in TensorFlow; we plan to release a PyTorch version soon. The current version only supports training on TPUs and testing on GPUs (due to the annoying features of TF and TPUs), so you have to transfer all saved checkpoints from TPUs to GPUs for evaluation (we will fix this soon). Please follow the parameter settings in the logs directory to reproduce the performance.
Model | F1 (%) |
---|---|
Previous SOTA (Joshi et al., 2019a) | 79.6 |
CorefQA + SpanBERT-large | 83.1 |
TPU for training: Cloud TPU v3-8 device (128 GB memory) with TensorFlow 1.15 and Python 3.5.
GPU for evaluation: GPU with CUDA 10.0, TensorFlow 1.15 and Python 3.5.
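The version constraints above can be checked before installing anything else. The following is a minimal sketch; the helper name `check_versions` is hypothetical and not part of this repository.

```python
def check_versions(python_version, tf_version):
    """Return a list of mismatches against the versions listed above.

    python_version: tuple such as (3, 5, 2), e.g. sys.version_info[:3]
    tf_version: string such as "1.15.0", e.g. tensorflow.__version__
    """
    problems = []
    major_minor = tuple(python_version[:2])
    if major_minor != (3, 5):
        problems.append("expected Python 3.5, got %d.%d" % major_minor)
    if tf_version.split(".")[:2] != ["1", "15"]:
        problems.append("expected TensorFlow 1.15, got %s" % tf_version)
    return problems


if __name__ == "__main__":
    import sys
    # The TensorFlow version is passed as a string so this sketch runs
    # even on a machine where TensorFlow is not installed yet.
    print(check_versions(sys.version_info[:3], "1.15.0"))
```

An empty list means both versions match the setup described here.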
$ python3 -m pip install --user virtualenv
$ virtualenv --python=python3.5 ~/corefqa_venv
$ source ~/corefqa_venv/bin/activate
$ cd CorefQA
$ pip install -r requirements.txt
# If you are using TPU, please run the following commands:
$ pip install --upgrade google-api-python-client
$ pip install --upgrade oauth2client
- Download the officially released OntoNotes 5.0 (LDC2013T19).
- Preprocess the OntoNotes 5.0 annotation files for the CoNLL-2012 coreference resolution task.
Run the following command with Python 2:
bash ./scripts/data/preprocess_ontonotes_annfiles.sh <path_to_LDC2013T19-ontonotes5_directory> <path_to_save_CoNLL12_coreference_resolution_directory> <language>
It creates {train/dev/test}.{language}.v4_gold_conll files in the directory <path_to_save_CoNLL12_coreference_resolution_directory>.
<language> can be english, arabic or chinese. In this paper, we set <language> to english.
If you want to use Python 3, please refer to the guideline.
- Generate TFRecord files for experiments.
Run the following command with Python 3:
bash ./scripts/data/generate_tfrecord_dataset.sh <path_to_save_CoNLL12_coreference_resolution_directory> <path_to_save_tfrecord_directory> <path_to_pretrain_mlm_vocab_file>
It creates {train/dev/test}.overlap.corefqa.{language}.tfrecord files in the directory <path_to_save_CoNLL12_coreference_resolution_directory>.
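To confirm that both preprocessing steps produced their outputs, the file name patterns above can be expanded and checked. This is a minimal sketch; the helper names are hypothetical, and only the two file name patterns come from this README.

```python
import os

SPLITS = ("train", "dev", "test")


def expected_conll_files(language="english"):
    """File names written by the OntoNotes preprocessing step."""
    return ["%s.%s.v4_gold_conll" % (split, language) for split in SPLITS]


def expected_tfrecord_files(language="english"):
    """File names written by the TFRecord generation step."""
    return ["%s.overlap.corefqa.%s.tfrecord" % (split, language)
            for split in SPLITS]


def missing_files(directory, filenames):
    """Return the expected file names that do not exist in `directory`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(directory, name))]
```

Calling `missing_files(<output_directory>, expected_tfrecord_files("english"))` should return an empty list once the generation script has finished.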
In our experiments, we used pretrained masked language models (MLMs) to initialize the mention_proposal and corefqa models.
- Download the pretrained models.
Run bash ./scripts/data/download_pretrained_mlm.sh <path_to_save_pretrained_mlm> <model_sign> to download and unzip the pretrained MLM models.
<model_sign> should take one of the values [bert_base, bert_large, spanbert_base, spanbert_large, bert_tiny].
bert_base, bert_large, spanbert_base and spanbert_large are trained with a cased (uppercase and lowercase tokens) vocabulary, so use them with the cased train/dev/test coreference datasets.
bert_tiny is trained with an uncased (lowercase tokens) vocabulary, so use it with the uncased train/dev/test coreference datasets. We use the tinyBERT model for fast debugging.
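The cased/uncased distinction determines the value passed to the --do_lower_case flag used by the training scripts in this README. The mapping can be sketched as follows; the helper name `do_lower_case` is hypothetical.

```python
# Model signs listed in this README, grouped by vocabulary type.
CASED_MODELS = {"bert_base", "bert_large", "spanbert_base", "spanbert_large"}
UNCASED_MODELS = {"bert_tiny"}


def do_lower_case(model_sign):
    """Return True when input text should be lowercased for this model."""
    if model_sign in CASED_MODELS:
        return False
    if model_sign in UNCASED_MODELS:
        return True
    raise ValueError("unknown model_sign: %r" % model_sign)
```

For example, `do_lower_case("spanbert_large")` is `False`, which matches the `--do_lower_case=False` setting in the SpanBERT scripts.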
- Transform SpanBERT from PyTorch to TensorFlow.
After downloading bert_<scale> to <path_to_bert_<scale>_tf_dir> and spanbert_<scale> to <path_to_spanbert_<scale>_pytorch_dir>, you can transform the SpanBERT model to TensorFlow; the result is saved to the directory <path_to_save_spanbert_tf_checkpoint_dir>. <scale> should take the value of [base, large].
We need to transform the SpanBERT checkpoints from PyTorch to TF because the officially released models were trained with PyTorch.
Run bash ./scripts/data/transform_ckpt_pytorch_to_tf.sh <model_name> <path_to_spanbert_<scale>_pytorch_dir> <path_to_bert_<scale>_tf_dir> <path_to_save_spanbert_tf_checkpoint_dir> and the TF version of <model_name> will be saved in <path_to_save_spanbert_tf_checkpoint_dir>.
<model_name> should take the value of [spanbert_base, spanbert_large].
<scale> indicates that the bert_model.ckpt in <path_to_bert_<scale>_tf_dir> should have the same scale (base or large) as the bert_model.bin in <path_to_spanbert_<scale>_pytorch_dir>.
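If you drive the conversion from Python, the argument order above can be captured in a small wrapper. This is a minimal sketch: the function name `build_transform_command` is hypothetical, and it only builds the argument list without running the script.

```python
VALID_MODEL_NAMES = ("spanbert_base", "spanbert_large")


def build_transform_command(model_name, spanbert_pytorch_dir,
                            bert_tf_dir, output_dir):
    """Build the argument list for transform_ckpt_pytorch_to_tf.sh.

    The caller is responsible for passing directories of the same
    scale (base or large), as required above.
    """
    if model_name not in VALID_MODEL_NAMES:
        raise ValueError("model_name must be one of %s" % (VALID_MODEL_NAMES,))
    return [
        "bash", "./scripts/data/transform_ckpt_pytorch_to_tf.sh",
        model_name, spanbert_pytorch_dir, bert_tf_dir, output_dir,
    ]

# To execute it from the repository root:
#   import subprocess
#   subprocess.run(build_transform_command(...), check=True)
```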
Following the pipeline described in the paper, you need to:
- load a pretrained SpanBERT model.
- finetune the SpanBERT model on the combination of SQuAD and Quoref datasets.
- pretrain the mention proposal model on the coref dataset.
- jointly train the mention proposal model and the mention linking model.
Notice: For the second and third steps, you can either train the models yourself or load our pretrained models.
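Each step in this pipeline is initialized from the checkpoint produced by the step before it. The chain can be written down explicitly, as in the sketch below; the stage names and the `Stage` structure are hypothetical labels for illustration, not identifiers from the codebase.

```python
from collections import namedtuple

# name: label for the stage; dataset: data it trains on;
# init_from: the stage whose checkpoint initializes this one.
Stage = namedtuple("Stage", ["name", "dataset", "init_from"])

PIPELINE = [
    Stage("load_spanbert", None, None),
    Stage("finetune_squad", "SQuAD 2.0", "load_spanbert"),
    Stage("finetune_quoref", "Quoref", "finetune_squad"),
    Stage("pretrain_mention_proposal", "CoNLL-2012", "finetune_quoref"),
    Stage("joint_corefqa", "CoNLL-2012", "pretrain_mention_proposal"),
]


def check_chain(pipeline):
    """Verify that every stage is initialized from the stage before it."""
    for previous, current in zip(pipeline, pipeline[1:]):
        if current.init_from != previous.name:
            return False
    return pipeline[0].init_from is None
```

The QA finetuning step is split into two stages here because SQuAD 2.0 and Quoref are trained one after the other.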
We finetune the SpanBERT model on the SQuAD 2.0 and Quoref QA tasks for data augmentation before the coreference resolution task.
- You can directly download the models already finetuned on these datasets (Download Data Augmentation Models on SQuAD and Quoref: link).
Run ./scripts/data/download_squad2_finetune_model.sh <model-scale> <path-to-save-model> to download the SpanBERT model finetuned on SQuAD 2.0.
<model-scale> should take the value of [base, large].
<path-to-save-model> is the path where the SpanBERT model finetuned on SQuAD 2.0 is saved.
- Or finetune the SpanBERT model on QA tasks yourself.
- Download SQuAD 2.0 train and dev sets.
- Download Quoref train and dev sets.
- Finetune the SpanBERT model on a Google Cloud v3-8 TPU.
For SQuAD 2.0, run the script in ./scripts/models/squad_tpu.sh
REPO_PATH=/home/shannon/coref-tf
export TPU_NAME=tf-tpu
export PYTHONPATH="$PYTHONPATH:$REPO_PATH"
SQUAD_DIR=gs://qa_tasks/squad2
BERT_DIR=gs://pretrained_mlm_checkpoint/spanbert_large_tf
OUTPUT_DIR=gs://corefqa_output_squad/spanbert_large_squad2_2e-5
python3 ${REPO_PATH}/run/run_squad.py \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$BERT_DIR/bert_model.ckpt \
--do_train=True \
--train_file=$SQUAD_DIR/train-v2.0.json \
--do_predict=True \
--predict_file=$SQUAD_DIR/dev-v2.0.json \
--train_batch_size=8 \
--learning_rate=2e-5 \
--num_train_epochs=4.0 \
--max_seq_length=384 \
--do_lower_case=False \
--doc_stride=128 \
--output_dir=${OUTPUT_DIR} \
--use_tpu=True \
--tpu_name=$TPU_NAME \
--version_2_with_negative=True
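The --train_batch_size and --num_train_epochs flags together determine how many optimizer steps are run. The sketch below assumes that run_squad.py keeps the step computation of the run_squad.py script in google-research/bert, which has not been verified against this codebase; the example count used in the usage note is made up.

```python
def num_train_steps(num_examples, train_batch_size, num_train_epochs):
    """Total training steps, assuming the google-research/bert formula:
    int(num_examples / batch_size * epochs)."""
    return int(num_examples / train_batch_size * num_train_epochs)
```

With a hypothetical 1000 training examples, a batch size of 8 and 4 epochs, this gives 500 steps.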
After getting the best model on SQuAD 2.0 (chosen based on performance on the dev set), start finetuning the saved model on Quoref.
Run the script in ./scripts/models/quoref_tpu.sh
REPO_PATH=/home/shannon/coref-tf
export TPU_NAME=tf-tpu
export PYTHONPATH="$PYTHONPATH:$REPO_PATH"
QUOREF_DIR=gs://qa_tasks/quoref
BERT_DIR=gs://corefqa_output_squad/spanbert_large_squad2_2e-5
OUTPUT_DIR=gs://corefqa_output_quoref/spanbert_large_squad2_best_quoref_3e-5
python3 ${REPO_PATH}/run/run_quoref.py \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$BERT_DIR/best_bert_model.ckpt \
--do_train=True \
--train_file=$QUOREF_DIR/quoref-train-v0.1.json \
--do_predict=True \
--predict_file=$QUOREF_DIR/quoref-dev-v0.1.json \
--train_batch_size=8 \
--learning_rate=3e-5 \
--num_train_epochs=5 \
--max_seq_length=384 \
--do_lower_case=False \
--doc_stride=128 \
--output_dir=${OUTPUT_DIR} \
--use_tpu=True \
--tpu_name=$TPU_NAME
We use the best model on Quoref (chosen based on performance on the dev set) to initialize the CorefQA model.
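Choosing the best model amounts to taking the checkpoint with the highest dev-set score. A minimal sketch, assuming you have already collected one score per checkpoint; the function name and the checkpoint names in the usage note are hypothetical.

```python
def select_best_checkpoint(dev_scores):
    """Return the checkpoint with the highest dev-set score.

    dev_scores: dict mapping checkpoint path -> dev score (higher is better).
    """
    if not dev_scores:
        raise ValueError("no checkpoints were evaluated")
    return max(dev_scores, key=dev_scores.get)
```

For example, given scores for `model.ckpt-500`, `model.ckpt-1000` and `model.ckpt-1500`, the function returns whichever of the three scored highest on the dev set.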
1.1. You can download the pre-trained mention proposal model (including the model, meta and index files).
1.2. Or train the mention proposal model yourself.
The script can be found in ./scripts/models/mention_tpu.sh.
REPO_PATH=/home/shannon/coref-tf
export PYTHONPATH="$PYTHONPATH:$REPO_PATH"
export TPU_NAME=tf-tpu
export TPU_ZONE=europe-west4-a
export GCP_PROJECT=xiaoyli-20-01-4820
BERT_DIR=gs://corefqa_output_quoref/spanbert_large_squad2_best_quoref_1e-5
DATA_DIR=gs://corefqa_data/final_overlap_384_6
OUTPUT_DIR=gs://corefqa_output_mention_proposal/squad_quoref_large_384_6_1e5_8_0.2
python3 ${REPO_PATH}/run/run_mention_proposal.py \
--output_dir=$OUTPUT_DIR \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$BERT_DIR/bert_model.ckpt \
--vocab_file=$BERT_DIR/vocab.txt \
--logfile_path=$OUTPUT_DIR/train.log \
--num_epochs=8 \
--keep_checkpoint_max=50 \
--save_checkpoints_steps=500 \
--train_file=$DATA_DIR/train.corefqa.english.tfrecord \
--dev_file=$DATA_DIR/dev.corefqa.english.tfrecord \
--test_file=$DATA_DIR/test.corefqa.english.tfrecord \
--do_train=True \
--do_eval=False \
--do_predict=False \
--learning_rate=1e-5 \
--dropout_rate=0.2 \
--mention_threshold=0.5 \
--hidden_size=1024 \
--num_docs=5604 \
--window_size=384 \
--num_window=6 \
--max_num_mention=60 \
--start_end_share=False \
--loss_start_ratio=0.3 \
--loss_end_ratio=0.3 \
--loss_span_ratio=0.3 \
--use_tpu=True \
--tpu_name=$TPU_NAME \
--tpu_zone=$TPU_ZONE \
--gcp_project=$GCP_PROJECT \
--num_tpu_cores=1 \
--seed=2333
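The three --loss_*_ratio flags and --mention_threshold are easiest to read as a weighted loss and a score cutoff. The sketch below is an assumption about their role based only on the flag names: it assumes the start, end and span losses are combined linearly and that mentions scoring above the threshold are kept. Check models/ for the actual computation.

```python
def mention_proposal_loss(start_loss, end_loss, span_loss,
                          start_ratio=0.3, end_ratio=0.3, span_ratio=0.3):
    """Assumed combination: a weighted sum of the three loss terms."""
    return (start_ratio * start_loss
            + end_ratio * end_loss
            + span_ratio * span_loss)


def filter_mentions(scored_mentions, mention_threshold=0.5):
    """Assumed cutoff: keep (span, score) pairs scoring above the threshold."""
    return [(span, score) for span, score in scored_mentions
            if score > mention_threshold]
```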
- Jointly train the mention proposal model and linking model on CoNLL-12.
After getting the best mention proposal model on the dev set, start jointly training the mention proposal and linking tasks.
The script can be found in ./scripts/models/corefqa_tpu.sh
REPO_PATH=/home/shannon/coref-tf
export PYTHONPATH="$PYTHONPATH:$REPO_PATH"
export TPU_NAME=tf-tpu
export TPU_ZONE=europe-west4-a
export GCP_PROJECT=xiaoyli-20-01-4820
BERT_DIR=gs://corefqa_output_mention_proposal/output_bertlarge
DATA_DIR=gs://corefqa_data/final_overlap_384_6
OUTPUT_DIR=gs://corefqa_output_corefqa/squad_quoref_mention_large_384_6_8e4_8_0.2
python3 ${REPO_PATH}/run/run_corefqa.py \
--output_dir=$OUTPUT_DIR \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$BERT_DIR/best_bert_model.ckpt \
--vocab_file=$BERT_DIR/vocab.txt \
--logfile_path=$OUTPUT_DIR/train.log \
--num_epochs=8 \
--keep_checkpoint_max=50 \
--save_checkpoints_steps=500 \
--train_file=$DATA_DIR/train.corefqa.english.tfrecord \
--dev_file=$DATA_DIR/dev.corefqa.english.tfrecord \
--test_file=$DATA_DIR/test.corefqa.english.tfrecord \
--do_train=True \
--do_eval=False \
--do_predict=False \
--learning_rate=8e-4 \
--dropout_rate=0.2 \
--mention_threshold=0.5 \
--hidden_size=1024 \
--num_docs=5604 \
--window_size=384 \
--num_window=6 \
--max_num_mention=50 \
--start_end_share=False \
--max_span_width=10 \
--max_candiate_mentions=100 \
--top_span_ratio=0.2 \
--max_top_antecedents=30 \
--max_query_len=150 \
--max_context_len=150 \
--sec_qa_mention_score=False \
--use_tpu=True \
--tpu_name=$TPU_NAME \
--tpu_zone=$TPU_ZONE \
--gcp_project=$GCP_PROJECT \
--num_tpu_cores=1 \
--seed=2333
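The --top_span_ratio and --max_candiate_mentions flags bound how many candidate mentions are passed on to the linking step. The sketch below assumes the pruning rule commonly used in span-based coreference models (keep a fraction of the document length, capped by a maximum); this is an assumption and may differ from what run_corefqa.py does. The parameter name `max_candidate_mentions` is a hypothetical Python name for illustration.

```python
import math


def num_kept_mentions(num_tokens, top_span_ratio=0.2,
                      max_candidate_mentions=100):
    """Assumed rule: keep floor(num_tokens * ratio) spans, capped."""
    return min(max_candidate_mentions,
               int(math.floor(num_tokens * top_span_ratio)))


def top_k_spans(scored_spans, k):
    """Return the k highest-scoring ((start, end), score) pairs."""
    return sorted(scored_spans, key=lambda item: item[1], reverse=True)[:k]
```

Under this assumption, a 384-token window keeps at most 76 candidate mentions, and any document of 500 tokens or more hits the cap of 100.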
Currently, evaluation is conducted on a set of saved checkpoints after the training process; evaluation during training is NOT supported. Please transfer all checkpoints (the output directory is set with --output_dir=<path_to_output_directory> when running run_<model_sign>.py) from TPUs to GPUs for evaluation.
This can be achieved by downloading the output directory from Google Cloud Storage.
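One way to download the output directory is a recursive `gsutil cp`. The sketch below only builds the command; the function name is hypothetical, and it assumes `gsutil` is installed and authenticated on the GPU machine.

```python
def build_download_command(gcs_output_dir, local_dir):
    """Build a recursive, parallel gsutil copy from GCS to a local directory."""
    if not gcs_output_dir.startswith("gs://"):
        raise ValueError("expected a gs:// path, got %r" % gcs_output_dir)
    # -m runs the copy in parallel; -r copies the directory recursively.
    return ["gsutil", "-m", "cp", "-r", gcs_output_dir, local_dir]

# To execute it:
#   import subprocess
#   subprocess.run(build_download_command(...), check=True)
```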
The performance on the test set is obtained by using the model achieving the highest F1-score on the dev set.
Set --do_eval=True, --do_train=False and --do_predict=False for run_<model_sign>.py to start the evaluation process on a set of saved checkpoints. All other parameters should be the same as in the training process.
<model_sign> should take the value of [mention_proposal, corefqa].
The codebase also provides the option of evaluating a single model/checkpoint. Please set --do_eval=False, --do_train=False and --do_predict=True for run_<model_sign>.py, and pass the checkpoint path with --eval_checkpoint=<path_to_eval_checkpoint_model>.
<model_sign> should take the value of [mention_proposal, corefqa].
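The three run modes differ only in the combination of --do_train, --do_eval and --do_predict. The combinations described in this README can be collected in one place, as in the following sketch; the helper name `mode_flags` is hypothetical.

```python
# Flag combinations taken from the training scripts and the text above.
MODES = {
    "train": {"do_train": True, "do_eval": False, "do_predict": False},
    "eval": {"do_train": False, "do_eval": True, "do_predict": False},
    "predict": {"do_train": False, "do_eval": False, "do_predict": True},
}


def mode_flags(mode, eval_checkpoint=None):
    """Return the command-line flags for one run mode."""
    if mode not in MODES:
        raise ValueError("mode must be one of %s" % sorted(MODES))
    flags = ["--%s=%s" % (name, value)
             for name, value in sorted(MODES[mode].items())]
    if mode == "predict":
        # Evaluating a single checkpoint requires its path.
        if eval_checkpoint is None:
            raise ValueError("predict mode needs eval_checkpoint")
        flags.append("--eval_checkpoint=%s" % eval_checkpoint)
    return flags
```

The returned flags replace the corresponding lines in the training command; all other parameters stay the same.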
You can download the final CorefQA model at link and follow the instructions in the Evaluation and Prediction section to obtain the score reported in the paper.
Name | Descriptions |
---|---|
bert | BERT modules (model, tokenizer, optimization); refer to the google-research/bert repository. |
conll-2012 | official evaluation scripts for the CoNLL-2012 shared task. |
data_utils | modules for processing training data. |
func_builders | the input dataloader and model constructor for CorefQA. |
logs | the log files in our experiments. |
models | an implementation of CorefQA/MentionProposal models based on TF. |
run | modules for data preparation and training models. |
scripts/data | scripts for data preparation and loading pretrained models. |
scripts/models | scripts for {train/evaluate} {mention_proposal/corefqa} models on {TPU/GPU}. |
utils | modules including metrics and optimizers. |
Many thanks to Yuxian Meng and the previous work https://github.com/mandarjoshi90/coref.
Feel free to discuss papers/code with us through issues/emails!