This repository contains the datasets, code, and experimental results of our paper: Extracting Definienda in Mathematical Scholarly Articles with Transformers (accepted at the 2nd Workshop on Information Extraction from Scientific Publications at IJCNLP-AACL 2023).
An 8-minute video about our work:
https://youtu.be/tUioJooDDio?si=6EnTN_5l-9t86IKk
Our slides are also available.
If you do not have access to our store of arXiv paper sources, you may start from Step 2.
Get_paper_IDs.ipynb pulls the arXiv IDs of papers in the combinatorics category and stores them in the CSV file 28477_id+dir.csv.
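For reference, here is a minimal sketch (not the notebook's exact code) of how such IDs could be fetched through the public arXiv API; the math.CO category query and the output filename are illustrative assumptions:

```python
import csv
import urllib.request
import xml.etree.ElementTree as ET

# Query the public arXiv API for papers in the math.CO (combinatorics) category.
url = ("http://export.arxiv.org/api/query"
       "?search_query=cat:math.CO&start=0&max_results=100")
with urllib.request.urlopen(url) as response:
    feed = ET.fromstring(response.read())

# Each Atom <entry> carries an <id> such as http://arxiv.org/abs/2301.01234v1.
ns = {"atom": "http://www.w3.org/2005/Atom"}
ids = [entry.find("atom:id", ns).text.rsplit("/", 1)[-1]
       for entry in feed.findall("atom:entry", ns)]

# Store the IDs in a CSV file (illustrative filename).
with open("paper_ids.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id"])
    writer.writerows([pid] for pid in ids)
```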
Then run get_list_of_papers (make sure that 28477_id+dir.csv, get_paper.sh, and add_extthm.py are copied into the same directory) on the server where the arXiv paper sources are stored, to copy the .tex files into a folder que_tex/.
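Purely as an illustration (the actual logic lives in get_list_of_papers), the copying step could look roughly like this; the CSV column names and the source-tree layout are assumptions:

```python
import csv
import shutil
from pathlib import Path

SOURCE_ROOT = Path("/data/arxiv_sources")  # assumed location of the arXiv sources
DEST = Path("que_tex")
DEST.mkdir(exist_ok=True)

# 28477_id+dir.csv is assumed to map each paper ID to its source directory.
with open("28477_id+dir.csv", newline="") as f:
    for row in csv.DictReader(f):
        paper_dir = SOURCE_ROOT / row["dir"]
        for tex_file in paper_dir.glob("*.tex"):
            # Prefix with the paper ID so files from different papers don't clash.
            shutil.copy(tex_file, DEST / f"{row['id']}_{tex_file.name}")
```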
Still on the same server used in the raw-data-collection step, run

```
python get_def-term_pairs.py que_tex out_def_all.csv
```

to write the definitions found in all the .tex files to a single CSV file, out_def_all.csv.
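The script itself is in the repository; as a sketch of the idea only, definition environments could be pulled out of LaTeX sources along these lines (the environment name and CSV layout are assumptions):

```python
import csv
import re
from pathlib import Path

# Match the body of \begin{definition}...\end{definition} environments.
DEF_PATTERN = re.compile(r"\\begin\{definition\}(.*?)\\end\{definition\}", re.DOTALL)

rows = []
for tex_file in Path("que_tex").glob("*.tex"):
    text = tex_file.read_text(errors="ignore")
    for body in DEF_PATTERN.findall(text):
        rows.append([tex_file.name, " ".join(body.split())])

with open("out_def_all.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "definition"])
    writer.writerows(rows)
```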
Run Prepare_term-def_dataset.ipynb to clean the noise in the extracted definition-definiendum pairs and to generate an IOB-format dataset for named entity recognition. If you start from this step, you can load out_def_all_1007.csv. Our intermediate outputs for this step are in intermediate_outputs/. The cleaned and labeled data are saved in data/all_labeled_data+ID.csv.
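For readers unfamiliar with the format, IOB tagging marks each token as beginning (B), inside (I), or outside (O) a target span; the label name TERM and the example sentence below are illustrative, not taken from our data:

```python
# A definition sentence with its definiendum "planar graph" tagged in IOB format.
tokens = ["A", "planar", "graph", "is", "a", "graph", "that", "can", "be",
          "drawn", "without", "edge", "crossings", "."]
tags   = ["O", "B-TERM", "I-TERM", "O", "O", "O", "O", "O", "O",
          "O", "O", "O", "O", "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```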
You may run the following notebooks with our labeled data in data/:
The Transformers library's native evaluation output for our experiments can be found in finetuning_results/. Our fine-tuned models are available here:
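As a rough sketch (not the repository's own code) of how a fine-tuned checkpoint could be used for prediction, where "finetuned-definienda-model" is a placeholder path for a downloaded model:

```python
from transformers import pipeline

# Placeholder path; substitute one of the released fine-tuned checkpoints.
extractor = pipeline("token-classification",
                     model="finetuned-definienda-model",
                     aggregation_strategy="simple")

sentence = ("A graph is planar if it can be drawn in the plane "
            "without edge crossings.")
print(extractor(sentence))  # predicted definiendum spans with scores
```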
Our results on the test set can be found in GPT_results/Human_corrected_annotations+gpt_res.csv.
Run Eval_Finetuning_10-fold.ipynb to align the fine-tuned models' predictions with ChatGPT's answers. Our experimental results are in GPT_results/.
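Illustratively (the notebook implements the actual alignment), one simple way to compare the terms extracted by two systems for the same sentence is exact-match precision/recall over normalized strings:

```python
def precision_recall(predicted, gold):
    """Exact-match precision/recall between two sets of extracted terms."""
    predicted = {t.strip().lower() for t in predicted}
    gold = {t.strip().lower() for t in gold}
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Example: comparing a fine-tuned model's output with a reference answer.
print(precision_recall({"planar graph"}, {"planar graph", "edge crossing"}))
```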