forked from dudrrm/HistRED

Source code for paper "HistRED: A Historical Document-Level Relation Extraction Dataset", ACL 2023

brightjade/HistRED

license: cc-by-nc-nd-4.0
task_categories: token-classification
language: ko
tags: art
size_categories: 1K<n<10K

This is the official code for HistRED: A Historical Document-Level Relation Extraction Dataset (ACL 2023). All materials related to this paper can be found here.

  • ACL Anthology: The official proceedings publication.
  • Virtual-ACL 2023: Papers, posters, and presentation slides.
  • arXiv: The camera-ready version of the paper.

Note that this dataset is released under the CC BY-NC-ND 4.0 license. The dataset is available on Hugging Face Datasets.

from datasets import load_dataset

dataset = load_dataset("Soyoung/HistRED")

Dataset Example

Because the dataset is complex, we replace the dataset preview with an example figure. In the figure, the text is translated into English for comprehension (*); the dataset itself contains only Korean and Hanja text, with no English translation. Only one relation is shown for readability.

Relation information includes (i) subject and object entities in Korean and Hanja (sbj_kor, sbj_han, obj_kor, obj_han), (ii) a relation type (label), and (iii) evidence sentence indices for each language (evidence_kor, evidence_han). Metadata contains additional information, such as which book the text is extracted from.

[Figure: dataset example]
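The relation fields listed above can be pictured as a small record. The following is a minimal sketch only: the field names come from the description, while the flat dictionary layout, the placeholder values, and the helper `evidence_sentences` are assumptions for illustration and may not match the actual schema of the released data.

```python
from typing import Dict, List

# Hypothetical relation record. The keys are the field names described above;
# the placeholder values and the flat layout are assumptions.
relation: Dict[str, object] = {
    "sbj_kor": "<subject entity, Korean>",
    "sbj_han": "<subject entity, Hanja>",
    "obj_kor": "<object entity, Korean>",
    "obj_han": "<object entity, Hanja>",
    "label": "<relation type>",
    "evidence_kor": [0, 2],  # indices into the Korean sentences
    "evidence_han": [1],     # indices into the Hanja sentences
}


def evidence_sentences(sentences: List[str], indices: List[int]) -> List[str]:
    """Return the sentences an evidence list points to, skipping invalid indices."""
    return [sentences[i] for i in indices if 0 <= i < len(sentences)]


# Placeholder sentence lists standing in for one document.
korean_sentences = ["<kor sentence 0>", "<kor sentence 1>", "<kor sentence 2>"]
hanja_sentences = ["<han sentence 0>", "<han sentence 1>"]

kor_evidence = evidence_sentences(korean_sentences, relation["evidence_kor"])
han_evidence = evidence_sentences(hanja_sentences, relation["evidence_han"])
```

Keeping separate evidence lists per language reflects that Korean and Hanja texts may be split into sentences differently, so the same relation can be supported by different sentence indices in each.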

Corpus of HistRED: << Yeonhaengnok >>

In this dataset, we choose Yeonhaengnok, a collection of records originally written in Hanja (classical Chinese writing) and later translated into Korean. Joseon, the last dynastic kingdom of Korea, lasted just over five centuries, from 1392 to 1897, and many aspects of Korean traditions and customs trace their roots back to this era. Numerous historical documents exist from the Joseon dynasty, including the Annals of the Joseon Dynasty (AJD) and the Diaries of the Royal Secretariats (DRS). Note that because the Korean language was not standardized until much later, the majority of Joseon's records were written in Hanja, an archaic form of Chinese writing that differs from modern Chinese.

In short, Yeonhaengnok is a travel diary from the Joseon period. In the past, traveling to other places, particularly to foreign countries, was rare. Therefore, intellectuals who traveled to Chung (also referred to as the Qing dynasty) meticulously documented their journeys, and Yeonhaengnok is a compilation of these accounts. Diverse individuals from different generations recorded their business trips along similar routes from Joseon to Chung, focusing on the people, products, and events they encountered. The Institute for the Translation of Korean Classics (ITKC) has open-sourced the original texts and their translations for many historical documents, promoting active historical research. All documents were collected from an open-source database at https://db.itkc.or.kr/.

Properties

  • Our dataset contains (i) named entities, (ii) relations between the entities, and (iii) parallel relationships between Korean and Hanja texts.
  • dataset.py returns a processed dataset that can be easily applied to general NLP models.
    • For the monolingual setting: KoreanDataset, HanjaDataset
    • For the bilingual setting: JointDataset
  • ner_map.json and label_map.json are mapping dictionaries from label class to index.
  • Sequence level (SL) is a unit of sequence length for extracting self-contained sub-texts without losing context information for each relation in the text. Each folder SL-k indicates that SL is k.
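As a rough illustration of the last two properties, the sketch below inverts a class-to-index mapping (useful for decoding model outputs) and builds the path of an SL-k folder. The label names, the inline JSON standing in for label_map.json, the `data` root directory, and the helper `sl_folder` are all hypothetical; only the class-to-index direction and the SL-k naming come from the description above.

```python
import json
from pathlib import Path

# Hypothetical stand-in for the contents of label_map.json (class name -> index).
# The real file defines its own label names.
label_map = json.loads('{"no_relation": 0, "example_relation": 1}')

# Invert the mapping so predicted indices can be turned back into class names.
index_to_label = {index: name for name, index in label_map.items()}


def sl_folder(root: str, k: int) -> Path:
    """Return the folder for sequence level k, following the SL-k naming."""
    return Path(root) / f"SL-{k}"


# "data" is an assumed root directory, not a path taken from the repository.
folder = sl_folder("data", 2)
```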

Dataset usages

  • A testbed for evaluating model performance as the sequence length varies.
  • Relation extraction, especially on non-English or historical corpora.
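One way to use the dataset as a sequence-length testbed is to score the same model on each SL-k split and compare the results. The sketch below assumes relations are compared as (subject, object, label) triples and uses F1 as the metric; both choices, along with the toy triples and the `runs` structure, are illustrative assumptions rather than the evaluation protocol of the paper.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, object, relation label)


def f1_score(gold: Set[Triple], pred: Set[Triple]) -> float:
    """F1 over exact-match relation triples; 0.0 when nothing matches."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy placeholder triples, keyed by sequence level k. Real gold and predicted
# triples would come from the SL-k data and a trained model.
gold_triples = {("a", "b", "r1"), ("c", "d", "r2")}
runs: Dict[int, Set[Triple]] = {
    1: {("a", "b", "r1")},
    2: {("a", "b", "r1"), ("c", "d", "r2")},
}

scores = {k: f1_score(gold_triples, pred) for k, pred in runs.items()}
```

Comparing `scores` across values of k shows how performance changes as more context is included in each input.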

Citation

@inproceedings{yang-etal-2023-histred,
    title = "{H}ist{RED}: A Historical Document-Level Relation Extraction Dataset",
    author = "Yang, Soyoung  and
      Choi, Minseok  and
      Cho, Youngwoo  and
      Choo, Jaegul",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.180",
    pages = "3207--3224",
}
