Dataset Example

Because the dataset is complex, we replace the dataset preview with an example figure. In the figure, the text is translated into English for comprehension (*); the dataset itself contains only Korean and Hanja text, with no English translation. For readability, only one relation is shown.

Relation information includes (i) subject and object entities in Korean and Hanja (sbj_kor, sbj_han, obj_kor, obj_han), (ii) a relation type (label), and (iii) evidence sentence indexes for each language (evidence_kor, evidence_han). Metadata contains additional information, such as the book from which the text was extracted.
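As a rough illustration of the fields listed above, the sketch below builds one relation record and looks up its evidence sentences. The field names come from this README; the dictionary layout, the placeholder values, the 0-based indexing, and the helper `evidence_sentences` are assumptions made for illustration and may not match what dataset.py actually returns.

```python
# Hypothetical relation record. Field names follow the README;
# the values are placeholders, not real data from HistRED.
relation = {
    "sbj_kor": "<subject entity in Korean>",
    "sbj_han": "<subject entity in Hanja>",
    "obj_kor": "<object entity in Korean>",
    "obj_han": "<object entity in Hanja>",
    "label": "<relation type>",
    # Assumption: evidence indexes are 0-based positions in a sentence list.
    "evidence_kor": [0, 2],
    "evidence_han": [1],
}


def evidence_sentences(sentences, indexes):
    """Return the sentences at the given evidence indexes, in order."""
    return [sentences[i] for i in indexes]


# Placeholder sentence list standing in for a Korean document.
korean_sentences = ["<sentence 0>", "<sentence 1>", "<sentence 2>"]
selected = evidence_sentences(korean_sentences, relation["evidence_kor"])
```

Keeping the Korean and Hanja evidence indexes in separate fields reflects the point that sentence boundaries need not line up between the two languages.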

Corpus of HistRED: << Yeonhaengnok >>

For this dataset, we chose Yeonhaengnok, a collection of records originally written in Hanja (classical Chinese writing) and later translated into Korean. Joseon, the last dynastic kingdom of Korea, lasted just over five centuries, from 1392 to 1897, and many aspects of Korean traditions and customs trace their roots back to this era. Numerous historical documents survive from the Joseon dynasty, including the Annals of Joseon Dynasty (AJD) and the Diaries of the Royal Secretariats (DRS). Note that the majority of Joseon's records were written in Hanja, an archaic form of Chinese writing that differs from modern Chinese, because the Korean language was not standardized until much later.

In short, Yeonhaengnok is a travel diary from the Joseon period. In the past, traveling to other places, particularly to foreign countries, was rare. Intellectuals who traveled to Chung (also referred to as the Qing dynasty) therefore documented their journeys meticulously, and Yeonhaengnok is a compilation of these accounts. Diverse individuals from different generations recorded their business trips along similar routes from Joseon to Chung, focusing on the people, products, and events they encountered. The Institute for the Translation of Korean Classics (ITKC) has open-sourced the original texts and their translations for many historical documents, promoting active historical research. All documents were collected from an open-source database at https://db.itkc.or.kr/.

Properties

Our dataset contains (i) named entities, (ii) relations between the entities, and (iii) parallel relationships between Korean and Hanja texts.
dataset.py returns a processed dataset that can be easily applied to general NLP models.
- For the monolingual setting: KoreanDataset, HanjaDataset
- For the bilingual setting: JointDataset
ner_map.json and label_map.json are dictionaries that map each label class to an index.
Sequence level (SL) is a unit of sequence length for extracting self-contained sub-texts without losing context information for each relation in the text. Each folder SL-k indicates that SL is k.
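The mapping files and the SL-k folder naming described above could be handled roughly as follows. This is a minimal sketch: the JSON string stands in for the contents of label_map.json (the real class names are not shown here), and `sl_folder` is a hypothetical helper, not part of dataset.py.

```python
import json

# Stand-in for the contents of label_map.json: label class -> index.
# The class names are placeholders, not the dataset's real labels.
label_map_text = '{"<relation_a>": 0, "<relation_b>": 1}'
label_to_index = json.loads(label_map_text)

# Invert the mapping so predicted indexes can be turned back into class names.
index_to_label = {index: label for label, index in label_to_index.items()}


def sl_folder(k):
    """Return the folder name for sequence level k, following the SL-k convention."""
    return f"SL-{k}"
```

With the real file, `json.load` on an opened label_map.json would replace the `json.loads` call on the stand-in string.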

Dataset usages

A testbed for evaluating model performance as the sequence length varies.
Relation extraction, especially on non-English or historical corpora.

Citation

@inproceedings{yang-etal-2023-histred,
    title = "{H}ist{RED}: A Historical Document-Level Relation Extraction Dataset",
    author = "Yang, Soyoung  and
      Choi, Minseok  and
      Cho, Youngwoo  and
      Choo, Jaegul",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.180",
    pages = "3207--3224",
}
