HAMOD

a High Agreement Multi-lingual Outlier Detection dataset

Data

This site hosts a multi-lingual dataset of manually prepared data for the outlier detection exercise. Outlier detection is the task of selecting the outlier: the word that does not fit into a given set of words according to some (typically semantic) criterion.

Examples

blue, red, green, yellow, orange, black, brown, white, table. Obviously, the last word is the outlier: all the others are names of colours.
bricklayer, lawyer, shop assistant, gentleman, waitress, meteorologist. Gentleman is the outlier because it is not a job.

Dataset format

The dataset consists of plain text files, each containing 8 lines of in-domain words, a blank line, and 8 lines of outlier words. Each file therefore yields 8 exercises for sets of 9 words (the 8 in-domain words plus one of the 8 outliers), or more exercises for shorter sets: for example, 64 exercises for sets of 8 words, obtained by choosing one of the 8 outliers and removing one of the 8 in-domain words. An example file for English covering the set of birds:

swan
duck
seagull
eagle
dove
crow
stork
goose

monkey
salmon
grasshopper
fly
egg
plane
woman
cliff
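
The exercise counts described above can be reproduced with a short script. The following is a minimal sketch, not part of the dataset tooling: the function names (`parse_hamod_file`, `exercises_full`, `exercises_drop_one`) are hypothetical, and it assumes only the file layout described here (in-domain words, one blank line, outlier words).

```python
from itertools import product


def parse_hamod_file(text):
    """Split file content into (in_domain, outliers) at the first blank line."""
    lines = [line.strip() for line in text.strip().splitlines()]
    split_at = lines.index("")
    in_domain = [w for w in lines[:split_at] if w]
    outliers = [w for w in lines[split_at + 1:] if w]
    return in_domain, outliers


def exercises_full(in_domain, outliers):
    """All in-domain words plus one outlier: one exercise per outlier."""
    return [(in_domain + [o], o) for o in outliers]


def exercises_drop_one(in_domain, outliers):
    """Remove one in-domain word and add one outlier, in every combination."""
    result = []
    for skip, o in product(range(len(in_domain)), outliers):
        kept = in_domain[:skip] + in_domain[skip + 1:]
        result.append((kept + [o], o))
    return result


# The English birds file shown above, inlined so the sketch is self-contained.
BIRDS = """swan
duck
seagull
eagle
dove
crow
stork
goose

monkey
salmon
grasshopper
fly
egg
plane
woman
cliff
"""

in_domain, outliers = parse_hamod_file(BIRDS)
full = exercises_full(in_domain, outliers)
shorter = exercises_drop_one(in_domain, outliers)
print(len(full), len(shorter))
```

Each exercise is a pair of the word list and the expected outlier, so the same structure can be passed to any scoring routine.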

Motivation

The outlier detection task features very high agreement (typically over 90%) among human annotators and can be used e.g. for the evaluation of distributional thesauri. Please read the papers referenced below for all the details.
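
As an illustration of how such an evaluation might look, the sketch below scores a set of word vectors on outlier detection exercises by picking the word with the lowest mean cosine similarity to the rest of the set. This is one common heuristic and is only an assumption here, not necessarily the procedure used in the referenced papers; the function names and the two-dimensional `TOY_VECTORS` are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect_outlier(words, vectors):
    """Return the word least similar, on average, to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)


def accuracy(exercises, vectors):
    """Fraction of exercises where the detected outlier matches the gold one."""
    correct = sum(detect_outlier(words, vectors) == gold for words, gold in exercises)
    return correct / len(exercises)


# Made-up toy vectors, purely illustrative; real vectors would come from a model.
TOY_VECTORS = {
    "swan": [0.9, 0.1],
    "duck": [0.8, 0.2],
    "goose": [0.85, 0.15],
    "plane": [0.1, 0.9],
}
toy_exercises = [(["swan", "duck", "goose", "plane"], "plane")]
score = accuracy(toy_exercises, TOY_VECTORS)
print(score)
```

Because human agreement on the task is high, a low accuracy from such a procedure points at the vectors rather than at ambiguity in the data.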

Languages

The dataset currently covers the following languages:

Czech
German
English
Estonian
French
Italian
Slovak

If you would like to collaborate with us on adding a new language, please use the contact below.

Authors

This dataset was developed by Lexical Computing, particularly by (in alphabetical order) Michal Cukr, Ondřej Herman, Miloš Jakubíček, Vojtěch Kovář, Emma Romani and Pavel Rychlý.

Contact

Please use [email protected] for any questions or requests.

License

The dataset is licensed under the CC-BY-SA 4.0 license. Attribution in any research context shall be given by properly citing the papers referenced below. We would appreciate it if you let us know about any derived work.

How to cite

Please cite:

Romani, E. (2022). Building A Multilingual Outlier Detection Dataset For The Evaluation Of Distributional Thesauri And Word Embeddings. Master's thesis, University of Pavia.

BibTeX:

@mastersthesis{hamod_thesis,
  title={Building A Multilingual Outlier Detection Dataset For The Evaluation Of Distributional Thesauri And Word Embeddings},
  author={Romani, Emma},
  school={The University of Pavia},
  year={2022}
}

Jakubíček, M., Romani, E., Rychlý, P., & Herman, O. (2021). Development of HAMOD: a High Agreement Multi-lingual Outlier Detection dataset. In RASLAN 2021 Recent Advances in Slavonic Natural Language Processing, 177-183.

BibTeX:

@inproceedings{hamod,
  title={Development of HAMOD: a High Agreement Multi-lingual Outlier Detection dataset},
  author={Jakubíček, Miloš and Romani, Emma and Rychlý, Pavel and Herman, Ondřej},
  booktitle={Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2021},
  year={2021},
  pages={177--183},
  publisher={Tribun EU}
}

Rychlý, P. (2019). Evaluation of Czech Distributional Thesauri. In RASLAN 2019 Recent Advances in Slavonic Natural Language Processing, 137-142.

BibTeX:

@inproceedings{thesaurievaluation,
  title={Evaluation of Czech Distributional Thesauri},
  author={Rychlý, Pavel},
  booktitle={Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019},
  pages={137--142},
  year={2019},
  publisher={Tribun EU}
}
