Stemmers for Ukrainian

This repository introduces a new stemmer for the Ukrainian language, tree_stem, created via machine learning. Measured by the error rate relative to truncation (ERRT) (Paice 1994), it outperforms all other stemmers available to date as well as some lemmatizers. It also has the lowest rate of understemming errors among the available stemming algorithms.

The proposed algorithm does not use dictionary lookups and stays reasonably small (48 KB of Python bytecode). It runs 24 times faster than the lemmatization approach and is also faster than the other stemming algorithms.

In addition to the new algorithm, this repository also contains Python ports of some of the previously published stemmers.

Comparison of stemmers for the Ukrainian language

Stemmer	Languages	UI	OI	ERRT
Dictionary-based (reference)	–	0.0172	3.59e-06	0.0244
tree_stem	Python	0.0907	2.71e-06	0.125
pymorphy2 (Paper)	Python	0.324	2.01e-07	0.391
stemka	C++	0.329	2.34e-06	0.447
tapkomet	Snowball, C, Java	0.447	2.73e-06	0.603
vgrichina	Groovy, Python	0.497	1.05e-06	0.651
drupal	JS, Python	0.511	7.54e-07	0.666
tochytskyi (Paper)	PHP, Python	0.623	5.72e-07	0.795
No stemming	–	1.00	1.69e-08	–

where:

UI – understemming index
OI – overstemming index
ERRT – error rate relative to truncation
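For readers unfamiliar with these metrics, the following is a minimal sketch of how they can be computed from Paice's (1994) pair-counting definitions. It is not the evaluation code used in this repository. The names `paice_indices` and `errt`, the input layout (one list of word forms per lemma), and the toy word groups are assumptions made for illustration. The truncation points passed to `errt` are assumed to be (UI, OI) pairs obtained by stemming with plain truncation at successive lengths.

```python
from collections import Counter, defaultdict


def paice_indices(groups, stem):
    """Return (UI, OI) for a stemmer over groups of related word forms.

    groups: list of lists; each inner list holds the forms of one lemma
            (hypothetical input layout).
    stem:   any callable mapping a word to its stem.
    """
    total_words = sum(len(g) for g in groups)
    desired_merges = 0.0      # pairs that should share a stem
    unachieved_merges = 0.0   # such pairs that received different stems
    desired_nonmerges = 0.0   # pairs that should not share a stem
    stem_members = defaultdict(Counter)  # stem -> {group id: word count}

    for gid, words in enumerate(groups):
        n = len(words)
        desired_merges += 0.5 * n * (n - 1)
        desired_nonmerges += 0.5 * n * (total_words - n)
        counts = Counter(stem(w) for w in words)
        unachieved_merges += 0.5 * sum(u * (n - u) for u in counts.values())
        for s, u in counts.items():
            stem_members[s][gid] += u

    wrong_merges = 0.0        # pairs from different groups sharing a stem
    for per_group in stem_members.values():
        n_s = sum(per_group.values())
        wrong_merges += 0.5 * sum(v * (n_s - v) for v in per_group.values())

    ui = unachieved_merges / desired_merges if desired_merges else 0.0
    oi = wrong_merges / desired_nonmerges if desired_nonmerges else 0.0
    return ui, oi


def errt(point, truncation_points):
    """Return |OP| / |OT|, where P is the stemmer's (UI, OI) point and T is
    where the ray from the origin through P crosses the truncation line.

    truncation_points: consecutive (UI, OI) points of the truncation line.
    """
    px, py = point
    for (ax, ay), (bx, by) in zip(truncation_points, truncation_points[1:]):
        dx, dy = bx - ax, by - ay
        denom = px * dy - py * dx
        if denom == 0:
            continue  # ray is parallel to this segment
        t = (ax * dy - ay * dx) / denom   # T = t * P
        s = (ax * py - ay * px) / denom   # position of T along the segment
        if t > 0 and 0 <= s <= 1:
            return 1.0 / t
    raise ValueError("ray through the point does not cross the truncation line")


# Toy usage with a fixed-length truncation "stemmer" (illustrative only).
toy_groups = [["книга", "книги", "книгу"], ["кнопка", "кнопки"]]
ui, oi = paice_indices(toy_groups, lambda w: w[:3])
print(ui, oi)
```

A lower ERRT means the stemmer's (UI, OI) point lies closer to the origin relative to what plain truncation achieves along the same direction.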

Notes:

pymorphy2 is a dictionary-assisted lemmatizer and morphological analyzer, included in this comparison for reference. The most probable normal form is used as the stem.
Training and testing were performed on a dictionary of word forms.

References

Paice, C. (1994). An Evaluation Method for Stemming Algorithms. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 42-50.
