Extract word-to-morpheme segmentations from the Russian Wiktionary.
- ruwiktionary dump (grab one)
- Python 3
- Internet connection capable of downloading <1 MB of data from ru.wiktionary.org
- Patience
When everything is ready, follow three simple steps:
- Extract the dump
- Run
$ python3 -m roots.main -D *path to your extracted dump*
- Wait
The whole process is divided into three steps:
- Find all declension/conjugation table templates in the dump (e.g. Шаблон:сущ ru f ina 1d), request their HTML renders from ru.wiktionary.org, and save them in plain form. This step lives in tables.py.
- Find and extract from the dump the meta-information required for extracting segmentations. This includes 'invocations' of {{{морфо}}} and of the declension/conjugation table templates. This simple step lives in meta.py.
- Using the information from the previous two steps, extract base-form segmentations and align them with derived forms. This functionality is spread across the whole roots.segmentations module.
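
To give a feel for the second and third steps, here is a minimal sketch. It is not the code from meta.py or roots.segmentations. The function names, the regular expression, the sample wikitext and the field names (прист1, корень1, суфф1, оконч) are all assumptions made for illustration, and the alignment is deliberately naive: it only swaps the ending and gives up on any stem change.

```python
import re

# Hypothetical pattern for a {{морфо|...}} template call in wikitext.
MORPHO_RE = re.compile(r"\{\{морфо\|([^{}]*)\}\}")


def parse_morpho(wikitext):
    """Return (kind, morpheme) pairs from the first {{морфо|...}} call, or None."""
    match = MORPHO_RE.search(wikitext)
    if match is None:
        return None
    morphemes = []
    for field in match.group(1).split("|"):
        key, sep, value = field.partition("=")
        value = value.strip()
        if sep and value:  # skip empty slots such as "прист1="
            morphemes.append((key.strip(), value))
    return morphemes


def align(base_morphemes, derived_form, ending_key="оконч"):
    """Carry the base segmentation over to a derived form by swapping the ending."""
    stem_parts = [m for kind, m in base_morphemes if kind != ending_key]
    stem = "".join(stem_parts)
    if not derived_form.startswith(stem):
        return None  # stem alternation or suppletion: this naive sketch gives up
    ending = derived_form[len(stem):]
    return stem_parts + ([ending] if ending else [])


# Illustrative input, not copied from a real Wiktionary page.
sample = "{{морфо|прист1=|корень1=книг|суфф1=|оконч=а}}"
base = parse_morpho(sample)
print(base)                  # [('корень1', 'книг'), ('оконч', 'а')]
print(align(base, "книги"))  # ['книг', 'и']
```

Forms whose stem differs from the base form come back as None here, which is one plausible source of the noise mentioned at the end of this document.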
All three steps can be executed independently. Refer to
$ python3 -m roots.main -h
for further information.
You can also pass the -d flag to enable debugging mode and stop on every failed extraction.
I'll probably extend it to stop on every extraction some time soon.
For now it is capable of extracting ~400k segmentations. Yes, the data is VERY noisy, but I'm working on it, I swear!