know-your-roots

Extract word-to-morpheme segmentations from the Russian Wiktionary.

Usage

Prerequisites

ruwiktionary dump (grab one)
Python 3
Internet connection capable of downloading <1 MB of data from ru.wiktionary.org
Patience

When everything is ready, follow three simple steps:

Extract the dump
$ python3 -m roots.main -D *path to your extracted dump*
Wait
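
If you prefer to do the first step from Python, here is a minimal sketch. It assumes the dump is a bz2-compressed XML file, which is how Wikimedia usually publishes them; `extract_dump` is a hypothetical helper for illustration and is not part of this repository.

```python
import bz2
import shutil
from pathlib import Path


def extract_dump(archive: Path, target: Path, chunk_size: int = 1 << 20) -> Path:
    """Decompress a .bz2 archive into `target`, streaming in chunks.

    Hypothetical helper: streaming keeps memory usage flat even when the
    dump is several gigabytes once extracted.
    """
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return target
```

Call it as `extract_dump(Path("dump.xml.bz2"), Path("dump.xml"))`, where both file names are placeholders, then pass the resulting file to -D.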

Algorithm

The whole process is divided into three steps:

Find all declension/conjugation tables in the dump (e.g. Шаблон:сущ ru f ina 1d), request their HTML renders from ru.wiktionary.org, and save them in plain form. This step lives in tables.py.
Find and extract from the dump the meta-information required for extracting segmentations. This includes 'invocations' of {{{морфо}}} and of the declension/conjugation table templates. This simple step lives in meta.py.
Using the information from the previous two steps, extract base-form segmentations and align them with derived forms. This functionality is spread across the whole roots.segmentations module.
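
To make the second and third steps more concrete, here is a rough sketch of the idea. It is not the code from meta.py or roots.segmentations. It assumes invocations look like `{{морфо|прист1=|корень1=вод|суфф1=|оконч=а}}`, with parameters listed in word order. The names `parse_morpho` and `align`, and the naive "keep the stem, replace the ending" heuristic, are illustrative assumptions only.

```python
import re

# Assumption: an invocation has the shape {{морфо|name=value|...}} with no
# nested templates inside it.
MORPHO_RE = re.compile(r"\{\{морфо\|([^{}]*)\}\}")


def parse_morpho(wikitext):
    """Return (kind, morpheme) pairs from the first {{морфо}} invocation, or None."""
    match = MORPHO_RE.search(wikitext)
    if match is None:
        return None
    morphemes = []
    for field in match.group(1).split("|"):
        name, sep, value = field.partition("=")
        value = value.strip()
        if sep and value:
            # Drop the numeric index: "корень1" -> "корень".
            morphemes.append((name.strip().rstrip("0123456789"), value))
    return morphemes


def align(morphemes, derived):
    """Project base-form morphemes onto a derived form (naive heuristic).

    Every morpheme before the ending must occur unchanged in the derived
    form; whatever is left over is treated as the new ending. Returns None
    when the stem does not match, e.g. because of a sound alternation.
    """
    aligned = []
    pos = 0
    for kind, text in morphemes:
        if kind == "оконч":
            break
        if not derived.startswith(text, pos):
            return None
        aligned.append((kind, text))
        pos += len(text)
    if pos < len(derived):
        aligned.append(("оконч", derived[pos:]))
    return aligned
```

With these definitions, parsing the invocation above and aligning it with "водой" keeps the root "вод" and treats "ой" as the ending. Stem alternations and postfixes that follow the ending are not handled by this sketch, and cases like these are one likely reason real extractions fail.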

All three steps can be executed independently. Refer to

$ python3 -m roots.main -h

for further information.

You can also pass the -d flag to enable debugging mode, which stops on every failed extraction. I'll probably extend it to stop on every extraction some time soon.

For now it is capable of extracting ~400k segmentations. Yes, the data is VERY noisy, but I'm working on it, I swear!

