versusvoid/know-your-roots

know-your-roots

Extracts labeled word-to-morpheme segmentations from the Russian Wiktionary.

Usage

Prerequisites

  • ruwiktionary dump (grab one)
  • Python 3
  • Internet connection capable of downloading <1 MB of data from ru.wiktionary.org
  • Patience

When everything is ready, follow three simple steps:

  1. Extract the dump
  2. $ python3 -m roots.main -D *path to your extracted dump*
  3. Wait
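If you would rather extract the dump from Python than with a command-line tool, the sketch below may help. It assumes the dump is a single bzip2-compressed XML file, which is how Wikimedia distributes its dumps; the function name and the file names in the usage comment are illustrative and not part of this project.

```python
import bz2
import shutil


def extract_dump(archive_path, output_path):
    """Decompress a .bz2 archive to output_path without loading it into memory."""
    with bz2.open(archive_path, "rb") as src, open(output_path, "wb") as dst:
        # copyfileobj streams the data in chunks, so large dumps are fine.
        shutil.copyfileobj(src, dst)
    return output_path


# Hypothetical usage (file names are assumptions):
# extract_dump("ruwiktionary-latest-pages-articles.xml.bz2", "ruwiktionary.xml")
```

The resulting file is what you would then pass to -D in step 2.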

Algorithm

The whole process is divided into three steps:

  1. Find all declension/conjugation table templates in the dump (e.g. Шаблон:сущ ru f ina 1d), request their HTML renders from ru.wiktionary.org, and save them in plain form. This step lives in tables.py.
  2. Find and extract from the dump the meta-information required for extracting segmentations. This includes 'invocations' of {{{морфо}}} and of the declension/conjugation table templates. This simple step lives in meta.py.
  3. Using the information from the previous two steps, extract base-form segmentations and align them with derived forms. This functionality is spread across the whole roots.segmentations module.
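To give a feel for steps 2 and 3, here is a minimal sketch. It is not the code from meta.py or roots.segmentations. It assumes a {{морфо}} invocation with named parameters such as прист1, корень1, суфф1 and оконч and no nested templates, and it uses a deliberately naive left-to-right alignment. The names parse_morpho and align are hypothetical.

```python
import re

# Matches a {{морфо|...}} invocation without nested templates
# (an assumption made to keep the sketch short).
MORPHO_RE = re.compile(r"\{\{морфо\|([^{}]*)\}\}")


def parse_morpho(wikitext):
    """Return (label, morpheme) pairs for the first {{морфо}} call, or None."""
    match = MORPHO_RE.search(wikitext)
    if match is None:
        return None
    segments = []
    for part in match.group(1).split("|"):
        name, sep, value = part.partition("=")
        name, value = name.strip(), value.strip()
        if sep and value:
            # Drop the trailing index: "корень1" -> "корень".
            segments.append((name.rstrip("0123456789"), value))
    return segments


def align(segments, derived):
    """Naively carry base-form morphemes over to a derived form.

    Morphemes are kept while they match the derived form from left to
    right; whatever is left over is labeled as the ending ("оконч").
    """
    aligned, pos = [], 0
    for label, morpheme in segments:
        if label == "оконч" or not derived.startswith(morpheme, pos):
            break
        aligned.append((label, morpheme))
        pos += len(morpheme)
    if pos < len(derived):
        aligned.append(("оконч", derived[pos:]))
    return aligned


example = "{{морфо|прист1=под|корень1=вод|суфф1=н|оконч=ый}}"
base = parse_morpho(example)
print(base)
print(align(base, "подводного"))
```

Real entries are messier (stem alternations, missing parameters, nested templates), which is one reason the extracted data ends up noisy.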

All three steps can be executed independently. Refer to

$ python3 -m roots.main -h

for further information.

You can also pass the -d flag to enable debugging mode, which stops on every failed extraction. I'll probably extend it to stop on every extraction some time soon.
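Roughly, the difference between the two modes can be pictured like this. This is only an illustrative sketch, not the project's actual control flow: run_extractions and its arguments are hypothetical, and "stopping" is modeled here by simply re-raising the error.

```python
def run_extractions(words, extract, debug=False):
    """Apply extract() to every word, collecting results and failures.

    With debug=True the first failure propagates immediately, so you can
    inspect it; otherwise failures are recorded and processing continues.
    """
    results, failed = {}, []
    for word in words:
        try:
            results[word] = extract(word)
        except Exception:
            if debug:
                raise  # debugging mode: stop at the first failed extraction
            failed.append(word)
    return results, failed
```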


For now it is capable of extracting ~400k segmentations. Yes, the data is VERY noisy, but I'm working on it, I swear!
