Redundant and miscategorized stems in apertium-kaz.kaz.lexc #11
Comments
Instead of going over the list of stems found in kaz.lexc and checking them, I decided to start with surface forms from a frequency list made out of the My logic here was that:
Still, all stems currently in apertium-kaz.kaz.lexc will have to be checked. Once I'm done with surface forms from the Little Prince (and maybe the public domain subset of kitap.kz), I'll just take the difference between the wordlist in https://github.com/taruen/apertiumpp/blob/master/data4apertium/vocabulary/kaz.rkt and the stem list in apertium-kaz.kaz.lexc as what remains to be checked. This is a reminder for myself to do that.
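A minimal sketch of that difference, in Python. It rests on several assumptions that are not confirmed anywhere in this thread: that stem entries in the lexc file look like `stem:surface ContinuationClass ; ! "gloss"` (or `stem ContinuationClass ;`), that `!` starts a comment, and that the already-checked words from kaz.rkt have been exported to a plain collection of strings by some other means. The names `lexc_stems` and `unchecked_stems` are hypothetical helpers, not part of any Apertium tool.

```python
import re

# Assumed entry shape: "stem:surface Continuation ;" or "stem Continuation ;".
# Entries whose upper side contains "%" (tags, affixes, escaped spaces) are skipped.
ENTRY_RE = re.compile(r"^\s*([^\s:!;%]+)(?::(\S*))?\s+(\S+)\s*;")


def lexc_stems(lexc_text):
    """Collect the upper-side stems of all entries matching the assumed shape."""
    stems = set()
    for line in lexc_text.splitlines():
        code = line.split("!", 1)[0]  # drop lexc comments (assumes no escaped "%!")
        match = ENTRY_RE.match(code)
        if match:
            stems.add(match.group(1))
    return stems


def unchecked_stems(lexc_text, checked_words):
    """Stems present in the lexc text but absent from the checked wordlist."""
    return sorted(lexc_stems(lexc_text) - set(checked_words))
```

Calling `unchecked_stems(open("apertium-kaz.kaz.lexc", encoding="utf-8").read(), checked)` would then list what remains to be checked, with `checked` being whatever set of words was exported from kaz.rkt.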
…tium/vocabulary/kaz.rkt
Note, a GCI student wrote a lexc parser and lexicon deduplicator a couple of years ago. Let me know if you want help digging it up.
Relevant tools: apertium/apertium-on-github#51
Turns out that the explanatory dictionary of Kazakh has been put online at kitap.kz. So the task is, at a minimum, to check the POS of apertium-kaz.kaz.lexc stems against that dictionary. However, that dictionary might be under some CC license, as other things on kitap.kz seem to be. If it is, then example sentences and explanations could be used in the apertium project too. I'll need to figure out which particular license that dictionary is published under. Also see: https://yvision.kz/post/416129
…ed', 'abbreviations', 'punctuation', 'proper' 2. sort entries alphabetically #11
…trailing whitespaces
The vocabulary of apertium-kaz.kaz.lexc requires checking for redundancy, consistency and miscategorizations. Here are some examples:

Along with that, the reasons why these are considered mistakes and, more generally, the choices made should be documented in apertium-kaz/docs so that this kind of issue doesn't happen in the future.

At that point, since the coverage of apertium-kaz is relatively high, that documentation will probably be more useful for other (Turkic) languages than for Kazakh.
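As a rough illustration of what a first automated pass for redundancy could look like, here is a Python sketch that flags stems listed more than once. It assumes entries of the form `stem:surface ContinuationClass ;` with `!` starting a comment; the function names are hypothetical, and this is not the GCI deduplicator mentioned in the comments. A stem appearing under several continuation classes is not necessarily a mistake (it may be a genuine homonym), so the output is only a list of candidates for manual review.

```python
import re
from collections import defaultdict

# Assumed entry shape: "stem:surface Continuation ;" or "stem Continuation ;".
# Entries whose upper side contains "%" (tags, affixes, escaped spaces) are skipped.
ENTRY_RE = re.compile(r"^\s*([^\s:!;%]+)(?::(\S*))?\s+(\S+)\s*;")


def stems_by_class(lexc_text):
    """Map each stem to the list of continuation classes it is listed under."""
    classes = defaultdict(list)
    for line in lexc_text.splitlines():
        match = ENTRY_RE.match(line.split("!", 1)[0])  # ignore comments
        if match:
            classes[match.group(1)].append(match.group(3))
    return classes


def redundancy_report(lexc_text):
    """Return (exact duplicates, stems listed under more than one class)."""
    exact, multi = {}, {}
    for stem, conts in stems_by_class(lexc_text).items():
        repeated = sorted({c for c in conts if conts.count(c) > 1})
        if repeated:
            exact[stem] = repeated  # same stem, same class, listed twice
        if len(set(conts)) > 1:
            multi[stem] = sorted(set(conts))  # possible miscategorization
    return exact, multi
```

The first dictionary covers plain redundancy; the second is the more interesting one for miscategorization, since each entry there has to be checked against a dictionary by hand.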