Improved Lemmatization #178
Replies: 1 comment
-
Updated to compare to the example diff, based upon 100 Wikipedia articles in English: if a word appears fewer than ~5 times, the quality of spaCy's lemma isn't always the greatest, since it can pick up context-specific readings, e.g. "bit" could mean the past tense of "bite", "a small amount", or a binary bit. Increasing the number of articles improves the chances that the lemma is predicted correctly, but it will always be a problem unless context is included (we have ruled this out for now due to installation compatibility). Proper nouns are butchered, but this is on purpose, since suppressing proper nouns increases the aggressiveness of spaCy for normal words, e.g. "Socrates" becomes "Socrate". This behaviour can be toggled to preserve proper nouns if required. However, I think that this is the better compromise! A sketch of the ambiguity is below.
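For illustration, here's a minimal sketch of the "bit" ambiguity (assuming spaCy's `en_core_web_sm` model is installed; this is not part of the script itself). With sentence context available, spaCy resolves "bit" differently, which is exactly the information a context-free frequency list throws away:

```python
# A minimal sketch, assuming spaCy's en_core_web_sm model is installed.
# It shows why a context-free lemma list is ambiguous: given sentence
# context, spaCy assigns "bit" different lemmas.
import spacy

nlp = spacy.load("en_core_web_sm")

for sentence in ["The dog bit the postman.", "Wait a bit longer."]:
    for token in nlp(sentence):
        if token.text == "bit":
            print(f"{sentence!r}: bit -> {token.lemma_}")

# Likely output (model- and version-dependent):
# 'The dog bit the postman.': bit -> bite
# 'Wait a bit longer.': bit -> bit

# Proper nouns can be detected via part-of-speech tagging, which is one
# way to implement the "preserve proper nouns" toggle mentioned above:
for token in nlp("Socrates taught Plato."):
    if token.pos_ == "PROPN":
        print(token.text)  # keep the surface form instead of the lemma
```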
-
Based upon Discord discussions with zdi, I've made a prototype for building a lemmatization list that could be used with VocabSieve.
Link to script: https://github.com/jonathanfox5/lemma_from_wiki. It should work with any language code that spaCy supports and will automatically download the Wikipedia extracts for your language.
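For anyone curious about the general shape of such a script, here's a minimal sketch of the counting step. This is not the actual lemma_from_wiki code: `articles` stands in for the downloaded Wikipedia extracts, the model name is a placeholder, and the real cleaning rules differ.

```python
# A minimal sketch of the counting step, not the actual lemma_from_wiki
# code. `articles` stands in for the downloaded Wikipedia extracts and
# the model name is a placeholder for your target language.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def count_lemmas(articles: list[str]) -> Counter:
    """Count how often spaCy assigns each (word, lemma) pair."""
    counts: Counter = Counter()
    for doc in nlp.pipe(articles):
        for token in doc:
            if token.is_alpha:  # drop punctuation, numbers, etc.
                counts[(token.text.lower(), token.lemma_.lower())] += 1
    return counts

def to_rows(counts: Counter) -> list[tuple[str, str, int]]:
    """Sort by word, then descending count, matching the CSV layout below."""
    return sorted(
        ((word, lemma, n) for (word, lemma), n in counts.items()),
        key=lambda row: (row[0], -row[2]),
    )
```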
I've uploaded some example outputs. Each has been "trained" on 10 articles only.
lemma_en_10.csv
lemma_it_10.csv
lemma_ru_10.csv
Format is `word | lemma | count`. If a word has been assigned multiple lemmas by spaCy, it will have multiple rows in the CSV. The CSV is sorted so that the first row you come across for a word contains the lemma with the highest frequency. The lemmas have been cleaned of punctuation, numbers, etc., but I haven't touched the source words yet.
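On the consuming side, here's a minimal sketch of picking the most frequent lemma per word. It assumes a plain comma-separated file with no header row and columns in the word, lemma, count order described above; check the actual files before relying on this.

```python
# A minimal sketch of consuming one of the example CSVs. Assumes a plain
# comma-separated file with no header row and columns in word, lemma,
# count order; adjust if the real files differ.
import csv

def load_lemma_table(path: str) -> dict[str, str]:
    table: dict[str, str] = {}
    with open(path, newline="", encoding="utf-8") as f:
        for word, lemma, count in csv.reader(f):
            # Rows are sorted with the highest-count lemma first, so only
            # the first row seen for each word is kept.
            table.setdefault(word, lemma)
    return table

lemmas = load_lemma_table("lemma_en_10.csv")
print(lemmas.get("bit"))  # -> the most frequent lemma assigned to "bit"
```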