Improved Lemmatization #178
Replies: 1 comment
-
Updated to compare to the example diff, based upon 100 Wikipedia articles in English: if a word appears fewer than ~5 times, the quality of spaCy's lemma isn't always the greatest, since it can pick up context-specific readings, e.g. "bit" could mean the past tense of "bite", "a small amount", or a binary bit. Increasing the number of articles improves the chances that the lemma is predicted correctly, but it will always be a problem unless context is included (we have ruled this out for now due to installation compatibility). Proper nouns are butchered, but this is on purpose, since suppressing proper nouns increases the aggressiveness of spaCy for normal words, e.g. "Socrates" becomes "Socrate". This behaviour can be toggled to preserve proper nouns if required. However, I think that this is the better compromise! A sketch of the ambiguity is below.
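For illustration, here's a minimal sketch of the "bit" ambiguity (assuming spaCy's `en_core_web_sm` model is installed; this is not part of the script itself). With sentence context available, spaCy resolves "bit" differently, which is exactly the information a context-free frequency list throws away:

```python
# A minimal sketch, assuming spaCy's en_core_web_sm model is installed.
# It shows why a context-free lemma list is ambiguous: given sentence
# context, spaCy assigns "bit" different lemmas.
import spacy

nlp = spacy.load("en_core_web_sm")

for sentence in ["The dog bit the postman.", "Wait a bit longer."]:
    for token in nlp(sentence):
        if token.text == "bit":
            print(f"{sentence!r}: bit -> {token.lemma_}")

# Likely output (model- and version-dependent):
# 'The dog bit the postman.': bit -> bite
# 'Wait a bit longer.': bit -> bit

# Proper nouns can be detected via part-of-speech tagging, which is one
# way to implement the "preserve proper nouns" toggle mentioned above:
for token in nlp("Socrates taught Plato."):
    if token.pos_ == "PROPN":
        print(token.text)  # keep the surface form instead of the lemma
```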
-
Based upon Discord discussions with zdi, I've made a prototype for building a lemmatization list that could be used with VocabSieve.
Link to script: https://github.com/jonathanfox5/lemma_from_wiki. It should work with any language code that spaCy supports and will automatically download the Wikipedia extracts for your language.
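For anyone curious about the general shape of such a script, here's a minimal sketch of the counting step. This is not the actual lemma_from_wiki code: `articles` stands in for the downloaded Wikipedia extracts, the model name is a placeholder, and the real cleaning rules differ.

```python
# A minimal sketch of the counting step, not the actual lemma_from_wiki
# code. `articles` stands in for the downloaded Wikipedia extracts and
# the model name is a placeholder for your target language.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def count_lemmas(articles: list[str]) -> Counter:
    """Count how often spaCy assigns each (word, lemma) pair."""
    counts: Counter = Counter()
    for doc in nlp.pipe(articles):
        for token in doc:
            if token.is_alpha:  # drop punctuation, numbers, etc.
                counts[(token.text.lower(), token.lemma_.lower())] += 1
    return counts

def to_rows(counts: Counter) -> list[tuple[str, str, int]]:
    """Sort by word, then descending count, matching the CSV layout below."""
    return sorted(
        ((word, lemma, n) for (word, lemma), n in counts.items()),
        key=lambda row: (row[0], -row[2]),
    )
```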
I've uploaded some example outputs. Each has been "trained" on 10 articles only.
lemma_en_10.csv
lemma_it_10.csv
lemma_ru_10.csv
Format is `word | lemma | count`. If a word has been assigned multiple lemmas by spaCy, it will have multiple rows in the CSV. The CSV is sorted so that the first row you come across for a word contains the lemma with the highest frequency. The lemmas have been cleaned of punctuation, numbers, etc., but I haven't touched the source words yet.
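On the consuming side, here's a minimal sketch of picking the most frequent lemma per word. It assumes a plain comma-separated file with no header row and columns in the word, lemma, count order described above; check the actual files before relying on this.

```python
# A minimal sketch of consuming one of the example CSVs. Assumes a plain
# comma-separated file with no header row and columns in word, lemma,
# count order; adjust if the real files differ.
import csv

def load_lemma_table(path: str) -> dict[str, str]:
    table: dict[str, str] = {}
    with open(path, newline="", encoding="utf-8") as f:
        for word, lemma, count in csv.reader(f):
            # Rows are sorted with the highest-count lemma first, so only
            # the first row seen for each word is kept.
            table.setdefault(word, lemma)
    return table

lemmas = load_lemma_table("lemma_en_10.csv")
print(lemmas.get("bit"))  # -> the most frequent lemma assigned to "bit"
```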