Lemmatization #1414
Replies: 6 comments
-
Needed for #89
-
For version 2, it's probably OK to have a word cloud that does not distinguish "run" and "runs", just to have some overview of word frequencies in place.
-
We might make use of Elasticsearch's analyzers to achieve this.
-
We currently support stemming, but not lemmatisation. It looks like Elasticsearch token filters (https://www.elastic.co/guide/en/elasticsearch/reference/8.5/analysis-tokenfilters.html) do not support lemmatisation, but we could keep an eye out for other solutions.
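For reference, a stemming setup of this kind could look roughly like the sketch below. It only builds the settings and mapping as plain Python dicts; the names `content_stemmer` and `content_stemmed` and the `stemmed` subfield are made up for illustration and are not taken from our codebase. The `stemmer` token filter and its `language` parameter are standard Elasticsearch.

```python
import json


def stemming_analysis_settings(language="english"):
    """Build index settings with a custom analyzer that lowercases and stems tokens."""
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name; "stemmer" is the built-in filter type.
                "content_stemmer": {"type": "stemmer", "language": language},
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "content_stemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "content_stemmer"],
                },
            },
        }
    }


def stemmed_field_mapping(analyzer="content_stemmed"):
    """Map a text field with an extra subfield that is analyzed with the stemmer."""
    return {
        "type": "text",
        "fields": {"stemmed": {"type": "text", "analyzer": analyzer}},
    }


settings = stemming_analysis_settings()
mapping = stemmed_field_mapping()
print(json.dumps({"settings": settings, "mapping": mapping}, indent=2))
```

Since the same analyzer is applied to the query at search time, stemmed queries match stemmed text. There is no equivalent built-in filter type to swap in for lemmatisation.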
-
I'd vote for spaCy; it could even be done as part of the NER pipeline. I'm just not sure how easy it is to integrate this into Elasticsearch...
-
Well, you could always index a field with the already-lemmatised tokens. The bigger issue may be figuring out what a user-friendly interface for searching lemmatised text would look like. With stemming, we also stem the query, so a stemmed query is matched against stemmed text. The main advantage of lemmatisation over stemming is that it distinguishes parts of speech, but this means you cannot straightforwardly apply it to a query consisting of a few loose words, which carry no part-of-speech context.
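To make the problem concrete, here is a minimal sketch with a toy lemma table standing in for a real lemmatiser. Everything in it is hypothetical (`TOY_LEMMAS`, the field names `content` and `content_lemmas`, the helper functions); it only illustrates that the lemma depends on the part of speech, which the indexed text has and a loose query word does not.

```python
# Toy lemma table keyed on (token, part of speech).
# A real lemmatiser would replace this; the entries are illustrative only.
TOY_LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("runs", "VERB"): "run",
    ("runs", "NOUN"): "run",
}


def lemmatise(tagged_tokens):
    """Map (token, pos) pairs to lemmas, falling back to the lowercased token."""
    return [TOY_LEMMAS.get((tok.lower(), pos), tok.lower()) for tok, pos in tagged_tokens]


def build_document(tagged_tokens):
    """Build a document with the original text and a parallel field of lemmas."""
    return {
        "content": " ".join(tok for tok, _ in tagged_tokens),
        "content_lemmas": " ".join(lemmatise(tagged_tokens)),
    }


def candidate_query_lemmas(token):
    """A loose query word has no POS tag, so collect every lemma it could have."""
    return sorted({lemma for (tok, _), lemma in TOY_LEMMAS.items() if tok == token.lower()})


doc = build_document([("She", "PRON"), ("saw", "VERB"), ("the", "DET"), ("saw", "NOUN")])
print(doc["content_lemmas"])          # she see the saw
print(candidate_query_lemmas("saw"))  # ['saw', 'see']
```

One option this suggests is to search the lemma field for all candidate lemmas of a query word, but that gives up part of the precision that lemmatisation was meant to add.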
-
For the word cloud etc., this was already implemented in Texcavator.