French stemmer doesn't work using FTS? #11957
Comments
Hi, we use the snowballstemmer to stem, which supports a bunch of different languages. In the dictionary you find, for example, aim, which matches what the Python package returns for aimer:
import snowballstemmer
stemmer = snowballstemmer.stemmer('french')
stemmer.stemWords(['aimer'])
# ['aim']
In DuckDB:
select stem('aimer', 'french') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ aim │
└─────────┘
Stemming works by reducing words to their base form, so that slight variations of a word yield the same stem, which makes them easier to search:
D select stem('bicycle', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl │
└─────────┘
D select stem('bicycles', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl │
└─────────┘
So I think there is some confusion here, because DuckDB's stemmer does exactly this.
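To make the point above concrete, here is a minimal sketch of why mapping variants to one stem helps search. It uses a deliberately naive suffix stripper (`toy_stem`, a hypothetical helper, not the Snowball algorithm and not DuckDB's implementation) and a tiny inverted index. Indexed words and query words pass through the same function, so the stem only has to be consistent, not a real dictionary word.

```python
from collections import defaultdict

# Toy suffix stripper for illustration only; this is NOT the Snowball algorithm.
SUFFIXES = ("es", "s", "e")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each stem to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Stem the query word the same way the documents were stemmed."""
    return index.get(toy_stem(query), set())


docs = {1: "a red bicycle", 2: "two bicycles", 3: "a blue car"}
index = build_index(docs)

# Both spellings collapse to the same stem, so either query finds both documents.
print(toy_stem("bicycle"), toy_stem("bicycles"))
print(search(index, "bicycles"))
```

Stems such as aim or bicycl are therefore expected in the index dictionary: they are internal keys, not dictionary forms such as aimer.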
CREATE TABLE knowledge AS SELECT * FROM 'knowledge.json';
PRAGMA create_fts_index('knowledge', 'id', 'content', stemmer = 'french');
SELECT fts_main_knowledge.tokenize('je m''appelle laurens') tokens;
┌───────────────────────────┐
│ tokens │
│ varchar[] │
├───────────────────────────┤
│ [je, m, appelle, laurens] │
└───────────────────────────┘
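The tokens above are split and lowercased but not stemmed. As a rough stdlib-only approximation of that tokenization step, the sketch below assumes the FTS defaults as I understand them from the documentation (lowercasing, accent stripping, and treating every run of characters outside a-z as a separator). The function names are hypothetical and this is not DuckDB's actual code.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Lowercase, strip accents, and keep only runs of the letters a-z."""
    cleaned = strip_accents(text.lower())
    # Assumption: anything outside a-z acts as a separator, so the
    # apostrophe in "m'appelle" splits the word into two tokens.
    return re.findall(r"[a-z]+", cleaned)


tokens = tokenize("je m'appelle laurens")
print(tokens)
```

Under these assumptions the sketch yields the same four tokens as the query above. Stemming is a separate step applied to each token afterwards.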
Thanks, I was indeed misunderstanding stemming!
EDIT. @lnkuiper: Maybe a last question. But how to get stems?
@charnould, When you use the
What happens?
First, thanks for making DuckDB!
I might be missing something, but the French stemmer doesn't seem to work...
To Reproduce
Considering these data in French inside a knowledge.json:
...and this script:
After running the script, the dict table (in fts_main_knowledge) contains nothing that looks like French stems: aim, chien, inutil, mot, phras, loi, pai, football. It should be approximately (considering some stopwords): aimer, chien, phrase, mot, inutile, dire, comment, payer, loyer, football...
Am I missing something, or is it a bug?
Bonus questions linked to FTS:
OS:
MBP Apple M1 Silicon
DuckDB Version:
Latest
DuckDB Client:
Any
Full Name:
Charles-Henri Arnould
Affiliation:
None linked to the present issue
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?