French stemmer doesn't work using FTS? #11957

charnould · 2024-05-06T15:44:39Z

What happens?

First, thanks for making DuckDB!
I might be missing something, but the French stemmer doesn't seem to work...

To Reproduce

Consider this French data in a knowledge.json file:

[
    { "id": 1, "content": "J'aime beaucoup les chiens" },
    { "id": 2, "content": "je suis une phrase avec beaucoup de mots inutiles pour ne rien dire" },
    { "id": 3, "content": "Comment payer mon loyer" },
    { "id": 4, "content": "Qu'est ce que le football" }
]

...and this script:

CREATE TABLE knowledge AS SELECT * FROM 'knowledge.json';
PRAGMA create_fts_index('knowledge', 'id', 'content', stemmer = 'french');
FORCE CHECKPOINT;
FROM fts_main_knowledge.dict;

After running the script, the dict table (in fts_main_knowledge) contains entries that don't look like French stems to me: aim, chien, inutil, mot, phras, loi, pai, football. I expected something closer to (allowing for some stopwords): aimer, chien, phrase, mot, inutile, dire, comment, payer, loyer, football...

Am I missing something, or is it a bug?

Bonus questions linked to FTS:

When I run a query, do I need to stem my own query first?
Is there a way to retrieve the full stemmed text (e.g. I have a record "He likes dogs", its stems are "like dog", and I'd like to get them back to use somewhere else)?

OS:

MBP Apple M1 Silicon

DuckDB Version:

Latest

DuckDB Client:

Any

Full Name:

Charles-Henri Arnould

Affiliation:

None linked to the present issue

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Yes, I have


lnkuiper · 2024-05-08T14:03:10Z

Hi, we use the snowballstemmer to stem, which supports a bunch of different languages. In the dictionary you will find, for example, aim rather than aimer, because aim is the stemmed version of aimer. Our stemmer has the same behavior as the snowballstemmer in Python:

import snowballstemmer
stemmer = snowballstemmer.stemmer('french')
stemmer.stemWords(['aimer'])
# ['aim']

In DuckDB:

select stem('aimer', 'french') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ aim     │
└─────────┘

Stemming works by reducing words to their base, so that slight changes to words yield the same word, which makes them easier to search:

D select stem('bicycle', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl  │
└─────────┘
D select stem('bicycles', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl  │
└─────────┘

So I think there is some confusion here, because DuckDB's stemmer does exactly this.
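To illustrate why stems such as bicycl or aim are useful even though they are not dictionary words, here is a minimal Python sketch. It assumes a toy suffix-stripping function (`toy_stem`, a hypothetical stand-in that is not the Snowball algorithm and not DuckDB's implementation) and a small in-memory index. It only shows the principle: the indexed text and the query are reduced the same way, so different word forms meet at the same key.

```python
from __future__ import annotations


def toy_stem(word: str) -> str:
    """Toy suffix stripper for illustration only; NOT the Snowball algorithm."""
    word = word.lower()
    for suffix in ("es", "er", "e", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each stem to the ids of the documents that contain it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Stem the query terms the same way as the indexed text, then look them up."""
    hits: set[int] = set()
    for token in query.split():
        hits |= index.get(toy_stem(token), set())
    return hits


# Hypothetical sample documents, not taken from the issue.
docs = {1: "red bicycles", 2: "a bicycle shop", 3: "payer mon loyer"}
index = build_index(docs)

print(toy_stem("bicycle"), toy_stem("bicycles"))  # both forms share one stem
print(sorted(search(index, "bicycle")))           # ids of documents 1 and 2
```

The stem is only a lookup key, so it does not need to be a readable word. What matters is that every form of a word maps to the same key on both the indexing side and the query side.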

You don't need to stem your own query, just use the fts_main_knowledge.match_bm25 macro as explained in the docs.
You can use our tokenize function to stem an entire sentence:

CREATE TABLE knowledge AS SELECT * FROM 'knowledge.json';
PRAGMA create_fts_index('knowledge', 'id', 'content', stemmer = 'french');
SELECT fts_main_knowledge.tokenize('je m''appelle laurens') tokens;
┌───────────────────────────┐
│          tokens           │
│         varchar[]         │
├───────────────────────────┤
│ [je, m, appelle, laurens] │
└───────────────────────────┘

charnould · 2024-05-11T07:10:13Z

Thanks, I was indeed misunderstanding stemming!

EDIT. @lnkuiper : Maybe a last question.

SELECT fts_main_knowledge.tokenize('J''aime beaucoup les chiens') returns the tokens: [ j, aime, beaucoup, les, chiens ].

But how do I get the stems?
SELECT fts_main_knowledge.stem('J aime beaucoup les chiens', 'french') does not work.
Thanks again.

lnkuiper · 2024-05-13T07:22:36Z

@charnould, when you use the tokenize function, you get a list of stemmed words. The stem function only works on individual words, not on sentences.
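Because a per-word stemmer cannot take a whole sentence, one way to get stems for a sentence is to split it into tokens first and stem each token. This matters if the token list still shows unstemmed forms, as in the [ j, aime, beaucoup, les, chiens ] output quoted above. The Python sketch below shows that pattern under stated assumptions: `tokenize` here is a hypothetical approximation (lowercase, strip accents, split on anything outside a-z), not DuckDB's macro, and `lookup_stem` is a stand-in table holding only two stems reported in this thread.

```python
import re
import unicodedata
from typing import Callable, List


def tokenize(sentence: str) -> List[str]:
    """Lowercase, strip accents, and split on anything that is not a-z.

    This is an assumed approximation for illustration, not DuckDB's tokenizer.
    """
    decomposed = unicodedata.normalize("NFKD", sentence.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[a-z]+", no_accents)


def stem_sentence(sentence: str, stem_word: Callable[[str], str]) -> List[str]:
    """Apply a per-word stemmer to every token of a sentence."""
    return [stem_word(token) for token in tokenize(sentence)]


# Hypothetical stand-in for a real per-word stemmer: it only knows the two
# stems mentioned in the thread and returns every other word unchanged.
KNOWN_STEMS = {"aime": "aim", "chiens": "chien"}


def lookup_stem(word: str) -> str:
    return KNOWN_STEMS.get(word, word)


print(tokenize("J'aime beaucoup les chiens"))
print(stem_sentence("J'aime beaucoup les chiens", lookup_stem))
```

With the snowballstemmer package shown earlier in the thread, `stem_word` could be something like `lambda w: stemmer.stemWords([w])[0]`, which reuses the `stemWords` call from that snippet.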

charnould added the needs triage label May 6, 2024

szarnyasg added the reproduced label May 7, 2024

duckdblabs-bot removed the needs triage label May 7, 2024

charnould closed this as completed May 11, 2024
