French stemmer doesn't work using FTS? #11957

Closed
2 tasks done
charnould opened this issue May 6, 2024 · 3 comments


charnould commented May 6, 2024

What happens?

First, thanks for making DuckDB!
I might be missing something, but the French stemmer doesn't seem to work.

To Reproduce

Given this French data in a file named knowledge.json:

[
    { "id": 1, "content": "J'aime beaucoup les chiens" },
    { "id": 2, "content": "je suis une phrase avec beaucoup de mots inutiles pour ne rien dire" },
    { "id": 3, "content": "Comment payer mon loyer" },
    { "id": 4, "content": "Qu'est ce que le football" }
]

...and this script:

CREATE TABLE knowledge AS SELECT * FROM 'knowledge.json';
PRAGMA create_fts_index('knowledge', 'id', 'content', stemmer = 'french');
FORCE CHECKPOINT;
FROM fts_main_knowledge.dict;

After running the script, the dict table (in fts_main_knowledge) contains entries that don't look like French stems to me: aim, chien, inutil, mot, phras, loi, pai, football. I expected approximately the following (allowing for some stopwords): aimer, chien, phrase, mot, inutile, dire, comment, payer, loyer, football...

Am I missing something, or is it a bug?

Bonus questions linked to FTS:

  1. When I run a query, do I first need to stem the query myself?
  2. Is there a way to retrieve the stemmed form of a whole record (e.g. I have a record "He likes dogs", its stemmed form is "like dog", and I'd like to get that back to use it somewhere else)?

OS:

MBP Apple M1 Silicon

DuckDB Version:

Latest

DuckDB Client:

Any

Full Name:

Charles-Henri Arnould

Affiliation:

None linked to the present issue

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

lnkuiper commented May 8, 2024

Hi, we use the Snowball stemmer (snowballstemmer), which supports many languages. The dictionary contains, for example, aim rather than aimer, because aim is the stem of aimer. Our stemmer behaves the same as snowballstemmer in Python:

import snowballstemmer
stemmer = snowballstemmer.stemmer('french')
stemmer.stemWords(['aimer'])
# ['aim']

In DuckDB:

select stem('aimer', 'french') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ aim     │
└─────────┘

Stemming reduces words to a common base form, so that slight variations of a word map to the same stem, which makes them easier to search:

D select stem('bicycle', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl  │
└─────────┘
D select stem('bicycles', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl  │
└─────────┘

So I think there is some confusion here, because DuckDB's stemmer does exactly this.
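
A minimal sketch of why a truncated stem such as aim or bicycl is still useful: indexed words and query words are reduced by the same function, so they meet at the same key. The lookup table below is only a stand-in for a real stemmer and contains nothing but word/stem pairs quoted in this thread. The names `STEMS`, `stem_word`, `build_index`, and `search` are hypothetical and are not DuckDB or snowballstemmer APIs.

```python
# Stand-in for a real stemmer: only word/stem pairs quoted in this thread.
STEMS = {"aimer": "aim", "bicycle": "bicycl", "bicycles": "bicycl"}


def stem_word(word):
    """Return the stem from the toy table, or the word unchanged if unknown."""
    return STEMS.get(word, word)


def build_index(docs):
    """Map each stem to the set of document ids whose text contains it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(stem_word(word), set()).add(doc_id)
    return index


def search(index, term):
    """Stem the query term the same way as the indexed words, then look it up."""
    return index.get(stem_word(term.lower()), set())


docs = {1: "two bicycles", 2: "one bicycle", 3: "a football"}
index = build_index(docs)
print(sorted(search(index, "bicycle")))  # ids of documents sharing the stem
print(sorted(index))                     # keys are stems, not dictionary words
```

Because the query goes through the same reduction as the indexed text, the caller never has to stem anything by hand in this sketch.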

  1. You don't need to stem your own query, just use the fts_main_knowledge.match_bm25 macro as explained in the docs.
  2. You can use our tokenize function to stem an entire sentence:
CREATE TABLE knowledge AS SELECT * FROM 'knowledge.json';
PRAGMA create_fts_index('knowledge', 'id', 'content', stemmer = 'french');
SELECT fts_main_knowledge.tokenize('je m''appelle laurens') tokens;
┌───────────────────────────┐
│          tokens           │
│         varchar[]         │
├───────────────────────────┤
│ [je, m, appelle, laurens] │
└───────────────────────────┘
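
To make the token list above easier to interpret, here is a rough plain-Python approximation of the splitting step. It assumes the index was created with lowercasing and accent stripping enabled and with an ignore pattern that treats anything other than a-z as a separator. These settings are assumptions, and the helpers `strip_accents` and `tokenize` below are hypothetical stand-ins, not the extension's implementation.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after Unicode decomposition, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Lowercase, strip accents, and split on every run of non a-z characters."""
    cleaned = re.sub(r"[^a-z]+", " ", strip_accents(text).lower())
    return cleaned.split()


print(tokenize("je m'appelle laurens"))
print(tokenize("J'aime beaucoup les chiens"))
```

In this sketch the apostrophe acts as a separator, which is why m and appelle come out as two separate tokens.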


charnould commented May 11, 2024

Thanks, I was indeed misunderstanding stemming!

EDIT: @lnkuiper, maybe one last question.

SELECT fts_main_knowledge.tokenize('J''aime beaucoup les chiens') returns the tokens [ j, aime, beaucoup, les, chiens ].

But how do I get the stems?
SELECT fts_main_knowledge.stem('J aime beaucoup les chiens', 'french') does not work.
Thanks again.

lnkuiper commented

@charnould, when you use the tokenize function, you get a list of stemmed words. The stem function only works on individual words, not sentences.
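
Because stem handles one word at a time, a sentence has to be split into tokens first and the stemmer then applied to each token. The token lists printed earlier in this thread ([je, m, appelle, laurens] and [j, aime, beaucoup, les, chiens]) look unstemmed, so an explicit per-token step may be what is needed. The sketch below shows that composition in Python. `stem_tokens` is a hypothetical helper, and `toy_stem` is a stand-in that holds only stems listed in the original report, not a real French stemmer.

```python
def stem_tokens(tokens, stem_word):
    """Apply a single-word stemmer to each token, preserving order."""
    return [stem_word(token) for token in tokens]


# Stand-in stemmer: stems taken from the dict contents reported in this thread.
# Any other word is returned unchanged, which a real stemmer would not guarantee.
TOY_STEMS = {"aime": "aim", "chiens": "chien"}


def toy_stem(word):
    return TOY_STEMS.get(word, word)


tokens = ["j", "aime", "beaucoup", "les", "chiens"]
print(stem_tokens(tokens, toy_stem))
```

With the snowballstemmer package shown earlier, stemmer.stemWords(tokens) performs the same per-word mapping. In DuckDB, the same idea could presumably be expressed by mapping stem(x, 'french') over the token list with list_transform. That expression is an assumption and does not come from this thread.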
