How to tokenize correctly for BM25 search? #1752
What happens?

I've read the docs, but it's still unclear to me how to search for a real-life question with BM25 (bag of words, not exact phrase matching). I'm testing on a WordPress forum dataset; here are some examples.

Input question:

How should I turn this question into a BM25 search object? The straightforward

SELECT id, doctext FROM paradedbbm25.search( query => paradedb.parse('Add\ filename\ to\ attachment\ page\ url'), limit_rows => 5 );

will not find documents with words

So I need to tokenize the question, but I ran into issues with apostrophe escaping. I wrote a simple split-on-whitespace tokenizer that escapes the special characters listed at https://docs.paradedb.com/documentation/full-text/term#special-characters (note: the apostrophe ' is not on the list), but

SELECT id, doctext FROM paradedbbm25.search( 'doctext:(Custom Menu in Admin doesn''t change menu in browser)', limit_rows => 5 );

results in ParseError/SyntaxError, and

SELECT id, doctext FROM paradedbbm25.search( 'doctext:(Custom Menu in Admin doesn\'t change menu in browser)', limit_rows => 5 );

results in syntax error at or near "t" (in doesn't).

How should I tokenize a question containing an apostrophe so it can be used with BM25 search?

How could I use the same tokenizer that was used in paradedb.create_bm25()? If the question is tokenized with a different method than create_bm25(), there is a risk of missing relevant words in the bag-of-words model and losing accuracy.

To Reproduce
SELECT id, doctext FROM paradedbbm25.search( 'doctext:(Custom Menu in Admin doesn''t change menu in browser)', limit_rows => 5 );

results in ParseError/SyntaxError, and

SELECT id, doctext FROM paradedbbm25.search( 'doctext:(Custom Menu in Admin doesn\'t change menu in browser)', limit_rows => 5 );

results in syntax error at or near "t" (in doesn't).

OS: Ubuntu LTS in Colab
ParadeDB Version: releases/download/v0.10.2/postgresql-16-pg-search_0.10.2-1PARADEDB-jammy_amd64.deb
Are you using ParadeDB Docker, Helm, or the extension(s) standalone? ParadeDB pg_search Extension
Full Name: András Jankovics
Affiliation: András Jankovics
Did you include all relevant data sets for reproducing the issue? Yes
Did you include the code required to reproduce the issue?
Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?
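For reference, here is a minimal Python sketch of the split-on-whitespace tokenizer described above. It builds the `doctext:(...)` query string and hands it to the database as a bound parameter, so the apostrophe never has to be escaped inside a SQL literal. This only removes the SQL quoting layer: with PostgreSQL's default `standard_conforming_strings = on`, `\'` does not escape a quote in an ordinary string literal, which is why the second statement fails at "t". Whether ParadeDB's own query parser then accepts the apostrophe is a separate question that this sketch does not answer. The `SPECIAL_CHARS` set, the helper names, and the commented-out `cursor.execute` call are assumptions for illustration; check the character list against the linked docs page.

```python
# Assumption: characters treated as special in ParadeDB query strings.
# This set is illustrative and may be incomplete; the apostrophe is
# deliberately absent, matching the note in the question.
SPECIAL_CHARS = set('+^`:{}"[]()~!\\*')


def escape_token(token: str) -> str:
    """Backslash-escape every special character inside one token."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in token)


def build_query(field: str, question: str) -> str:
    """Split on whitespace and build a bag-of-words query: field:(t1 t2 ...)."""
    tokens = [escape_token(t) for t in question.split()]
    return f"{field}:({' '.join(tokens)})"


# The query string is passed as a parameter (%s), not pasted into the SQL text,
# so no SQL-level quoting of the apostrophe is needed.
SQL = "SELECT id, doctext FROM paradedbbm25.search(%s, limit_rows => 5);"

query = build_query(
    "doctext", "Custom Menu in Admin doesn't change menu in browser"
)

# Hypothetical usage with a DB-API driver such as psycopg:
# cursor.execute(SQL, (query,))
```

If the parameterized call still returns a ParseError, the problem lies in the query parser rather than in SQL quoting, which would be useful to note when reporting it.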
Replies: 1 comment 1 reply
I've converted this into issue #1759 since I suspect it is a bug; the discussion will continue there.