Search is case sensitive for non-ASCII messages #5052

link2xt · 2023-11-27T19:39:46Z

In SQLite, search with LIKE is case-insensitive:

Line 933 in b779d08

AND txt LIKE ?

This, however, does not work for non-English letters. Using functions like UPPER does not help either, as they only handle ASCII, and applying a function to the txt column would prevent index use anyway.
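To illustrate the difference, here is a minimal Rust sketch that contrasts ASCII-only case folding with full Unicode lowercasing. It does not call SQLite at all: `contains_ascii_ci` only imitates the ASCII-only folding described above, and both function names are hypothetical.

```rust
/// Case-insensitive substring check that folds ASCII letters only.
/// This imitates (but is not) the folding behaviour of SQLite's built-in LIKE.
fn contains_ascii_ci(haystack: &str, needle: &str) -> bool {
    haystack
        .to_ascii_lowercase()
        .contains(&needle.to_ascii_lowercase())
}

/// The same check using full Unicode lowercasing from the standard library.
fn contains_unicode_ci(haystack: &str, needle: &str) -> bool {
    haystack.to_lowercase().contains(&needle.to_lowercase())
}
```

With these helpers, `contains_ascii_ci("Hello World", "hello")` is true, while `contains_ascii_ci("Привет", "привет")` is false because the Cyrillic capital is left untouched. `contains_unicode_ci("Привет", "привет")` is true.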

It is a well-known problem:
https://shallowdepth.online/posts/2022/01/5-ways-to-implement-case-insensitive-search-in-sqlite-with-full-unicode-support/

The solution is to create a new column txt_normalized defaulting to NULL (so we do not bump up the database size in a migration) and storing lowercased/normalized text there when the row is created. When doing a search, search over IFNULL(txt_normalized, txt).
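The fallback logic can be sketched in plain Rust without a database. Everything here is an assumption for illustration: the `Msg` struct stands in for a row of the messages table, `normalize` is assumed to be simple Unicode lowercasing, and `matches` mimics `IFNULL(txt_normalized, txt) LIKE '%query%'` rather than executing it.

```rust
/// Hypothetical in-memory stand-in for one row of the messages table.
struct Msg {
    txt: String,
    /// `None` plays the role of SQL NULL (e.g. rows written before the migration).
    txt_normalized: Option<String>,
}

/// Assumed normalization: plain Unicode lowercasing.
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

impl Msg {
    /// A row created after the migration: the normalized text is stored too.
    fn new(txt: &str) -> Self {
        Msg {
            txt: txt.to_string(),
            txt_normalized: Some(normalize(txt)),
        }
    }

    /// A row created before the migration: `txt_normalized` stays NULL.
    fn legacy(txt: &str) -> Self {
        Msg {
            txt: txt.to_string(),
            txt_normalized: None,
        }
    }

    /// Mimics `IFNULL(txt_normalized, txt) LIKE '%query%'` with a normalized query.
    fn matches(&self, query: &str) -> bool {
        // IFNULL: fall back to the raw text when no normalized text is stored.
        let haystack = self.txt_normalized.as_deref().unwrap_or(self.txt.as_str());
        // ASCII folding is kept on both sides to imitate LIKE's ASCII behaviour.
        haystack
            .to_ascii_lowercase()
            .contains(&normalize(query).to_ascii_lowercase())
    }
}
```

In this sketch, old rows keep today's behaviour (ASCII-only case-insensitivity), while new rows match regardless of case: `Msg::new("Привет, мир").matches("ПРИВЕТ")` is true, but `Msg::legacy("Привет").matches("привет")` is false.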


r10s · 2023-11-27T20:08:26Z

That would double the space needed for stored text, which may be okay compared to what blobs need.

Another argument for having an extra field for searching is that we could add other things there, such as names of files or webxdc apps. Currently you cannot search for those. Appending this information to txt_normalized would make it searchable without degrading performance.

That said, there are also the FULLTEXT and ICU extensions, which may be alternatives (the linked post describes ICU as complicated, though I am not sure that also applies to Rust). As for FULLTEXT, it is still astonishing how far and how fast we can go without it.

link2xt · 2023-11-27T20:27:22Z

> ICU is mentioned in the linked post to be complicated, however, not sure if that also applied to rust

It is complicated: we currently just use vendored SQLCipher, and managing our own version of SQLite with SQLCipher and additional extensions would make compiling the core more difficult for everyone. There is an issue in the rusqlite repo: rusqlite/rusqlite#531

link2xt · 2023-12-17T12:36:57Z

This migration passes by the way:

+    if dbversion < 106 {
+        sql.execute_migration(
+            r"CREATE VIRTUAL TABLE search USING fts5(text)",
+            106
+        ).await?;
+    }

So we can use https://www.sqlite.org/fts5.html

iequidoo · 2024-03-04T18:52:53Z

Tried FTS5, but this approach has its own problems. We would have to use MATCH instead of LIKE to perform a case-insensitive search, but then we lose the ability to match against parts of words (if we use the unicode61 tokeniser). We could use the trigram tokeniser for substring matching, but MATCH would still be needed for case-insensitive search (this is not stated explicitly in the docs, but I failed to get LIKE working). That means we would have to escape the search pattern correctly (with LIKE we just pass e.g. %pattern%). Also, queries of fewer than 3 chars would not work:

> Substrings consisting of fewer than 3 unicode characters do not match any rows when used with a full-text query.
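As a rough sketch of what "escaping the search pattern" for MATCH could look like: FTS5 query strings can be written as double-quoted strings, with embedded double quotes doubled. The helper below is hypothetical (the name `fts5_phrase` and the constant are not from the codebase), and it counts Unicode scalar values as an approximation of the "3 unicode characters" limit quoted above.

```rust
/// Assumed minimum query length for a trigram-tokenised full-text query.
const MIN_TRIGRAM_CHARS: usize = 3;

/// Turns raw user input into a double-quoted FTS5 string.
/// Returns `None` when the input is too short to match anything
/// with the trigram tokeniser.
fn fts5_phrase(query: &str) -> Option<String> {
    // Count Unicode scalar values, not bytes.
    if query.chars().count() < MIN_TRIGRAM_CHARS {
        return None;
    }
    // Escape embedded double quotes by doubling them, then wrap in quotes.
    Some(format!("\"{}\"", query.replace('"', "\"\"")))
}
```

The resulting string would be bound as the parameter of `MATCH ?`. The `None` case shows why short queries are a UX regression compared to LIKE: a caller would need some other code path for one- and two-character searches.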

I think the simplest way that also doesn't break the current UX is what @link2xt suggested initially:

> The solution is to create a new column txt_normalized defaulting to NULL (so we do not bump up the database size in a migration) and storing lowercased/normalized text there when the row is created. When doing a search, search over IFNULL(txt_normalized, txt).

Moreover, if the normalised text doesn't differ from the source text, or the text contains only ASCII, we can skip storing it, thus reducing the db size.
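A minimal sketch of that space-saving rule, assuming normalization is plain Unicode lowercasing (the function name `normalized_for_storage` is hypothetical):

```rust
/// Returns the value to store in `txt_normalized`, or `None` (SQL NULL)
/// when storing it would add nothing over the original text.
fn normalized_for_storage(text: &str) -> Option<String> {
    // ASCII-only text is already matched case-insensitively by LIKE.
    if text.is_ascii() {
        return None;
    }
    let normalized = text.to_lowercase();
    // Already-normalized text can be searched through the IFNULL fallback.
    if normalized == text {
        None
    } else {
        Some(normalized)
    }
}
```

Here `normalized_for_storage("Hello")` and `normalized_for_storage("привет")` both return `None`, while `normalized_for_storage("Привет")` returns `Some("привет")`. This relies on the search query itself being normalized before it is passed to LIKE.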

EDIT: Note that matching against parts of words is critical for CJK languages; I don't know if there are any tokenisers for them.

SQLite search with `LIKE` is case-insensitive only for ASCII chars. To make it case-insensitive for all messages, create a new column `msgs.txt_normalized` defaulting to `NULL` (so we do not bump up the database size in a migration) and storing lowercased/normalized text there when the row is created/updated. When doing a search, search over `IFNULL(txt_normalized, txt)`.

link2xt added the enhancement New feature or request label Nov 27, 2023

iequidoo self-assigned this Mar 4, 2024

iequidoo added a commit that referenced this issue Mar 4, 2024

feat: Try SQLite FTS5 (#5052)

15df579

iequidoo mentioned this issue Mar 5, 2024

feat: Case-insensitive search for non-ASCII messages (#5052) #5321

Merged

iequidoo closed this as completed in #5321 Jun 17, 2024

iequidoo mentioned this issue Jul 30, 2024

Case sensitivity considered when searching for words/terms that include German umlauts #5816

Closed
