
SymSpell.from(words, {maxDistance: 3, verbosity: 2}) taking a long time to load a large word dictionary #210

Open
kabyanil opened this issue Nov 22, 2023 · 3 comments

@kabyanil

I am loading a dictionary of 163,196 words into SymSpell. It takes more than a minute to load, and if I load more words than that, the Tauri app crashes. Is there any way to load the words once, generate the precalculated delete combinations, and save the resulting dictionary for later use? The precalculation step seems redundant to me, since it runs every time the app is launched.
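
For context, here is roughly the startup path I am timing. A simplified sketch: the placeholder word array and the mnemonist require path are just for illustration; in the real app the ~163k words come out of SQLite.

```js
// Simplified reproduction of the slow startup path.
const SymSpell = require('mnemonist/symspell');

// Placeholder: in the real app this array holds ~163,000 words read from SQLite.
const words = ['example', 'dictionary', 'entries'];

console.time('symspell indexation');
// Precomputing the delete combinations here is what takes over a minute
// with the full dictionary and maxDistance: 3.
const index = SymSpell.from(words, { maxDistance: 3, verbosity: 2 });
console.timeEnd('symspell indexation');
```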

Please share your thoughts on this. Thanks.

@Yomguithereal
Owner

Hello @kabyanil, would decreasing maxDistance be a legit option for you? If I recall correctly, the space/time complexity of symspell indexation is combinatorial, and the algorithm was not particularly geared towards larger distances. You can serialize the deletion combinations etc., but that would increase the size of the data to send by several orders of magnitude, which would nullify the benefit of not having to do the computation. In that sense, the raw dictionary is already the most concise version of the index that you can send over the wire. You might also want to try the Passjoin index, which may be less memory-intensive, albeit slower in some scenarios, but once again, a max distance of 3 is already quite high.
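
To make the trade-off concrete, here is a rough sketch of how I would measure it (only the `SymSpell.from` call from your snippet and the mnemonist require path are assumed; the timing loop is just illustrative):

```js
const SymSpell = require('mnemonist/symspell');

// Substitute your ~163,000 dictionary words here.
const words = [];

// Index the same word list at increasing maxDistance and compare timings:
// the number of precomputed deletion variants grows combinatorially with
// the distance, so 1 or 2 should be dramatically cheaper than 3.
for (const maxDistance of [1, 2, 3]) {
  console.time(`symspell maxDistance=${maxDistance}`);
  SymSpell.from(words, { maxDistance, verbosity: 2 });
  console.timeEnd(`symspell maxDistance=${maxDistance}`);
}
```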

Also, and this might not be intuitive, you really want to benchmark against a linear search over the dataset, because in some scenarios (typically very long strings) it might be faster.
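
Something along these lines, for instance (plain sketch with a naive Levenshtein implementation, just to get a baseline number against your word list):

```js
// Baseline to compare against: a plain linear scan computing Levenshtein
// distance against every word. For long strings or rare queries this can
// be surprisingly competitive with a precomputed index.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let previousDiagonal = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const current = row[j];
      row[j] = Math.min(
        row[j] + 1,     // deletion
        row[j - 1] + 1, // insertion
        previousDiagonal + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      previousDiagonal = current;
    }
  }
  return row[b.length];
}

function linearSearch(words, query, maxDistance) {
  return words.filter(word => levenshtein(word, query) <= maxDistance);
}

// Run this over the full word list to get the baseline timing:
console.time('linear scan');
linearSearch(['example', 'dictionary', 'entries'], 'exampel', 3);
console.timeEnd('linear scan');
```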

@kabyanil
Author

  1. How do I serialize the deletion combinations?

  2. I am using SQLite to load the dictionary locally, so nothing is sent over the wire. Does that open up any possibility of faster indexation?

  3. Do you think recurrent neural networks can outperform traditional algorithms in spell checking?

@Yomguithereal
Owner

How do I serialize the deletion combinations?

It is not a good idea to serialize the deletion combinations. It will not make the index faster to instantiate, because computing the deletion combinations is not costly per se. It would be the same as iterating over already generated ones.
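
To illustrate: generating the deletion combinations of a word is just a small breadth-first expansion (rough sketch, not the library's actual code), so reading precomputed variants back from disk would not be meaningfully cheaper than recomputing them:

```js
// All strings obtainable from `word` by deleting up to `maxDistance`
// characters. Generating them is cheap; the expensive part of indexation
// is inserting every variant of every word into a map.
function deletionVariants(word, maxDistance) {
  const variants = new Set([word]);
  let frontier = [word];
  for (let d = 0; d < maxDistance; d++) {
    const next = [];
    for (const w of frontier) {
      for (let i = 0; i < w.length; i++) {
        const shorter = w.slice(0, i) + w.slice(i + 1);
        if (!variants.has(shorter)) {
          variants.add(shorter);
          next.push(shorter);
        }
      }
    }
    frontier = next;
  }
  return variants;
}

deletionVariants('example', 1).size; // 8: the word itself plus 7 single deletions
```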

I am using SQLite to load the dictionary locally, so there are no transactions over the wire. Does that enable any possibilities of faster indexation?

I don't think so. If I remember correctly, the costly thing here is basically that you need to index a whole lot of strings in a hashmap, which takes time. All Levenshtein-based indices basically need to insert some subspace of possible combinations into a map, and this effectively translates time complexity into space complexity. But once again, a maxDistance of 3 is very ambitious for most of those indices, which mainly target a maxDistance of 1-2, because you are fighting against combinatorics.
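
Back-of-the-envelope, counting only the upper bound of deletion variants per word (a sum of binomial coefficients, ignoring duplicates caused by repeated letters):

```js
// Upper bound on distinct deletion variants of a word of length n
// at maximum edit distance d: C(n, 0) + C(n, 1) + ... + C(n, d).
function binomial(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - i + 1)) / i;
  return result;
}

function variantUpperBound(n, d) {
  let total = 0;
  for (let k = 0; k <= d; k++) total += binomial(n, k);
  return total;
}

variantUpperBound(10, 1); //  11 map entries per 10-letter word
variantUpperBound(10, 2); //  56
variantUpperBound(10, 3); // 176 -> multiplied by ~163,000 words
```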

Do you think recurrent neural networks can outperform traditional algorithms in spell checking?

No idea, but I feel this is probably not the case. Have you checked how hunspell and other spell checkers deal with this problem? They usually rely on clever heuristics that sidestep the need for exact Levenshtein distance computation.
