
SymSpell.from(words, {maxDistance: 3, verbosity: 2}) taking a long time to load a large word dictionary #210

Open
kabyanil opened this issue Nov 22, 2023 · 3 comments

@kabyanil

I am loading a dictionary of 163,196 words into SymSpell. It takes more than a minute to load, and if I load more words than that, the Tauri app crashes. Is there any way to load the words once, generate the precalculated delete combinations, and save the resulting dictionary for later use? The precalculation step seems redundant to me, since it runs every time the app is launched.
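
For context, here is roughly the startup path I am timing. A simplified sketch: the placeholder word array and the mnemonist require path are just for illustration; in the real app the ~163k words come out of SQLite.

```js
// Simplified reproduction of the slow startup path.
const SymSpell = require('mnemonist/symspell');

// Placeholder: in the real app this array holds ~163,000 words read from SQLite.
const words = ['example', 'dictionary', 'entries'];

console.time('symspell indexation');
// Precomputing the delete combinations here is what takes over a minute
// with the full dictionary and maxDistance: 3.
const index = SymSpell.from(words, { maxDistance: 3, verbosity: 2 });
console.timeEnd('symspell indexation');
```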

Please share your thoughts on this. Thanks.

@Yomguithereal
Owner

Hello @kabyanil, would decreasing maxDistance be a legit option for you? If I recall correctly, the space/time complexity of symspell indexation is combinatorial, and the algorithm was not particularly geared towards larger distances. You can serialize the deletion combinations etc., but that would increase the size of the data to send by several orders of magnitude, which would nullify the benefit of not having to do the computation. In that sense, the raw dictionary is already the most concise version of the index that you can send over the wire. You might also want to try the Passjoin index, which may be less memory-intensive, albeit slower in some scenarios, but once again, a max distance of 3 is already quite high.
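
To make the trade-off concrete, here is a rough sketch of how I would measure it (only the `SymSpell.from` call from your snippet and the mnemonist require path are assumed; the timing loop is just illustrative):

```js
const SymSpell = require('mnemonist/symspell');

// Substitute your ~163,000 dictionary words here.
const words = [];

// Index the same word list at increasing maxDistance and compare timings:
// the number of precomputed deletion variants grows combinatorially with
// the distance, so 1 or 2 should be dramatically cheaper than 3.
for (const maxDistance of [1, 2, 3]) {
  console.time(`symspell maxDistance=${maxDistance}`);
  SymSpell.from(words, { maxDistance, verbosity: 2 });
  console.timeEnd(`symspell maxDistance=${maxDistance}`);
}
```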

Also, and this might not be intuitive, you really want to benchmark against a linear search over the dataset, because in some scenarios (typically very long strings) it might be faster.
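
Something along these lines, for instance (plain sketch with a naive Levenshtein implementation, just to get a baseline number against your word list):

```js
// Baseline to compare against: a plain linear scan computing Levenshtein
// distance against every word. For long strings or rare queries this can
// be surprisingly competitive with a precomputed index.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let previousDiagonal = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const current = row[j];
      row[j] = Math.min(
        row[j] + 1,     // deletion
        row[j - 1] + 1, // insertion
        previousDiagonal + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      previousDiagonal = current;
    }
  }
  return row[b.length];
}

function linearSearch(words, query, maxDistance) {
  return words.filter(word => levenshtein(word, query) <= maxDistance);
}

// Run this over the full word list to get the baseline timing:
console.time('linear scan');
linearSearch(['example', 'dictionary', 'entries'], 'exampel', 3);
console.timeEnd('linear scan');
```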

@kabyanil
Author

  1. How do I serialize the deletion combinations?

  2. I am using SQLite to load the dictionary locally, so nothing is sent over the wire. Does that open up any possibility of faster indexation?

  3. Do you think recurrent neural networks can outperform traditional algorithms in spell checking?

@Yomguithereal
Owner

How do I serialize the deletion combinations?

It is not a good idea to serialize the deletion combinations. It will not make the index faster to instantiate, because computing the deletion combinations is not costly per se. It would be the same as iterating over already generated ones.
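
To illustrate: generating the deletion combinations of a word is just a small breadth-first expansion (rough sketch, not the library's actual code), so reading precomputed variants back from disk would not be meaningfully cheaper than recomputing them:

```js
// All strings obtainable from `word` by deleting up to `maxDistance`
// characters. Generating them is cheap; the expensive part of indexation
// is inserting every variant of every word into a map.
function deletionVariants(word, maxDistance) {
  const variants = new Set([word]);
  let frontier = [word];
  for (let d = 0; d < maxDistance; d++) {
    const next = [];
    for (const w of frontier) {
      for (let i = 0; i < w.length; i++) {
        const shorter = w.slice(0, i) + w.slice(i + 1);
        if (!variants.has(shorter)) {
          variants.add(shorter);
          next.push(shorter);
        }
      }
    }
    frontier = next;
  }
  return variants;
}

deletionVariants('example', 1).size; // 8: the word itself plus 7 single deletions
```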

I am using SQLite to load the dictionary locally, so there are no transactions over the wire. Does that enable any possibilities of faster indexation?

I don't think so. If I remember correctly, the costly thing here is basically that you need to index a whole lot of strings in a hashmap, which takes time. All Levenshtein-based indices basically need to insert some subspace of possible combinations into a map, and this effectively translates time complexity into space complexity. But once again, a maxDistance of 3 is very ambitious for most of those indices, which mainly target a maxDistance of 1-2, because you are fighting against combinatorics.
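
Back-of-the-envelope, counting only the upper bound of deletion variants per word (a sum of binomial coefficients, ignoring duplicates caused by repeated letters):

```js
// Upper bound on distinct deletion variants of a word of length n
// at maximum edit distance d: C(n, 0) + C(n, 1) + ... + C(n, d).
function binomial(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - i + 1)) / i;
  return result;
}

function variantUpperBound(n, d) {
  let total = 0;
  for (let k = 0; k <= d; k++) total += binomial(n, k);
  return total;
}

variantUpperBound(10, 1); //  11 map entries per 10-letter word
variantUpperBound(10, 2); //  56
variantUpperBound(10, 3); // 176 -> multiplied by ~163,000 words
```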

Do you think recurrent neural networks can outperform traditional algorithms in spell checking?

No idea, but I feel this is probably not the case. Have you checked how hunspell and other spell checkers deal with this problem? They usually rely on clever heuristics that sidestep the need for exact Levenshtein distance computation.
