partial_ratio freezes on extremely long strings #430

Open
alexisdrakopoulos opened this issue Feb 5, 2025 · 3 comments
@alexisdrakopoulos

I was running partial_ratio on millions of strings and saw my program was randomly freezing. Upon further investigation I saw that if I cut off my strings at 5000 characters it no longer froze. No matter how long I left the program running, the freeze persisted.

Not only does it freeze, it also doesn't respond to interrupt commands. Here is a simple example:

from rapidfuzz import fuzz

test_str = "hello" * 10000
# test_str = test_str[:5000]
fuzz.partial_ratio(test_str, test_str)
print("hi")

As written, this snippet will freeze; uncomment the commented-out line to avoid the freeze.

@maxbachmann
Member

Problem

The foundational problem is how this algorithm is "supposed" to find the optimal alignment. It searches for a substring of the longer sequence that is as similar as possible to the shorter string. This substring has to be either as long as the shorter string, or shorter if it's placed at the start / end of the longer string. So it's essentially a sliding window of the shorter string over the longer string.
Since each comparison is O(N^2), the full algorithm has a runtime of O(N^3). For short needles (<64 characters) this is pretty irrelevant, since the comparison can run on 64 characters at the same time and is extremely fast for those. However, it becomes quite a bit slower for longer strings.
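
To illustrate what this alignment search boils down to, here is a rough, purely conceptual sketch in Python (the real implementation is in C++ and uses the optimizations below; the function name is made up):

from rapidfuzz import fuzz

# Conceptual sketch: slide the shorter string over the longer one, including the
# truncated windows hanging off the start and end, and keep the best score.
# Each fuzz.ratio call is O(N^2) and the loop runs O(N) times, hence O(N^3).
def naive_partial_ratio(s1: str, s2: str) -> float:
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    best = 0.0
    for start in range(-len(shorter) + 1, len(longer)):
        window = longer[max(start, 0):start + len(shorter)]
        best = max(best, fuzz.ratio(shorter, window))
    return best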
I do have a couple of optimizations so we don't have to evaluate every position of the sliding window:

  1. if one of the strings is shorter, I calculate all alignments within the string using some kind of binary search: I always skip some alignments and then check whether any of the alignments in between could have a better similarity

  2. for the sequences placed at the start, I skip the sequence if its last character isn't part of the longer string

  3. for the sequences placed at the end, I skip the sequence if its first character isn't part of the longer string (a simplified sketch of these last two skips follows below)
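
To make 2) and 3) concrete, here is a simplified illustration of the character membership test they rely on (the helper names are made up for illustration; the real checks live in the C++ implementation):

longer_chars = set("hello" * 10000)  # only {'h', 'e', 'l', 'o'}

# an alignment hanging off the start only needs to be evaluated if the last
# character of the truncated needle occurs anywhere in the longer string
def skip_start_alignment(truncated_needle: str) -> bool:
    return truncated_needle[-1] not in longer_chars

# symmetrically, alignments hanging off the end check the first character
def skip_end_alignment(truncated_needle: str) -> bool:
    return truncated_needle[0] not in longer_chars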

Now in your example these break apart completely:

  1. isn't run, since the strings have the same length
  2. takes an insanely long time to calculate the similarity for all alignments, starting with useless ones like a 1-char string. It can't skip any, since there are no unique characters
  3. finally does the == check and exits

Possible Improvements

I am not aware of any way to improve the foundational nature of the algorithm. However there are a couple of things I could improve:

  • for 2), start out with the longer needles and not the shorter ones
  • calculate 3) before 2); this way, cases with two identical strings exit a lot faster
  • skip whole sections based on the score_cutoff and string lengths. I already do this for each individual comparison, but it should be faster to completely skip the sections (a rough sketch of such a length-based bound follows below)
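
For the last point, a rough sketch of the kind of length-based bound this could use, assuming fuzz.ratio's normalized Indel scoring (the helper name and the example numbers are made up):

# fuzz.ratio is a normalized Indel similarity: (1 - dist / (len1 + len2)) * 100,
# and the Indel distance is at least |len1 - len2|, so the lengths alone bound
# the achievable score. Whole sections of short edge windows could be skipped
# whenever that bound is already below score_cutoff.
def length_score_bound(len1: int, len2: int) -> float:
    if len1 + len2 == 0:
        return 100.0
    return 100.0 * (2 * min(len1, len2)) / (len1 + len2)

# e.g. a 100-character window against a 50000-character needle can never score
# above length_score_bound(100, 50000) ≈ 0.4, so with any sensible score_cutoff
# all of those edge alignments could be skipped at once.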

Implementing these improvements should be pretty straightforward. I will give this a try this evening.
One thing you can do to help the algorithm skip more alignments. This really just decreases runtime by a percentage and doesn't solve the problem of the algorithm being O(N^3).

Freezing

The reason it doesn't respond to interrupt commands is that it's purely running in a C++ library with the GIL released. Since Python captures these signals, I would need to acquire the GIL from time to time and react to them. For cdist I actually do this occasionally. However, even there I have to wait for the individual similarity function to return. This would require handling inside the individual scorers, but that's problematic since those are used as a standalone C++ library.
In addition, this could add a significant performance overhead for shorter strings.
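
As a caller-side workaround (not a fix in the library), the comparison can be pushed into a separate process, so the main interpreter keeps handling signals and can enforce a timeout. A minimal sketch, with an arbitrarily chosen timeout:

import multiprocessing as mp

from rapidfuzz import fuzz

def _scorer(queue, s1, s2):
    queue.put(fuzz.partial_ratio(s1, s2))

# run partial_ratio in a child process; the C++ loop itself still can't be
# interrupted, but the child process can be terminated from the outside
def partial_ratio_with_timeout(s1, s2, timeout=10.0):
    queue = mp.Queue()
    proc = mp.Process(target=_scorer, args=(queue, s1, s2))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return None  # caller decides how to treat the stuck comparison
    return queue.get()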

@alexisdrakopoulos
Author

alexisdrakopoulos commented Feb 6, 2025 via email

@maxbachmann
Member

One thing you can do to help the algorithm skip more alignments. This really just decreases runtime by a percentage and doesn't solve the problem of the algorithm being O(N^3).

Oops, forgot to write what you can do 😓 If you know you are only interested in matches above a certain similarity, you can specify the score_cutoff parameter.
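
For example, if only near-matches matter, something like this lets the algorithm discard most alignments early (the cutoff value here is just an example):

from rapidfuzz import fuzz

needle = "hello world"
haystack = "some very long text ... hello world ... more text"

# alignments that can no longer reach the cutoff are skipped early; if no
# alignment reaches it, 0 is returned instead of the best (but useless) score
score = fuzz.partial_ratio(needle, haystack, score_cutoff=90)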
