partial_ratio freezes on extremely long strings #430
That's very fair. I'm sure my use case is an edge case.
Really excellent reply from you. For now I'm just limiting strings to 1000
characters. Maybe I should indeed run an equality check first.
On Thu, Feb 6, 2025, 12:23 Max Bachmann wrote:
Problem
The foundational problem is how this algorithm is "supposed" to find the
optimal alignment. It's supposed to search for a substring in the longer
sequence, that is as similar as possible to the shorter string. This
substring has to be either as long as the shorter string, or shorter if
it's placed at start / end of the longer string. So a sliding window of the
shorter string on the longer string.
Since each comparison is O(N^2), the full algorithm has a runtime of
O(N^3). For short needles (<64 characters) this is pretty irrelevant,
since the comparison can run on 64 characters at the same time and is
extremely fast for those. However, it becomes quite a bit slower for longer
strings.
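As a rough illustration, the sliding-window alignment described above can be sketched in pure Python with difflib (a naive stand-in, not the rapidfuzz implementation, and it ignores the shorter windows hanging off the ends of the longer string):

```python
from difflib import SequenceMatcher

def naive_partial_ratio(s1: str, s2: str) -> float:
    """Naive sliding window: compare the shorter string against every
    same-length window of the longer string and keep the best score."""
    needle, haystack = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    n = len(needle)
    best = 0.0
    # one O(N^2) comparison per window position -> O(N^3) overall
    for start in range(len(haystack) - n + 1):
        window = haystack[start:start + n]
        best = max(best, SequenceMatcher(None, needle, window).ratio())
        if best == 1.0:  # perfect match, nothing better is possible
            break
    return best * 100
```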
I do have a couple of optimizations so we don't have to run on each
position of a sliding window.
1. If one of the strings is shorter, I calculate all alignments within the string using some kind of binary search. I always skip some alignments and then check whether any of the alignments in between could have a better similarity.
2. For the sequences placed at the start, I skip the sequence if the last character isn't part of the longer string.
3. For the sequences placed at the end, I skip the sequence if the first character isn't part of the longer string.
Now in your example these break apart completely:
1. isn't run, since the strings have the same length
2. takes an insanely long time to calculate the similarity for all alignments, starting with useless ones like a 1-char string. It can't skip any, since there are no unique characters.
3. finally does the == check and exits
Possible Improvements
I am not aware of any way to improve the foundational nature of the
algorithm. However there are a couple of things I could improve:
- for 2), start out with the longer needles and not the shorter ones
- calculate 3) before 2); this way cases where the two strings are identical exit a lot faster
- skip whole sections based on the score_cutoff and string lengths. I already do this for each individual comparison, but it should be faster to completely skip the sections.
Implementing these improvements should be pretty straightforward. I will
give this a try this evening.
There is one thing you can do to help the algorithm skip more alignments. This
really just decreases runtime by a percentage and doesn't solve the problem
of the algorithm being O(N^3).
Freezing
The reason it doesn't respond to interrupt commands is that it's purely
running in a C++ library with the GIL released. Since Python captures these
signals, I would need to acquire the GIL from time to time and react to
them. For cdist I actually do this from time to time. However,
even there I have to wait for the individual similarity function to return.
This would require handling inside the individual scorers, but that's
problematic since those are used as a standalone C++ library.
In addition, this could have a huge performance overhead for shorter
strings.
Oops, forgot to write what you can do 😓 If you know you are only interested in matches above a certain similarity, you can specify the …
I was running partial_ratio on millions of strings and noticed my program was randomly freezing. On further investigation I found that if I cut my strings off at 5000 characters, it no longer froze. No matter how long I left the program running, the freeze persisted.
Not only did it freeze, it also didn't respond to interrupt commands. Here is a simple example:
uncomment the line to unfreeze. Note this will freeze.