partial_ratio freezes on extremely long strings #430
That's very fair. I'm sure my use case is an edge case.
Really excellent reply from you. For now I'm just limiting strings to 1000
characters. Maybe I should indeed run an equality check first.
On Thu, Feb 6, 2025, 12:23 Max Bachmann wrote:
Problem
The foundational problem is how this algorithm is "supposed" to find the
optimal alignment. It's supposed to search for a substring in the longer
sequence, that is as similar as possible to the shorter string. This
substring has to be either as long as the shorter string, or shorter if
it's placed at start / end of the longer string. So a sliding window of the
shorter string on the longer string.
Since each comparison is O(N^2), the full algorithm has a runtime of
O(N^3). For short needles (<64 characters) this is pretty irrelevant,
since the comparison can run on 64 characters at the same time and is
extremely fast for those. However, it becomes quite a bit slower for longer
strings.
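As a rough illustration, the sliding-window alignment described above can be sketched in pure Python with difflib (a naive stand-in, not the rapidfuzz implementation, and it ignores the shorter windows hanging off the ends of the longer string):

```python
from difflib import SequenceMatcher

def naive_partial_ratio(s1: str, s2: str) -> float:
    """Naive sliding window: compare the shorter string against every
    same-length window of the longer string and keep the best score."""
    needle, haystack = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    n = len(needle)
    best = 0.0
    # one O(N^2) comparison per window position -> O(N^3) overall
    for start in range(len(haystack) - n + 1):
        window = haystack[start:start + n]
        best = max(best, SequenceMatcher(None, needle, window).ratio())
        if best == 1.0:  # perfect match, nothing better is possible
            break
    return best * 100
```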
I do have a couple of optimizations so we don't have to run on each
position of a sliding window.
1. If one of the strings is shorter, I calculate all alignments within the string using some kind of binary search. I always skip some alignments and then check whether any of the alignments in between could have a better similarity.
2. For the sequences placed at the start, I skip the sequence if the last character isn't part of the longer string.
3. For the sequences placed at the end, I skip the sequence if the first character isn't part of the longer string.
Now in your example these break apart completely:
1. isn't run, since the strings have the same length
2. takes an insanely long time to calculate the similarity for all alignments, starting with useless ones like a 1-char string. It can't skip any, since there are no unique characters.
3. finally does the == check and exits
Possible Improvements
I am not aware of any way to improve the foundational nature of the
algorithm. However there are a couple of things I could improve:
- for 2), start out with the longer needles and not the shorter ones
- calculate 3) before 2); this way cases where the two strings are identical exit a lot faster
- skip whole sections based on the score_cutoff and string lengths. I already do this for each individual comparison, but it should be faster to completely skip the sections.
Implementing these improvements should be pretty straightforward. I will
give this a try this evening.
There is one thing you can do to help the algorithm skip more alignments. This
really just decreases runtime by a percentage and doesn't solve the problem
of the algorithm being O(N^3).
Freezing
The reason it doesn't respond to interrupt commands is that it's purely
running in a C++ library with the GIL released. Since Python captures these
signals, I would need to acquire the GIL from time to time and react to
them. For cdist I actually do this from time to time. However,
even there I have to wait for the individual similarity function to return.
This would require handling inside the individual scorers, but that's
problematic since those are used as a standalone C++ library.
In addition, this could have a huge performance overhead for shorter
strings.
Oops, forgot to write what you can do 😓 If you know you are only interested in matches above a certain similarity, you can specify the …
I was running partial_ratio on millions of strings and noticed my program was randomly freezing. On further investigation I found that if I cut my strings off at 5000 characters, it no longer froze. No matter how long I left the program running, the freeze persisted.
Not only did it freeze, it also didn't respond to interrupt commands. Here is a simple example:
uncomment the line to unfreeze. Note this will freeze.