It would be very useful to have the option of using all CPUs via a workers=-1 argument in extractOne. I feel this should be possible since cdist already supports it, and I personally find extractOne a lot simpler to use.
My use case is to match a large dataset of about 100k records against a lookup table with over 5 million entries. This is currently proving to be very slow, with my machine only managing about 2k records per hour.
The implementation of extractOne is somewhat different from cdist due to the ability to exit early in the case of a perfect match.
So in pseudo code it's implemented like this:
query = preprocess(query)
for choice in choices:
    choice = preprocess(choice)
    scorer(query, choice)
    if perfectScore:
        exit
To make it possible to multithread this we have two options:
1. acquire/release the GIL on each iteration.
2. preprocess ahead of time:
   query = preprocess(query)
   choices = [preprocess(choice) for choice in choices]
   for choice in choices:
       scorer(query, choice)
       if perfectScore:
           exit
I wasn't really happy with either of the two variants:
1. has to acquire/release the GIL pretty often, but allows the short-circuiting to skip preprocessing.
2. doesn't allow preprocessing to be short-circuited. Using SIMD would be possible, but it might not be worth sorting the string list first.
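The trade-off between the two variants can be illustrated with a small pure-Python sketch. This is not RapidFuzz's actual implementation: `lazy_extract_one`, `eager_extract_one`, and the counting preprocessor are hypothetical names, and `difflib.SequenceMatcher` stands in for a real scorer. It only counts how many strings get preprocessed before a perfect match ends the search.

```python
from difflib import SequenceMatcher

PERFECT_SCORE = 100.0


def ratio(a, b):
    """Stand-in scorer on a 0-100 scale (an assumption, not RapidFuzz's scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def make_counting_preprocessor():
    """Return a preprocessor plus a counter of how often it was called."""
    calls = {"n": 0}

    def preprocess(s):
        calls["n"] += 1
        return s.strip().lower()

    return preprocess, calls


def lazy_extract_one(query, choices, preprocess):
    """Variant 1: preprocess inside the loop, so an early exit skips the rest."""
    query = preprocess(query)
    best = None
    for index, choice in enumerate(choices):
        score = ratio(query, preprocess(choice))
        if best is None or score > best[1]:
            best = (choice, score, index)
        if score == PERFECT_SCORE:
            break  # remaining choices are never preprocessed
    return best


def eager_extract_one(query, choices, preprocess):
    """Variant 2: preprocess everything up front; only scoring is short-circuited."""
    query = preprocess(query)
    processed = [preprocess(choice) for choice in choices]
    best = None
    for index, choice in enumerate(processed):
        score = ratio(query, choice)
        if best is None or score > best[1]:
            best = (choices[index], score, index)
        if score == PERFECT_SCORE:
            break
    return best


choices = ["Apple ", "BANANA", "cherry", "date", "elderberry"]

lazy_pre, lazy_calls = make_counting_preprocessor()
lazy_result = lazy_extract_one("banana", choices, lazy_pre)

eager_pre, eager_calls = make_counting_preprocessor()
eager_result = eager_extract_one("banana", choices, eager_pre)
```

Both variants return the same match, but the lazy one stops preprocessing at the perfect match while the eager one always pays for the whole list. The eager form is the one whose scoring loop could run without holding the GIL, since no Python preprocessing callback is needed inside it.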
As for multi-threading, it would probably make the most sense to run several extractOne calls in parallel instead of parallelizing inside extractOne. You could already do this using multiprocessing. Which similarity function are you using? Do you specify a preprocessing function?
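A minimal sketch of that suggestion, assuming one extractOne call per query spread over a `multiprocessing.Pool`. The helper names (`match_all`, `_init_worker`, `_match_one`), the choice of `fuzz.ratio` as scorer, and the chunk size are illustrative assumptions, not part of RapidFuzz. The `difflib` fallback is there only so the sketch still runs where RapidFuzz is not installed.

```python
from multiprocessing import Pool

try:
    from rapidfuzz import fuzz, process

    def _extract_one(query, choices):
        # Real RapidFuzz call; returns (choice, score, index) or None.
        return process.extractOne(query, choices, scorer=fuzz.ratio)

except ImportError:
    from difflib import SequenceMatcher

    def _extract_one(query, choices):
        # Stand-in with the same return shape, only for running without RapidFuzz.
        scored = (
            (choice, 100.0 * SequenceMatcher(None, query, choice).ratio(), index)
            for index, choice in enumerate(choices)
        )
        return max(scored, key=lambda item: item[1], default=None)


_CHOICES = None  # per-process reference to the lookup table


def _init_worker(choices):
    """Store the lookup table once per worker instead of pickling it per task."""
    global _CHOICES
    _CHOICES = choices


def _match_one(query):
    """Find the best match for a single query in this worker's lookup table."""
    return _extract_one(query, _CHOICES)


def match_all(queries, choices, workers=None, chunksize=256):
    """Run one extractOne call per query, spread over `workers` processes."""
    if workers == 1:
        # Serial path, handy for debugging and for comparing timings.
        _init_worker(choices)
        return [_match_one(query) for query in queries]
    with Pool(processes=workers, initializer=_init_worker, initargs=(choices,)) as pool:
        return pool.map(_match_one, queries, chunksize=chunksize)
```

Called as `match_all(records, lookup_table)` from under an `if __name__ == "__main__":` guard, this uses one process per available CPU when `workers` is left as `None`. Each worker may end up holding its own copy of the lookup table, so with millions of entries memory use is worth checking before raising the worker count.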
The parallelism mentioned in this issue is what I was hoping to implement. However, it’s a significant task, and I haven’t had much time to dedicate to it recently. I’d still love to see it happen at some point, but considering the limited number of users who would truly benefit from it, it hasn't been a top priority.