Faster performance with extractOne #429

VaradDaniel · 2025-02-04T22:11:15Z

It would be very useful to have the option of using all CPUs with a workers=-1 option in extractOne. I feel this should be possible since cdist already supports it, and I personally find extractOne a lot simpler to use.

My use case is to match a large dataset with about 100k records against a lookup table with over 5 million entries. This is currently proving to be very slow, with my machine only managing about 2k records per hour.

maxbachmann · 2025-02-06T12:28:33Z

The implementation of extractOne is somewhat different from cdist due to the ability to exit early in the case of a perfect match.

So in pseudo code it's implemented like this:

query = preprocess(query)
for choice in choices:
    choice = preprocess(choice)
    score = scorer(query, choice)
    if score is perfect:
        exit

To make it possible to multithread this we have two options:

1. Acquire/release the GIL on each iteration.
2. Preprocess ahead of time:

query = preprocess(query)
choices = [preprocess(choice) for choice in choices]
for choice in choices:
    score = scorer(query, choice)
    if score is perfect:
        exit

I wasn't really happy with either of the two variants:

1. Has to acquire/release the GIL pretty often, but allows the short circuiting to skip preprocessing.
2. Doesn't allow preprocessing to be short-circuited. Using SIMD would be possible, but it might not be worth having to sort the string list first.
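
To make the trade-off between the two variants concrete, here is a minimal Python sketch that counts how often preprocessing runs in each. It is not how extractOne is actually implemented: `preprocess`, `score`, `extract_one_lazy` and `extract_one_eager` are hypothetical names, and the scorer is a stand-in built on the standard library's difflib rather than a RapidFuzz scorer.

```python
from difflib import SequenceMatcher


def preprocess(text, counter):
    """Hypothetical preprocessing step (trim + lowercase) that counts its calls."""
    counter["calls"] += 1
    return text.strip().lower()


def score(a, b):
    """Stand-in scorer in the range 0..100 (difflib, not a RapidFuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_one_lazy(query, choices, counter):
    """Variant 1: preprocess each choice inside the loop, stop on a perfect score."""
    query = preprocess(query, counter)
    best = None
    for index, choice in enumerate(choices):
        s = score(query, preprocess(choice, counter))
        if best is None or s > best[1]:
            best = (choice, s, index)
        if s == 100.0:
            break  # perfect match: the remaining choices are never preprocessed
    return best


def extract_one_eager(query, choices, counter):
    """Variant 2: preprocess every choice up front, then score."""
    query = preprocess(query, counter)
    processed = [preprocess(choice, counter) for choice in choices]
    best = None
    for index, candidate in enumerate(processed):
        s = score(query, candidate)
        if best is None or s > best[1]:
            best = (choices[index], s, index)
        if s == 100.0:
            break  # scoring stops early, but preprocessing was already paid for
    return best


choices = ["  Apple ", "BANANA", "cherry", "date", "elderberry"]
lazy_counter = {"calls": 0}
eager_counter = {"calls": 0}
lazy_result = extract_one_lazy("banana", choices, lazy_counter)
eager_result = extract_one_eager("banana", choices, eager_counter)
print(lazy_result, lazy_counter["calls"])
print(eager_result, eager_counter["calls"])
```

Both variants return the same match, but the lazy one stops preprocessing as soon as the perfect match is found, while the eager one always preprocesses the whole list. The eager form is the one that could be handed to worker threads without touching Python objects inside the loop.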

As for multi-threading, it would probably make the most sense to run multiple extractOne calls in parallel instead of parallelizing inside extractOne. You can already do this using multiprocessing. What's the similarity function you are using? Do you specify a preprocessing function?
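
A rough sketch of what running one lookup per query across processes could look like, using only the standard library. Everything here is an assumption for illustration: `best_match` is a difflib-based stand-in for a single extractOne call, and `match_all`, `_init_worker` and `_match_in_worker` are hypothetical helper names. With RapidFuzz installed, `best_match` would presumably be swapped for `process.extractOne` with whatever scorer and processor are in use.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher

_CHOICES = []  # per-process copy of the lookup table, set by _init_worker


def _init_worker(choices):
    """Store the lookup table once per worker instead of sending it with every task."""
    global _CHOICES
    _CHOICES = choices


def best_match(query, choices):
    """Stand-in for one extractOne call; returns (choice, score, index) or None."""
    best = None
    for index, choice in enumerate(choices):
        s = 100.0 * SequenceMatcher(None, query, choice).ratio()
        if best is None or s > best[1]:
            best = (choice, s, index)
        if s == 100.0:
            break  # early exit on a perfect match
    return best


def _match_in_worker(query):
    """Task executed in a worker process against its local lookup table."""
    return best_match(query, _CHOICES)


def match_all(queries, choices, workers=None):
    """Run one best_match per query, spread across worker processes."""
    with ProcessPoolExecutor(
        max_workers=workers, initializer=_init_worker, initargs=(choices,)
    ) as pool:
        # chunksize batches queries to reduce inter-process overhead
        return list(pool.map(_match_in_worker, queries, chunksize=64))


if __name__ == "__main__":
    lookup = ["apple", "banana", "cherry"]
    queries = ["banana", "chery"]
    for query, match in zip(queries, match_all(queries, lookup, workers=2)):
        print(query, "->", match)
```

The lookup table is passed through the pool initializer so that each worker receives it once. With millions of entries, sending it along with every task would likely dominate the runtime.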

The parallelism mentioned in this issue is what I was hoping to implement. However, it’s a significant task, and I haven’t had much time to dedicate to it recently. I’d still love to see it happen at some point, but considering the limited number of users who would truly benefit from it, it hasn't been a top priority.
