Faster performance with extractOne #429

Open
VaradDaniel opened this issue Feb 4, 2025 · 1 comment


@VaradDaniel

It would be very useful to have the option of using all CPUs with a workers=-1 option in extractOne. I feel this should be possible since cdist already supports it, and I personally find extractOne a lot simpler to use.

My use case is to match a large dataset of about 100k records against a lookup table with over 5 million entries. This is currently proving very slow, with my machine managing only about 2k records per hour.

@maxbachmann
Member

The implementation of extractOne is somewhat different from cdist due to the ability to exit early in the case of a perfect match.

So in pseudo code it's implemented like this:

query = preprocess(query)
for choice in choices:
   choice = preprocess(choice)
   scorer(query, choice)
   if perfectScore:
       exit
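
As a rough runnable illustration of the loop above (not the library's actual implementation): the names `extract_one_sketch` and `default_scorer` are hypothetical, and `difflib` is only a standard-library stand-in for a real scorer. The point is that each choice is preprocessed lazily, so a perfect match stops both scoring and preprocessing.

```python
import difflib


def default_scorer(a, b):
    """Stand-in similarity in [0, 100]; an assumption, not a RapidFuzz scorer."""
    return 100.0 * difflib.SequenceMatcher(None, a, b).ratio()


def extract_one_sketch(query, choices, scorer=default_scorer, processor=str.lower):
    """Return (choice, score, index) of the best match, stopping at a perfect score."""
    query = processor(query)
    best = None
    for index, choice in enumerate(choices):
        # Preprocessing happens inside the loop, one choice at a time.
        score = scorer(query, processor(choice))
        if best is None or score > best[1]:
            best = (choice, score, index)
        if score == 100.0:
            # Perfect match: later choices are never preprocessed or scored.
            break
    return best
```

Because `processor` and `scorer` may be arbitrary Python callables, every iteration of this loop can call back into Python code.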

To make it possible to multithread this we have two options:

  1. acquire/release the GIL on each iteration.
  2. preprocess ahead of time:
query = preprocess(query)
choices = [preprocess(choice) for choice in choices]
for choice in choices:
   scorer(query, choice)
   if perfectScore:
       exit
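
A minimal runnable sketch of the second variant, under the same caveats: `extract_one_eager` and `score` are hypothetical names, and `difflib` stands in for a real scorer. It splits the work into a preprocessing phase and a scoring phase, which shows the cost: every choice is preprocessed even if the loop exits early.

```python
import difflib


def score(a, b):
    """Stand-in similarity in [0, 100]; an assumption, not a RapidFuzz scorer."""
    return 100.0 * difflib.SequenceMatcher(None, a, b).ratio()


def extract_one_eager(query, choices, processor=str.lower):
    """Preprocess all choices up front, then score with an early exit."""
    query = processor(query)
    # Phase 1: every choice is preprocessed, including ones never scored.
    processed = [processor(choice) for choice in choices]
    # Phase 2: the loop only sees already-processed strings, so no
    # user-supplied preprocessing callback runs inside it.
    best = None
    for index, candidate in enumerate(processed):
        s = score(query, candidate)
        if best is None or s > best[1]:
            best = (choices[index], s, index)
        if s == 100.0:
            break  # still exits early, but phase 1 has already been paid for
    return best
```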

I wasn't really happy with either of the two variants:

  1. has to acquire/release the GIL pretty often, but allows the short-circuiting to skip preprocessing
  2. doesn't allow preprocessing to be short-circuited. Using SIMD would be possible, but it might not be worth the cost of sorting the string list first.

As for multi-threading it would probably make most sense to run extractOne in parallel instead of running the parallelism inside extractOne. You could already do this using multiprocessing. What's the similarity function you are using? Do you specify a preprocessing function?
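
One way to run extractOne in parallel from the outside is to split the queries into batches and hand each batch to a worker process. The sketch below is only an illustration under assumptions: `init_worker`, `match_chunk`, `chunked`, and `match_all` are hypothetical helper names, the default scorer is used, and the `difflib` fallback exists only so the sketch runs where rapidfuzz is not installed.

```python
from concurrent.futures import ProcessPoolExecutor

try:
    from rapidfuzz import process
    extract_one = process.extractOne  # returns (choice, score, index)
except ImportError:
    # Standard-library stand-in so the sketch still runs without rapidfuzz.
    import difflib

    def extract_one(query, choices):
        scored = (
            (c, 100.0 * difflib.SequenceMatcher(None, query, c).ratio(), i)
            for i, c in enumerate(choices)
        )
        return max(scored, key=lambda t: t[1], default=None)

_CHOICES = []  # each worker process keeps its own copy of the lookup table


def init_worker(choices):
    """Runs once per worker process: store the lookup table in a global."""
    global _CHOICES
    _CHOICES = list(choices)


def match_chunk(queries):
    """Match one batch of queries against the per-process lookup table."""
    return [extract_one(q, _CHOICES) for q in queries]


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def match_all(queries, choices, workers=None, chunk_size=1000):
    """Distribute batches of queries over a pool of worker processes."""
    results = []
    with ProcessPoolExecutor(max_workers=workers, initializer=init_worker,
                             initargs=(choices,)) as pool:
        # pool.map preserves the order of the submitted batches.
        for part in pool.map(match_chunk, chunked(queries, chunk_size)):
            results.extend(part)
    return results
```

Call `match_all(queries, choices, workers=4)` from under an `if __name__ == "__main__":` guard so worker processes can start safely on every platform. Note that each worker holds its own copy of the lookup table, which matters for memory with millions of entries.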

The parallelism mentioned in this issue is what I was hoping to implement. However, it’s a significant task, and I haven’t had much time to dedicate to it recently. I’d still love to see it happen at some point, but considering the limited number of users who would truly benefit from it, it hasn't been a top priority.
