Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set CPU threads according to the machine #965

Merged

Conversation

ozancaglayan
Copy link
Contributor

@ozancaglayan ozancaglayan commented Aug 18, 2024

Fixes suboptimal performance on CPUs with less than 16 cores by setting the number of CPU threads to the actual number of physical cores dynamically.

Fixes #917

@ozancaglayan ozancaglayan changed the title Set CPU threads according to the machine (#917) Set CPU threads according to the machine Aug 18, 2024
@@ -587,7 +588,7 @@ def __init__(
device: str = "auto",
device_index: Union[int, List[int]] = 0,
compute_type: str = "default",
cpu_threads: int = 16,
cpu_threads: int = multiprocessing.cpu_count() // 2,
Copy link
Collaborator

@MahmoudAshraf97 MahmoudAshraf97 Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not be correct on Intel processors that use E and P cores, for example intel i7 12700k has 12 cores and 20 threads, wouldn't it be better to set it to 0 to use the system default value? as even if cpu_count//2 is correct, this is the number of threads that will be used per process, and if num_workers was set to 2 or higher, each worker is going to request the total number of threads to itself which will result in a lot of context switching.

Refer to intra_thread history in CT2 Changelog and Performance Tips for more information

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not be correct on Intel processors that use E and P cores, for example intel i7 12700k has 12 cores and 20 threads, wouldn't it be better to set it to 0 to use the system default value? as even if cpu_count//2 is correct, this is the number of threads that will be used per process, and if num_workers was set to 2 or higher, each worker is going to request the total number of threads to itself which will result in a lot of context switching.

Refer to intra_thread history in CT2 Changelog and Performance Tips for more information

To account for this you could take function in my last message and divide whatever number you want to use by the number of workers, could you not...thus always leaving four threads/logical cores for a user's background tasks on their computer?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use the CT2 default value which will suit most users, power users always have the option to tune this value to their needs, as you mentioned, each use case will have a best value, there is no one value fits all here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use the CT2 default value which will suit most users, power users always have the option to tune this value to their needs, as you mentioned, each use case will have a best value, there is no one value fits all here

True, impossible to plan for all use cases dynamically...At least @ozancaglayan has some cool functionality if he wants to experiment with it in his scripts - i.e. a dynamic way to set "cpu_threads" based on p/e cores, logical cores or what have you! Just determine the cpu_threads in your own script and then pass it to faster-whisper right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting to the original '0' seems best for widest compatibility

@BBC-Esq
Copy link
Contributor

BBC-Esq commented Aug 20, 2024

I just created a perfect function for this hold on and I'll post it..

@BBC-Esq
Copy link
Contributor

BBC-Esq commented Aug 20, 2024

Here it is - NOTE, this is for Intel cpus and would need to be modified to work with apple/amd:

  1. If you only care about the number of threads (regardless of cpu architecture) will work fine.
    logical_cores = psutil.cpu_count(logical=True)
  • This is the same as using multiprocessing.cpu_count()
  1. To get physical cores with psutil:
  • psutil.cpu_count(logical=False)
  1. I'm not aware of another library besides psutil that can get both logical and physical cores. However, psutil enables you to deduce the number of performance versus efficiency cores with the following trickery:
import psutil

def get_core_count():
    logical_cores = psutil.cpu_count(logical=True)
    physical_cores = psutil.cpu_count(logical=False)
    
    # logical cores equal physical cores (no hyper-threading)
    if logical_cores == physical_cores:
        thread_count = logical_cores
    # logical cores are exactly twice physical cores (hyper-threading)
    elif logical_cores == 2 * physical_cores:
        hyperthreaded_non_performance_cores = physical_cores
        thread_count = logical_cores
    # For hybrid architecture CPUs - i.e. CPUs with performance/efficiency cores
    else:
        performance_cores = logical_cores - physical_cores
        efficiency_cores = physical_cores - (logical_cores - physical_cores)
        thread_count = logical_cores

    print(f"Performance cores = {performance_cores}")
    print(f"Efficiency cores = {efficiency_cores}")

EXPLANATION:

  • Older cpus without hyper-threading:
    • Logical cores will always equal physical cores. This is accounted for in the first "if" statement.
  • Post-hyperthreading (but before the performance/efficiency hybrid architecture):
    • "physical" cores will always be exactly be double the "logical" cores. This is accounted for in the first "elif" statement.
  • For the new performance/efficiency hybrid architecture:
    • The number of performance cores will always equal logical cores minus physical cores. This is based on the fact that p-cores always have 2 threads while e-cores have 1.
    • Finally, the number of efficiency cores can be calculated with this line:
efficiency_cores = physical_cores - (logical_cores - physical_cores)

Apple/AMD

  • It's my understanding apple cpu cores don't use hyperthreading, but some can be p-cores (e.g. "Firestorm" or "Avalanche") versus e-cores (e.g. "Icestorm" or "Blizzard")
    • Someone with a mac would have to modify the above if you truly get the p-cores versus e-cores on a mac...otherwise, just use psutil's "logical" core count.
  • AMD cpus currently use hyperthreading (caled SMT) but don't use a performance/efficiency distinction.

Conclusion:

I've always wondered why the default cpu_threads was 4...likely implemented awhile ago before the increasing core counts. Regardless, if you want to set it dynamically to maximize the cpu-speed of faster-whisper I'd recommend changing the default. HOWEVER, I highly recommend always leave four threads - i.e.g four logical cores as psutil uses that term...needed for the typical users background tasks just to be safe.

@ozancaglayan
Copy link
Contributor Author

ozancaglayan commented Aug 21, 2024

Yes, its true that the // 2 will underestimate the number of CPU threads if there is no hyperthreading type logical cores. Correct me if I'm wrong but setting this value to 0 does not seem to be setting it to the actual number of cores

https://github.com/OpenNMT/CTranslate2/blob/8ba828c0cf3d72e93ec675cd2e472b64b8c55b64/src/utils.cc#L77

  • if num_threads == 0 , it tries to read OMP_NUM_THREADS from env
    • If its found, that value is set
    • If not get_default_num_threads() is called
      • this function will pick std::min(default_num_threads (which is 4 in Ctranslate), max_num_threads) for some reason e.g. it will never set the number of threads to std::thread::hardware_concurrency() and cap it to 4

So in short:

  • Current main will perform much slower than the optimum if the machine has < 16 physical cores
  • Setting it to 0, will set it to 4 internally in CTranslate2 (old default), will underperform for modern machines
  • Setting it to cpu_count() // 2 can offer a good tradeoff

https://stackoverflow.com/a/55423170/821797

I don't have any preferences tbh. Just trying to contribute. I'm not even using CPUs for Whisper.

Thanks

@BBC-Esq
Copy link
Contributor

BBC-Esq commented Aug 22, 2024

Personally, I'd like to like for the 4 number to be updated for faster-whisper, even if it deviates from what ctranslate2 sets as the default. This would reflect the reality that cpu thread counts have increased. With that being said, I understand that others like @MahmoudAshraf97 and others have more clout than me since they actually contribute to the code base while I simply stand on the sidelines reaping the benefits...but if it were my repo here's how I'd set it:

  1. Get the number of "logical" cores - i.e. threads.
  2. Set the threads to either (a) the number of logical cores minus 8 or (b) 4, whichever is more...
  3. If the user has specified a number of processes - i.e. "inter threads - divide the amount of threads in 1-2 above by the number of inter threads.

I don't see the harm in this. 4 threads is clearly outdated...maybe bump it to 6...8 even? That's all I'll say. Will let the big boys who actually contribute code to the library decide. ;-)

@MahmoudAshraf97
Copy link
Collaborator

MahmoudAshraf97 commented Aug 22, 2024

We are talking two different directions here, setting it to 0 to use CT2 default value might not result in the best performance but it will be the most compatible option for most users, setting it to use the max number of threads may (or may not, needs to be tested on multiple configurations) result in the best performance, but it also has an implicit assumption that FW is the highest priority process running which might not be the case, in such case, I'd opt for the safest option which is setting it to 0, in all cases this is not a hardcoded parameters and users should tune it to their liking

Copy link
Collaborator

@MahmoudAshraf97 MahmoudAshraf97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after a further investigation, I found that after using 4 or more cores the returns are negligible, so I'd prefer to stick to the original value until a better solution is implemented in CT2

faster_whisper/transcribe.py Outdated Show resolved Hide resolved
faster_whisper/transcribe.py Outdated Show resolved Hide resolved
faster_whisper/transcribe.py Outdated Show resolved Hide resolved
@MahmoudAshraf97 MahmoudAshraf97 merged commit f978fa2 into SYSTRAN:master Oct 30, 2024
3 checks passed
mesnilgr pushed a commit to mesnilgr/faster-whisper that referenced this pull request Oct 30, 2024
@ooobo
Copy link

ooobo commented Oct 30, 2024

after a further investigation, I found that after using 4 or more cores the returns are negligible, so I'd prefer to stick to the original value until a better solution is implemented in CT2

just to clarify - in 1.0.3 before the 'big pr' cpu_threads default was 0 not 4, and as ozancaglayan noted it ends up being 4 if OMP_NUM_THREADS isn't set. Haven't looked at the code but perhaps better as 0 for back compatibility?

https://github.com/OpenNMT/CTranslate2/blob/8ba828c0cf3d72e93ec675cd2e472b64b8c55b64/src/utils.cc#L77

  • if num_threads == 0 , it tries to read OMP_NUM_THREADS from env

    • If its found, that value is set

    • If not get_default_num_threads() is called

      • this function will pick std::min(default_num_threads (which is 4 in Ctranslate), max_num_threads) for some reason e.g. it will never set the number of threads to std::thread::hardware_concurrency() and cap it to 4

@MahmoudAshraf97
Copy link
Collaborator

@ooobo You are correct, my bad for not confirming this

MahmoudAshraf97 added a commit that referenced this pull request Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

major slowdown with batching commit - cpu only
4 participants