Set CPU threads according to the machine #965

ozancaglayan · 2024-08-18T12:57:58Z

Fixes suboptimal performance on CPUs with fewer than 16 cores by dynamically setting the number of CPU threads to the actual number of physical cores.

Fixes #917

MahmoudAshraf97 · 2024-08-20T13:12:44Z

faster_whisper/transcribe.py

@@ -587,7 +588,7 @@ def __init__(
        device: str = "auto",
        device_index: Union[int, List[int]] = 0,
        compute_type: str = "default",
-        cpu_threads: int = 16,
+        cpu_threads: int = multiprocessing.cpu_count() // 2,


This will not be correct on Intel processors that mix E and P cores. For example, the Intel i7-12700K has 12 cores and 20 threads, so cpu_count() // 2 gives 10. Wouldn't it be better to set it to 0 to use the system default value? Even if cpu_count() // 2 is correct, this is the number of threads used per process: if num_workers is set to 2 or higher, each worker requests the total number of threads for itself, which results in a lot of context switching.

Refer to intra_thread history in CT2 Changelog and Performance Tips for more information

To account for this, you could take the function in my last message and divide whatever number you want to use by the number of workers, could you not? That would always leave four threads/logical cores for a user's background tasks on their computer.

I'd rather use the CT2 default value, which will suit most users. Power users always have the option to tune this value to their needs. As you mentioned, each use case will have its own best value; there is no one-size-fits-all value here.

True, impossible to plan for all use cases dynamically...At least @ozancaglayan has some cool functionality if he wants to experiment with it in his scripts - i.e. a dynamic way to set "cpu_threads" based on p/e cores, logical cores or what have you! Just determine the cpu_threads in your own script and then pass it to faster-whisper right?
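
A minimal sketch of that approach, assuming you want to keep four logical cores free for background tasks and split the rest across workers. The helper name `pick_cpu_threads` and its `reserve` parameter are hypothetical and not part of faster-whisper; `cpu_threads` and `num_workers` are the `WhisperModel` arguments discussed in this thread.

```python
import os

def pick_cpu_threads(num_workers=1, reserve=4, logical_cores=None):
    """Return a per-worker thread count, keeping `reserve` logical cores free."""
    if logical_cores is None:
        # os.cpu_count() reports logical cores and may return None
        logical_cores = os.cpu_count() or 1
    usable = max(1, logical_cores - reserve)
    # each worker requests its own threads, so split the budget between them
    return max(1, usable // max(1, num_workers))

if __name__ == "__main__":
    threads = pick_cpu_threads(num_workers=2)
    print(f"cpu_threads per worker: {threads}")
    # Then pass it in from your own script, for example:
    #   from faster_whisper import WhisperModel
    #   model = WhisperModel("small", device="cpu", cpu_threads=threads, num_workers=2)
```

Whether this beats the default depends on the machine and workload, so it is worth benchmarking before relying on it.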

Reverting to the original '0' seems best for widest compatibility

BBC-Esq · 2024-08-20T13:15:50Z

I just created a perfect function for this hold on and I'll post it..

BBC-Esq · 2024-08-20T14:39:41Z

Here it is - NOTE, this is for Intel cpus and would need to be modified to work with apple/amd:

If you only care about the number of threads (regardless of CPU architecture), this will work fine:

logical_cores = psutil.cpu_count(logical=True)

This is the same as using multiprocessing.cpu_count()

To get physical cores with psutil:

psutil.cpu_count(logical=False)

I'm not aware of another library besides psutil that can get both logical and physical cores. However, psutil enables you to deduce the number of performance versus efficiency cores with the following trickery:

import psutil

def get_core_count():
    """Estimate (thread_count, performance_cores, efficiency_cores) on Intel CPUs."""
    logical_cores = psutil.cpu_count(logical=True)
    # psutil returns None when the physical count cannot be determined
    physical_cores = psutil.cpu_count(logical=False) or logical_cores

    # logical cores equal physical cores (no hyper-threading)
    if logical_cores == physical_cores:
        # no hybrid split: count every core as a performance core
        performance_cores = physical_cores
        efficiency_cores = 0
    # logical cores are exactly twice physical cores (hyper-threading)
    elif logical_cores == 2 * physical_cores:
        performance_cores = physical_cores
        efficiency_cores = 0
    # For hybrid architecture CPUs - i.e. CPUs with performance/efficiency cores
    else:
        # p-cores contribute 2 threads each, e-cores contribute 1
        performance_cores = logical_cores - physical_cores
        efficiency_cores = physical_cores - performance_cores

    # every case uses all logical cores as the thread count
    thread_count = logical_cores

    print(f"Performance cores = {performance_cores}")
    print(f"Efficiency cores = {efficiency_cores}")
    return thread_count, performance_cores, efficiency_cores

EXPLANATION:

Older CPUs without hyper-threading:
- Logical cores will always equal physical cores. This is accounted for in the first "if" statement.
Post-hyperthreading (but before the performance/efficiency hybrid architecture):
- "Logical" cores will always be exactly double the "physical" cores. This is accounted for in the "elif" statement.
For the new performance/efficiency hybrid architecture:
- The number of performance cores will always equal logical cores minus physical cores. This is based on the fact that p-cores always have 2 threads while e-cores have 1.
- Finally, the number of efficiency cores can be calculated with this line:

efficiency_cores = physical_cores - (logical_cores - physical_cores)
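
As a worked check of that arithmetic: if P is the number of p-cores and E the number of e-cores, then physical = P + E and logical = 2P + E, so P = logical - physical and E = physical - P. The sketch below applies this to the i7-12700K figures mentioned earlier in the thread (12 cores, 20 threads). The function name `split_hybrid_cores` is hypothetical, and the result only holds under the assumption that every p-core has 2 threads and every e-core has 1.

```python
def split_hybrid_cores(logical_cores, physical_cores):
    """Return (performance_cores, efficiency_cores) assuming 2 threads per p-core, 1 per e-core."""
    performance_cores = logical_cores - physical_cores
    efficiency_cores = physical_cores - performance_cores
    return performance_cores, efficiency_cores

# Intel i7-12700K: 20 logical cores, 12 physical cores
print(split_hybrid_cores(20, 12))  # (8, 4)
```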

Apple/AMD

It's my understanding that Apple CPU cores don't use hyperthreading, but some can be p-cores (e.g. "Firestorm" or "Avalanche") versus e-cores (e.g. "Icestorm" or "Blizzard").
- Someone with a Mac would have to modify the above if you truly want to get the p-cores versus e-cores on a Mac...otherwise, just use psutil's "logical" core count.
AMD CPUs currently use hyperthreading (called SMT) but don't use a performance/efficiency distinction.

Conclusion:

I've always wondered why the default cpu_threads was 4; it was likely implemented a while ago, before core counts started increasing. Regardless, if you want to set it dynamically to maximize the CPU speed of faster-whisper, I'd recommend changing the default. HOWEVER, I highly recommend always leaving four threads (i.e. four logical cores, as psutil uses that term) free for the typical user's background tasks, just to be safe.

ozancaglayan · 2024-08-21T15:56:01Z

Yes, it's true that the // 2 will underestimate the number of CPU threads if there are no hyperthreading-type logical cores. Correct me if I'm wrong, but setting this value to 0 does not seem to set it to the actual number of cores.

https://github.com/OpenNMT/CTranslate2/blob/8ba828c0cf3d72e93ec675cd2e472b64b8c55b64/src/utils.cc#L77

If num_threads == 0, it tries to read OMP_NUM_THREADS from the environment:
- If it's found, that value is used
- If not, get_default_num_threads() is called
  - this function will pick std::min(default_num_threads (which is 4 in CTranslate2), max_num_threads) for some reason, i.e. it will never set the number of threads to std::thread::hardware_concurrency() and will cap it at 4
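
The fallback described above can be paraphrased in Python as follows. This is only a sketch of the behaviour as described in this comment, not CTranslate2's actual C++ code; the function name `resolve_num_threads` and its `env` and `hardware_threads` parameters are hypothetical and exist just to make the logic easy to test.

```python
import os

CT2_DEFAULT_NUM_THREADS = 4  # the default quoted in the comment above

def resolve_num_threads(num_threads, env=None, hardware_threads=None):
    """Mimic the described fallback: explicit value, then OMP_NUM_THREADS, then min(4, hardware)."""
    env = os.environ if env is None else env
    if hardware_threads is None:
        hardware_threads = os.cpu_count() or 1
    # an explicit non-zero value is used as-is
    if num_threads != 0:
        return num_threads
    # otherwise the OMP_NUM_THREADS environment variable is consulted
    omp = int(env.get("OMP_NUM_THREADS", "0") or 0)
    if omp > 0:
        return omp
    # finally the default is capped by the hardware thread count, so never more than 4
    return min(CT2_DEFAULT_NUM_THREADS, hardware_threads)

print(resolve_num_threads(0, env={}, hardware_threads=16))  # 4
```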

So in short:

- Current main will perform much slower than the optimum if the machine has < 16 physical cores
- Setting it to 0 will set it to 4 internally in CTranslate2 (the old default), which will underperform on modern machines
- Setting it to cpu_count() // 2 can offer a good tradeoff

https://stackoverflow.com/a/55423170/821797

I don't have any preferences tbh. Just trying to contribute. I'm not even using CPUs for Whisper.

Thanks

BBC-Esq · 2024-08-22T12:30:29Z

Personally, I'd like the default of 4 to be updated for faster-whisper, even if it deviates from what ctranslate2 sets as the default. This would reflect the reality that CPU thread counts have increased. With that being said, I understand that @MahmoudAshraf97 and others have more clout than me since they actually contribute to the code base while I simply stand on the sidelines reaping the benefits...but if it were my repo, here's how I'd set it:

1. Get the number of "logical" cores - i.e. threads.
2. Set the threads to either (a) the number of logical cores minus 8 or (b) 4, whichever is more...
3. If the user has specified a number of processes - i.e. "inter threads" - divide the number of threads from 1-2 above by the number of inter threads.
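
A rough sketch of the rule above, for illustration only. The function name `suggested_cpu_threads` is hypothetical, and the use of integer division with a floor of one thread per worker is an assumption the steps do not spell out.

```python
def suggested_cpu_threads(logical_cores, inter_threads=1):
    """Steps 1-3 above: max(logical - 8, 4), then split across inter threads."""
    # step 2: logical cores minus 8, or 4, whichever is more
    budget = max(logical_cores - 8, 4)
    # step 3: divide by the number of inter threads (assumed: integer division, at least 1)
    return max(1, budget // max(1, inter_threads))

print(suggested_cpu_threads(20))                   # 12
print(suggested_cpu_threads(8))                    # 4
print(suggested_cpu_threads(20, inter_threads=2))  # 6
```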

I don't see the harm in this. 4 threads is clearly outdated...maybe bump it to 6...8 even? That's all I'll say. Will let the big boys who actually contribute code to the library decide. ;-)

MahmoudAshraf97 · 2024-08-22T17:10:28Z

We are talking in two different directions here. Setting it to 0 to use the CT2 default value might not result in the best performance, but it will be the most compatible option for most users. Setting it to use the max number of threads may (or may not; this needs to be tested on multiple configurations) result in the best performance, but it also carries an implicit assumption that FW is the highest-priority process running, which might not be the case. In that case, I'd opt for the safest option, which is setting it to 0. In all cases this is not a hardcoded parameter, and users should tune it to their liking.

by mistake

MahmoudAshraf97

After further investigation, I found that beyond 4 cores the returns are negligible, so I'd prefer to stick to the original value until a better solution is implemented in CT2.

faster_whisper/transcribe.py

Co-authored-by: Mahmoud Ashraf <[email protected]>

ooobo · 2024-10-30T19:43:19Z

Just to clarify: in 1.0.3, before the 'big pr', the cpu_threads default was 0, not 4, and as ozancaglayan noted, it ends up being 4 if OMP_NUM_THREADS isn't set. I haven't looked at the code, but perhaps 0 is better for backward compatibility?

MahmoudAshraf97 · 2024-10-30T19:49:08Z

@ooobo You are correct, my bad for not confirming this

ozancaglayan changed the title ~~Set CPU threads according to the machine (#917)~~ Set CPU threads according to the machine Aug 18, 2024

MahmoudAshraf97 reviewed Aug 20, 2024

MahmoudAshraf97 previously approved these changes Oct 23, 2024

set cpu threads according to the machine

37a0365

MahmoudAshraf97 force-pushed the optimum-number-of-threads branch from 8ad11ae to 37a0365 Compare October 30, 2024 13:06

MahmoudAshraf97 requested changes Oct 30, 2024

Apply suggestions from code review

e0a3534

MahmoudAshraf97 approved these changes Oct 30, 2024

MahmoudAshraf97 merged commit f978fa2 into SYSTRAN:master Oct 30, 2024
3 checks passed

mesnilgr pushed a commit to mesnilgr/faster-whisper that referenced this pull request Oct 30, 2024

Revert CPU default threads to 4 (SYSTRAN#965)

3c176be

Co-authored-by: Mahmoud Ashraf <[email protected]>

MahmoudAshraf97 added a commit that referenced this pull request Oct 30, 2024

Revert CPU default threads to 0

814472f
