Labels: ep:TensorRT (issues related to TensorRT execution provider), performance (issues related to performance regressions)
Description
Describe the issue
Summary
- The problem: only a few of our servers show high CPU usage when doing inference with onnxruntime-gpu.
- From profiling, I found the CPU usage comes from spin-waiting (spinPause) in the intra-op thread pool.
- But our model runs entirely on the GPU via the TensorRT Execution Provider, with no CPU computation.
- I can't understand why this happens.
Details
Environment Setup
System Architecture
- I am using onnxruntime in a production environment with the following process/thread structure:
- Web Server: a Gunicorn worker process (handling HTTP requests/responses) on python3.13t (the free-threaded build)
- Task Distribution: multiprocessing.Queue for bi-directional communication between the Gunicorn process and inference workers.
- Inference Workers:
- Two dedicated Inference Threads are spawned.
- Each thread initializes and maintains two separate onnxruntime.InferenceSession instances with TensorRT EP: one for Batch 1 and another for Batch 4.
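For reference, a minimal sketch of the per-thread session setup (the model paths and the fallback providers in the list are illustrative, not our exact configuration):

import onnxruntime as ort

def make_session(model_path: str) -> ort.InferenceSession:
    # TensorRT EP first; CUDA/CPU appear only as fallbacks in the provider list
    providers = [
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
    return ort.InferenceSession(model_path, providers=providers)

# Per inference thread: one session per supported batch size
session_b1 = make_session("model_batch1.onnx")  # illustrative path
session_b4 = make_session("model_batch4.onnx")  # illustrative path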
Inference Workflow
- The Gunicorn process receives a request, packages it as a task, and puts it into the multiprocessing.Queue.
- One of the inference threads retrieves the task from the queue.
- The thread selects the appropriate ONNX session (batch 1 or batch 4) based on the task and executes inference.
- The inference result is sent back to the Gunicorn process via the queue.
- Gunicorn returns the final response to the client.
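A simplified version of the worker loop described above (the queue payload shape and names are illustrative):

def inference_worker(task_queue, result_queue, session_b1, session_b4):
    while True:
        task = task_queue.get()          # blocks until Gunicorn enqueues a task
        if task is None:                 # sentinel used for shutdown
            break
        session = session_b1 if task["batch"] == 1 else session_b4
        outputs = session.run(None, task["inputs"])  # inference via TensorRT EP
        result_queue.put({"id": task["id"], "outputs": outputs})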
Problem and mystery
- Problem: a few instances (about 3 in 100) show high CPU usage (e.g., normally ~3%, but above 50% on the affected instances).
- Mystery
- I suspect this comes from the intra-op thread pool.
- The threads also seem to sleep and wake up repeatedly, even though, as far as I can tell, all of the pool's threads should only be woken together by its destructor.
- Importantly, our model only uses the GPU through the TensorRT EP; there is no CPU computation.
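For context (my understanding, not specific to this issue): ONNX Runtime creates an intra-op thread pool for every InferenceSession regardless of which execution provider runs the model, so the pool exists even for a GPU-only model. A minimal check of the default:

import onnxruntime as ort

so = ort.SessionOptions()
print(so.intra_op_num_threads)  # 0 = "let ORT decide", typically one thread per physical core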
Clues
- Clue 1: the high CPU consumption happens in onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so
- Clue 2: address 0x0000000000bbd6b3 in onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so is spinPause:
root@dc97bf6c52fc:/opt# objdump -d --start-address=0xbbd300 --stop-address=0xbbd700 /usr/local/lib/python3.13t/dist-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so
...
<PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8baf16>
bbd6a3: 41 8b 84 24 90 00 00 mov 0x90(%r12),%eax
bbd6aa: 00
bbd6ab: 85 c0 test %eax,%eax
bbd6ad: 0f 84 dd 07 00 00 je bbde90 <PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8baf20>
bbd6b3: f3 90 pause
bbd6b5: e9 36 fd ff ff jmp bbd3f0 <PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8ba480>
bbd6ba: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
bbd6c0: 44 89 c0 mov %r8d,%eax
...
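(For reference: f3 90 is the x86 pause instruction, the standard hint placed inside spin-wait loops. Seeing it as the hottest instruction is consistent with intra-op worker threads busy-waiting for work rather than doing real computation.)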
- Clue 3: some threads repeatedly sleep and wake up
$ sudo strace -p 30191
strace: Process 30191 attached
futex(0x7f10386c30d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386c3088, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10387aca58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10386f5d58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10387cb1dc, FUTEX_WAKE_PRIVATE, 1) = 1
...
futex(0x7f10386c30d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386c3088, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10387aca58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10386f5d58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10387cb1dc, FUTEX_WAKE_PRIVATE, 1) = 1
$ sudo strace -p 30193
strace: Process 30193 attached
futex(0x7f10386d75d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386d7588, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10386d75dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
...
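The pattern of one FUTEX_WAIT followed by a burst of FUTEX_WAKE calls looks like the thread pool's park/unpark path: a worker is signalled and, on waking, in turn wakes its peers. Repeated at high frequency, this would explain both the futex churn and the spinning above (my interpretation, not confirmed against the ORT source).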
- Clue 4: our model uses only GPU computation with the TensorRT EP
&&& RUNNING TensorRT.trtexec [TensorRT v100900] [b34] # /usr/src/tensorrt/bin/trtexec --loadEngine=TensorrtExecutionProvider_TRTKernel_graph_main_graph_9777126876712665165_0_0_sm75.engine --dumpLayerInfo
[12/11/2025-17:10:41] [I] === Model Options ===
[12/11/2025-17:10:41] [I] Format: *
[12/11/2025-17:10:41] [I] Model:
...
Conclusion
- I want to understand why this happens and how to solve it.
- I tried to work around it with the settings below and observed the following, but couldn't understand why:
- Under high load, the CPU load average increased.
- But the inference latency decreased.
- And the high CPU usage disappeared, though usage increased slightly in the normal case.
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 1  # no extra intra-op workers; ops run on the calling thread
sess_options.add_session_config_entry("session.intra_op.allow_spinning", "0")  # idle workers block instead of spinning
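My current understanding of why these two settings change the behavior: session.intra_op.allow_spinning controls whether idle intra-op workers busy-wait (the pause loop above) for a while before blocking on a futex, so setting it to "0" removes the idle spin at the cost of wake-up latency when work arrives; and with intra_op_num_threads = 1, ONNX Runtime spawns no intra-op worker threads at all, so any CPU-side work runs on the calling thread.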
To reproduce
Hard to reproduce. It happens probabilistically.
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.22
ONNX Runtime API
Python
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
CUDA 12.6, TensorRT 10.9
Model File
No response
Is this a quantized model?
No