[Performance] High CPU usage of inference in python no-gil and TensorRT EP #26847

@nimdrak

Description

Describe the issue

Summary

  • The problem: only a few of our servers have high CPU usage during inference with onnxruntime-gpu.
  • From profiling, I found the CPU usage comes from spinPause in onnxruntime's intra-op thread pool.
  • But our model runs entirely on the GPU via the TensorRT Execution Provider, with no CPU computation.
  • I can't understand why this happens.

Details

Environment Setup

System Architecture

  • I am using onnxruntime in a production environment with the following process/thread structure:
  • Web Server: Gunicorn worker process (handling HTTP requests/responses) in python3.13t
  • Task Distribution: multiprocessing.Queue for bi-directional communication between the Gunicorn process and inference workers.
  • Inference Workers:
    • Two dedicated Inference Threads are spawned.
    • Each thread initializes and maintains two separate onnxruntime.InferenceSession instances with the TensorRT EP: one for Batch 1 and another for Batch 4 (see the setup sketch after the diagram).
[Image: process/thread architecture diagram]
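
For reference, a minimal sketch of the per-thread session setup described above (the model paths are placeholders, not our real files):

import onnxruntime as ort

# Hypothetical model paths, for illustration only.
BATCH1_MODEL = "model_batch1.onnx"
BATCH4_MODEL = "model_batch4.onnx"

def create_session(model_path):
    # TensorRT EP first, falling back to CUDA and CPU if needed.
    providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    return ort.InferenceSession(model_path, providers=providers)

# Each inference thread keeps two independent sessions: one for batch 1, one for batch 4.
sessions = {1: create_session(BATCH1_MODEL), 4: create_session(BATCH4_MODEL)}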

Inference Workflow

  • Gunicorn Process receives a request, packages it as a task, and puts it into the multiprocessing.Queue.
  • One of the Inference Threads retrieves the task from the queue.
  • The thread selects the appropriate ONNX session (Batch 1 or 4) based on the task requirement and executes inference (a simplified sketch of this loop follows the list).
  • The inference result is sent back to the Gunicorn Process via the queue.
  • Gunicorn returns the final response to the client.
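
A simplified sketch of the inference-thread loop (the task/result field names and the input tensor name are illustrative, not our actual protocol):

def inference_worker(task_queue, result_queue, sessions):
    # task_queue / result_queue are multiprocessing.Queue objects shared with the Gunicorn process.
    # sessions maps batch size -> onnxruntime.InferenceSession (Batch 1 and Batch 4).
    while True:                                  # runs for the lifetime of the worker
        task = task_queue.get()                  # blocks until the Gunicorn process enqueues work
        sess = sessions[task["batch_size"]]      # pick the Batch 1 or Batch 4 session
        # "input" is a placeholder tensor name; the real model's input names differ.
        outputs = sess.run(None, {"input": task["input"]})
        result_queue.put({"task_id": task["task_id"], "outputs": outputs})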

Problem and mystery

  • Problem: a few instances (about 3 in 100) have high CPU usage (normally around 3%, but above 50% on the affected ones).
  • Mystery
    • I guess this problem comes from the intra-op thread pool.
    • Also, the threads seem to sleep and wake up repeatedly, even though (from what I can tell) all the threads are only woken up together in the pool's destructor.
    • Importantly, our model only uses the GPU via the TensorRT EP, with no CPU computation (a quick way to double-check this from Python is shown after this list).
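
As a sanity check (a generic onnxruntime call, not specific to our deployment), this is how I confirm from Python which execution providers a session actually uses:

import onnxruntime as ort

# Hypothetical path; the real model is not attached to this issue.
sess = ort.InferenceSession(
    "model_batch1.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
# Expected to list TensorrtExecutionProvider first if the TRT EP was registered successfully.
print(sess.get_providers())

As far as I understand, ONNX Runtime still creates the per-session intra-op thread pool even when every node is assigned to the TensorRT EP, which would explain where the spinning threads come from.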

Clues

  • Clue 1: the high CPU consumption happens in onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so:
[Image: profiling result]
  • Clue 2: address 0x0000000000bbd6b3 in onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so is a spinPause (pause instruction):
root@dc97bf6c52fc:/opt# objdump -d --start-address=0xbbd300 --stop-address=0xbbd700   /usr/local/lib/python3.13t/dist-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so
...
<PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8baf16>
  bbd6a3:       41 8b 84 24 90 00 00    mov    0x90(%r12),%eax
  bbd6aa:       00
  bbd6ab:       85 c0                   test   %eax,%eax
  bbd6ad:       0f 84 dd 07 00 00       je     bbde90 <PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8baf20>
  bbd6b3:       f3 90                   pause
  bbd6b5:       e9 36 fd ff ff          jmp    bbd3f0 <PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8ba480>
  bbd6ba:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  bbd6c0:       44 89 c0                mov    %r8d,%eax
...
  • Clue 3: some threads repeatedly sleep and wake up:
$ sudo strace -p 30191
strace: Process 30191 attached
futex(0x7f10386c30d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386c3088, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10387aca58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10386f5d58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10387cb1dc, FUTEX_WAKE_PRIVATE, 1) = 1
...
futex(0x7f10386c30d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386c3088, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10387aca58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10386f5d58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10387cb1dc, FUTEX_WAKE_PRIVATE, 1) = 1


$ sudo strace -p 30193
strace: Process 30193 attached
futex(0x7f10386d75d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386d7588, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10386d75dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
...
  • Clue 4: our model uses only GPU computation with the TensorRT EP:
&&& RUNNING TensorRT.trtexec [TensorRT v100900] [b34] # /usr/src/tensorrt/bin/trtexec --loadEngine=TensorrtExecutionProvider_TRTKernel_graph_main_graph_9777126876712665165_0_0_sm75.engine --dumpLayerInfo

[12/11/2025-17:10:41] [I] === Model Options ===
[12/11/2025-17:10:41] [I] Format: *
[12/11/2025-17:10:41] [I] Model:
...

Conclusion

  • I want to understand why this happens and how to solve it.
  • I tried to work around it with the settings below. I observed the following, but couldn't understand why it happened:
    • Under high load, the CPU load average increased.
    • But the inference latency decreased.
    • The abnormal high CPU usage disappeared, though CPU usage increased a little in the normal case.
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 1
sess_options.add_session_config_entry("session.intra_op.allow_spinning", "0")
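
For completeness, this is roughly how the workaround plugs into session creation with the TensorRT EP (the model path is a placeholder):

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 1  # no extra intra-op worker threads
sess_options.add_session_config_entry("session.intra_op.allow_spinning", "0")  # block instead of spin-waiting

sess = ort.InferenceSession(
    "model_batch1.onnx",  # hypothetical path
    sess_options=sess_options,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

As far as I understand, with intra_op_num_threads = 1 there are no extra intra-op worker threads left to spin, and allow_spinning = 0 makes any remaining pool threads block instead of busy-waiting, which matches the drop in CPU usage I observed.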

To reproduce

Hard to reproduce. It happens probabilistically.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.22

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 12.6, TensorRT 10.9

Model File

No response

Is this a quantized model?

No

    Labels

    ep:TensorRT (issues related to TensorRT execution provider), performance (issues related to performance regressions)
