[Performance] High CPU usage of inference in python no-gil and TensorRT EP #26847

@nimdrak

Description

Describe the issue

Summary

  • The problem: only a few of our servers have high CPU usage during inference with onnxruntime-gpu.
  • From profiling, I found the CPU usage comes from spinPause in onnxruntime's intra-op thread pool.
  • But our model runs entirely on the GPU via the TensorRT Execution Provider, with no CPU computation.
  • I can't understand why this happens.

Details

Environment Setup

System Architecture

  • I am using onnxruntime in a production environment with the following process/thread structure:
  • Web Server: Gunicorn worker process (handling HTTP requests/responses) in python3.13t
  • Task Distribution: multiprocessing.Queue for bi-directional communication between the Gunicorn process and inference workers.
  • Inference Workers:
    • Two dedicated Inference Threads are spawned.
    • Each thread initializes and maintains two separate onnxruntime.InferenceSession instances with the TensorRT EP: one for Batch 1 and another for Batch 4 (see the setup sketch after the diagram).
[Image: process/thread architecture diagram]
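
For reference, a minimal sketch of the per-thread session setup described above (the model paths are placeholders, not our real files):

import onnxruntime as ort

# Hypothetical model paths, for illustration only.
BATCH1_MODEL = "model_batch1.onnx"
BATCH4_MODEL = "model_batch4.onnx"

def create_session(model_path):
    # TensorRT EP first, falling back to CUDA and CPU if needed.
    providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    return ort.InferenceSession(model_path, providers=providers)

# Each inference thread keeps two independent sessions: one for batch 1, one for batch 4.
sessions = {1: create_session(BATCH1_MODEL), 4: create_session(BATCH4_MODEL)}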

Inference Workflow

  • Gunicorn Process receives a request, packages it as a task, and puts it into the multiprocessing.Queue.
  • One of the Inference Threads retrieves the task from the queue.
  • The thread selects the appropriate ONNX session (Batch 1 or 4) based on the task requirement and executes inference (a simplified sketch of this loop follows the list).
  • The inference result is sent back to the Gunicorn Process via the queue.
  • Gunicorn returns the final response to the client.
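
A simplified sketch of the inference-thread loop (the task/result field names and the input tensor name are illustrative, not our actual protocol):

def inference_worker(task_queue, result_queue, sessions):
    # task_queue / result_queue are multiprocessing.Queue objects shared with the Gunicorn process.
    # sessions maps batch size -> onnxruntime.InferenceSession (Batch 1 and Batch 4).
    while True:                                  # runs for the lifetime of the worker
        task = task_queue.get()                  # blocks until the Gunicorn process enqueues work
        sess = sessions[task["batch_size"]]      # pick the Batch 1 or Batch 4 session
        # "input" is a placeholder tensor name; the real model's input names differ.
        outputs = sess.run(None, {"input": task["input"]})
        result_queue.put({"task_id": task["task_id"], "outputs": outputs})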

Problem and mystery

  • Problem: a few instances (about 3 in 100) have high CPU usage (normally around 3%, but above 50% on the affected ones).
  • Mystery
    • I guess this problem comes from the intra-op thread pool.
    • Also, the threads seem to sleep and wake up repeatedly, even though (from what I can tell) all the threads are only woken up together in the pool's destructor.
    • Importantly, our model only uses the GPU via the TensorRT EP, with no CPU computation (a quick way to double-check this from Python is shown after this list).
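
As a sanity check (a generic onnxruntime call, not specific to our deployment), this is how I confirm from Python which execution providers a session actually uses:

import onnxruntime as ort

# Hypothetical path; the real model is not attached to this issue.
sess = ort.InferenceSession(
    "model_batch1.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
# Expected to list TensorrtExecutionProvider first if the TRT EP was registered successfully.
print(sess.get_providers())

As far as I understand, ONNX Runtime still creates the per-session intra-op thread pool even when every node is assigned to the TensorRT EP, which would explain where the spinning threads come from.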

Clues

  • Clue 1: the high CPU consumption happens in onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so:
[Image: profiling result]
  • Clue 2: address 0x0000000000bbd6b3 in onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so is a spinPause (pause instruction):
root@dc97bf6c52fc:/opt# objdump -d --start-address=0xbbd300 --stop-address=0xbbd700   /usr/local/lib/python3.13t/dist-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-313t-x86_64-linux-gnu.so
...
<PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8baf16>
  bbd6a3:       41 8b 84 24 90 00 00    mov    0x90(%r12),%eax
  bbd6aa:       00
  bbd6ab:       85 c0                   test   %eax,%eax
  bbd6ad:       0f 84 dd 07 00 00       je     bbde90 <PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8baf20>
  bbd6b3:       f3 90                   pause
  bbd6b5:       e9 36 fd ff ff          jmp    bbd3f0 <PyInit_onnxruntime_pybind11_state@@VERS_1.0+0x8ba480>
  bbd6ba:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  bbd6c0:       44 89 c0                mov    %r8d,%eax
...
  • Clue 3: some threads repeatedly sleep and wake up:
$ sudo strace -p 30191
strace: Process 30191 attached
futex(0x7f10386c30d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386c3088, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10387aca58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10386f5d58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10387cb1dc, FUTEX_WAKE_PRIVATE, 1) = 1
...
futex(0x7f10386c30d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386c3088, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10387aca58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10386f5d58, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f10387cb1dc, FUTEX_WAKE_PRIVATE, 1) = 1


$ sudo strace -p 30193
strace: Process 30193 attached
futex(0x7f10386d75d8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7f10386d7588, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f10386d75dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
...
  • Clue 4: our model uses only GPU computation with the TensorRT EP:
&&& RUNNING TensorRT.trtexec [TensorRT v100900] [b34] # /usr/src/tensorrt/bin/trtexec --loadEngine=TensorrtExecutionProvider_TRTKernel_graph_main_graph_9777126876712665165_0_0_sm75.engine --dumpLayerInfo

[12/11/2025-17:10:41] [I] === Model Options ===
[12/11/2025-17:10:41] [I] Format: *
[12/11/2025-17:10:41] [I] Model:
...

Conclusion

  • I want to understand why this happens and how to solve it.
  • I tried to work around it with the settings below. I observed the following, but couldn't understand why it happened:
    • Under high load, the CPU load average increased.
    • But the inference latency decreased.
    • The abnormal high CPU usage disappeared, though CPU usage increased a little in the normal case.
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 1
sess_options.add_session_config_entry("session.intra_op.allow_spinning", "0")
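
For completeness, this is roughly how the workaround plugs into session creation with the TensorRT EP (the model path is a placeholder):

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 1  # no extra intra-op worker threads
sess_options.add_session_config_entry("session.intra_op.allow_spinning", "0")  # block instead of spin-waiting

sess = ort.InferenceSession(
    "model_batch1.onnx",  # hypothetical path
    sess_options=sess_options,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

As far as I understand, with intra_op_num_threads = 1 there are no extra intra-op worker threads left to spin, and allow_spinning = 0 makes any remaining pool threads block instead of busy-waiting, which matches the drop in CPU usage I observed.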

To reproduce

Hard to reproduce. It happens probabilistically.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.22

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 12.6, TensorRT 10.9

Model File

No response

Is this a quantized model?

No

    Labels

    ep:TensorRT (issues related to TensorRT execution provider), performance (issues related to performance regressions)
