[Performance] Observing high memory spikes in C++ when running multiple inference Run() executions on CPU
#22920
Labels
performance (issues related to performance regressions)
platform:mobile (issues related to ONNX Runtime mobile; typically submitted using template)
Describe the issue
Description:
I am observing high memory spikes after each run of the inference session when passing the previous outputs of the inference to the input of the next iteration. This happens when the input values change during each iteration of the generation loop. Memory usage increases significantly after every `Run()` invocation, and the allocated memory grows larger and larger. In contrast, the Python version of my code does not exhibit these spikes, and memory usage remains essentially stable. This suggests that there may be an issue with memory management in the C++ version, possibly tied to the memory arena, memory patterns, or session configuration.
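For reference, the generation loop follows roughly this pattern (a minimal sketch, not my exact code; the model path and the tensor names `input_ids`/`logits` are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "gen-loop");
  Ort::SessionOptions opts;
  Ort::Session session(env, "model.onnx", opts);  // placeholder path

  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

  // The input grows by one token per iteration, so the shape changes
  // on every Run() call.
  std::vector<int64_t> tokens = {1};
  const char* input_names[] = {"input_ids"};  // placeholder name
  const char* output_names[] = {"logits"};    // placeholder name

  for (int step = 0; step < 64; ++step) {
    std::vector<int64_t> shape = {1, static_cast<int64_t>(tokens.size())};
    Ort::Value input = Ort::Value::CreateTensor<int64_t>(
        mem_info, tokens.data(), tokens.size(), shape.data(), shape.size());

    // Memory usage jumps after each of these calls and never drops back.
    auto outputs = session.Run(Ort::RunOptions{nullptr},
                               input_names, &input, 1,
                               output_names, 1);

    // Feed part of the previous output back as the next input
    // (token selection elided; 0 is a dummy stand-in).
    int64_t next_token = 0;
    tokens.push_back(next_token);
  }
  return 0;
}
```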
What I've Tried:
- Memory Arena & Arena Configuration: Created an `ArenaCfg` and registered a custom allocator, but this did not solve the problem.
- Session Options: Set `inter_op_num_threads` and `intra_op_num_threads` to `1`, set `session.use_device_allocator_for_initializers = 1` in the session options, and enabled memory arena shrinkage on `cpu:0` for session run, but the high memory allocation continues. (A sketch of these attempts follows this list.)
- Input & Output Memory Size:
Additional Information:
Question:
How can I better manage memory usage during inference when using dynamically changing inputs with the ONNX Runtime C++ API? Are there specific settings or techniques for reducing memory spikes that I may have missed? The same issue does not occur in the Python version.
To reproduce
Use `InferenceSession.Run()` multiple times in a loop, for example with dynamically increasing input shapes. The model is a quantized ONNX model exported from torch.
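One way to quantify the per-iteration growth is to log the process resident set size around each `Run()` call (a sketch, Linux/Android only; it parses the VmRSS line of /proc/self/status):

```cpp
#include <fstream>
#include <string>
#include <cstdio>

// Returns the process resident set size in kB, or -1 on failure.
// Linux/Android only: parses the VmRSS line of /proc/self/status.
long rss_kb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) {
      return std::stol(line.substr(6));  // value is reported in kB
    }
  }
  return -1;
}

// Usage inside the generation loop:
//   long before = rss_kb();
//   auto outputs = session.Run(run_opts, ...);
//   std::printf("step %d: RSS %+ld kB\n", step, rss_kb() - before);
```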
Observations:
Memory usage spikes after each call to `Run()`.
Expected Behavior:
Memory usage should remain stable across repeated calls to `Run()`.
Actual Behavior:
Significant memory spikes occur after each `Run()` invocation. The allocated memory grows with each iteration, and the growth is not linear.
Urgency
No response
Platform
Android
OS Version
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
ONNX Runtime 1.18
ONNX Runtime API
C++
Architecture
ARM64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes