CPU-bound performance using GPU-based chat #3874
supersteves asked this question in Q&A
I'm using LocalAI with qwen2.5-32b-instruct (4-bit quantized) to do chat completion. I'm roughly testing performance using the LocalAI chat interface, asking it to translate a 16-line poem to French, after loading and warming up the model first. It's a silly test case I'm using to get reasonably deterministic results for this model. I'm monitoring resource usage with nvidia-smi, top and iotop.
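(For reference, the equivalent API call would look roughly like the sketch below; I'm actually driving this through the web chat UI, so the port and request body are just illustrative.)

```bash
# Rough equivalent of the chat-UI test, against LocalAI's OpenAI-compatible
# endpoint. Port and poem text are placeholders, not my exact request.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-32b-instruct",
        "messages": [
          {"role": "user", "content": "Translate the following poem into French: <16-line poem here>"}
        ]
      }'
```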
I'm comparing two cloud instances of the Google G2 machine type with NVIDIA L4 GPUs.
I'm using Ubuntu with Docker, after installing the recommended Ubuntu NVIDIA drivers and the NVIDIA Container Toolkit. Running as follows.
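(The exact command isn't reproduced here; the sketch below shows that kind of launch, with the image tag, port and model path as assumptions rather than my actual values.)

```bash
# Sketch of launching LocalAI with GPU support via the NVIDIA Container Toolkit.
# The image tag and the in-container models path vary between LocalAI releases;
# check the LocalAI docs for the current ones.
docker run -d --gpus all \
  -p 8080:8080 \
  -v "$PWD/models:/build/models" \
  localai/localai:latest-gpu-nvidia-cuda-12
```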
My test is to translate a given 16-line poem to French (after loading and warming up the model first), while watching top. (iotop shows nothing significant.) In both cases, the test takes exactly 20 seconds to finish responding.
So it doesn't matter how many GPUs or CPUs I have: I'm hitting a CPU bottleneck. Some of the work is happening in a single thread on the CPU, which prevents me from throwing more resources at the problem to get better results. The tokenizer, perhaps? I'm not hugely familiar with what's happening behind the scenes.
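(One way to check the single-thread theory would be to watch per-thread CPU usage during generation; a sketch is below, where the process name to match is a guess on my part, since I'm not sure which backend binary LocalAI spawns for this model.)

```bash
# Show per-thread CPU usage for the inference backend, to see whether one
# thread sits at ~100% while the GPU is mostly idle. "local-ai" is a guessed
# process name; adjust it to whatever ps / nvidia-smi reports.
top -H -p "$(pgrep -n -f local-ai)"
```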
Theoretically, both machines have spare cores, and if they were used in full the GPU would be the bottleneck, so I should be getting:
I've tried playing with various options but may not have hit the right one.
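(The options I've been poking at live in the per-model YAML; a sketch is below, assuming keys along the lines of what the LocalAI docs describe. The file name, model file and values are placeholders, not my actual config.)

```bash
# Sketch of a per-model config for LocalAI's llama.cpp backend.
# Key names follow the LocalAI docs as I understand them; values are placeholders.
cat > models/qwen2.5-32b-instruct.yaml <<'EOF'
name: qwen2.5-32b-instruct
parameters:
  model: qwen2.5-32b-instruct-q4_k_m.gguf
threads: 8        # CPU threads used by the backend
gpu_layers: 99    # offload (up to) all layers to the GPU
f16: true
EOF
```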