CPU-bound performance using GPU-based chat #3874
supersteves asked this question in Q&A
I'm using LocalAI with qwen2.5-32b-instruct (4-bit quantized) to do chat completion. I'm roughly testing performance using the LocalAI chat interface, asking it to translate a 16-line poem to French, after loading and warming up the model first. It's a silly test case I'm using to get reasonably deterministic results for this model. I'm monitoring resource usage with nvidia-smi, top and iotop.
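(For reference, the equivalent API call would look roughly like the sketch below; I'm actually driving this through the web chat UI, so the port and request body are just illustrative.)

```bash
# Rough equivalent of the chat-UI test, against LocalAI's OpenAI-compatible
# endpoint. Port and poem text are placeholders, not my exact request.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-32b-instruct",
        "messages": [
          {"role": "user", "content": "Translate the following poem into French: <16-line poem here>"}
        ]
      }'
```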
I'm comparing two cloud instances of the Google G2 machine type with NVIDIA L4 GPUs.
I'm using Ubuntu with Docker, after installing the recommended Ubuntu NVIDIA drivers and the NVIDIA Container Toolkit. Running as follows.
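(The exact command isn't reproduced here; the sketch below shows that kind of launch, with the image tag, port and model path as assumptions rather than my actual values.)

```bash
# Sketch of launching LocalAI with GPU support via the NVIDIA Container Toolkit.
# The image tag and the in-container models path vary between LocalAI releases;
# check the LocalAI docs for the current ones.
docker run -d --gpus all \
  -p 8080:8080 \
  -v "$PWD/models:/build/models" \
  localai/localai:latest-gpu-nvidia-cuda-12
```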
My test is to translate a given 16-line poem to French (after loading and warming up the model first), while watching top. (iotop shows nothing significant.) In both cases, the test takes exactly 20 seconds to finish responding.
So it doesn't matter how many GPUs or CPUs I have: I'm hitting a CPU bottleneck. Some of the work is happening in a single thread on the CPU, which prevents me from throwing more resources at the problem to get better results. The tokenizer, perhaps? I'm not hugely familiar with what's happening behind the scenes.
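(One way to check the single-thread theory would be to watch per-thread CPU usage during generation; a sketch is below, where the process name to match is a guess on my part, since I'm not sure which backend binary LocalAI spawns for this model.)

```bash
# Show per-thread CPU usage for the inference backend, to see whether one
# thread sits at ~100% while the GPU is mostly idle. "local-ai" is a guessed
# process name; adjust it to whatever ps / nvidia-smi reports.
top -H -p "$(pgrep -n -f local-ai)"
```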
Theoretically, both machines have spare cores, and if they were used in full the GPU would be the bottleneck, so I should be getting:
I've tried playing with various options but may not have hit the right one.
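(The options I've been poking at live in the per-model YAML; a sketch is below, assuming keys along the lines of what the LocalAI docs describe. The file name, model file and values are placeholders, not my actual config.)

```bash
# Sketch of a per-model config for LocalAI's llama.cpp backend.
# Key names follow the LocalAI docs as I understand them; values are placeholders.
cat > models/qwen2.5-32b-instruct.yaml <<'EOF'
name: qwen2.5-32b-instruct
parameters:
  model: qwen2.5-32b-instruct-q4_k_m.gguf
threads: 8        # CPU threads used by the backend
gpu_layers: 99    # offload (up to) all layers to the GPU
f16: true
EOF
```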