Hi,
I'm trying to start research using the model "TheBloke/Llama-2-70B-Chat-GGML".
Before trying the 70B model, I ran a test with the 7B model.
However, CPU and GPU performance on the server show little difference:
it took 8.4 seconds to generate 300 tokens on the CPU (about 36 tokens/s) and 6.2 seconds for 307 tokens on the GPU (about 50 tokens/s).
When I tracked the GPU usage, it ranged from 0% to 15% (about 3% on average).
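For reference, here is a minimal sketch of how utilization can be sampled once per second. It uses pynvml (nvidia-ml-py), which is an assumption about the measurement method, not necessarily the exact tool I used:

```python
# Sketch: sample GPU 0 utilization once per second via pynvml (nvidia-ml-py).
# This is an assumption about the measurement method, not necessarily the
# exact tool used for the numbers above.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
for _ in range(10):  # ten one-second samples
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU: {util.gpu}% | memory: {util.memory}%")
    time.sleep(1)
pynvml.nvmlShutdown()
```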
Below are my questions:
1. What is the expected performance with this server and GPU?
2. It seems like the GPU is not fully utilized. Is this normal, or is there a way to improve it? (See the sketch below.)
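Regarding the second question, my current understanding is that GGML inference stays on the CPU unless layers are explicitly offloaded to the GPU. Here is a minimal sketch of what I believe the fix looks like, assuming llama-cpp-python as the loader (the model path and layer count are hypothetical):

```python
# Minimal sketch, assuming llama-cpp-python as the loader. n_gpu_layers
# controls how many transformer layers are offloaded to the GPU; the default
# of 0 keeps inference on the CPU, which would match the low GPU usage above.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin",  # hypothetical local path
    n_gpu_layers=35,  # 0 = CPU only; raise until VRAM becomes the limit
)

start = time.perf_counter()
out = llm("Summarize quantization in one paragraph.", max_tokens=300)
elapsed = time.perf_counter() - start
tokens = out["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```

My understanding is that the package also needs to be built with cuBLAS support (e.g. `CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python`) for `n_gpu_layers` to take effect; is that correct?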
Thank you for your time.