Hi,
I'm trying to start research using the model "TheBloke/Llama-2-70B-Chat-GGML".
Before trying the 70B model, I ran a test with the 7B model.
However, CPU and GPU performance on the server show little difference:
it took 8.4 seconds to generate 300 tokens on the CPU (about 36 tokens/s) and 6.2 seconds for 307 tokens on the GPU (about 50 tokens/s).
When I tracked the GPU usage, it ranged from 0% to 15% (about 3% on average).
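For reference, here is a minimal sketch of how utilization can be sampled once per second. It uses pynvml (nvidia-ml-py), which is an assumption about the measurement method, not necessarily the exact tool I used:

```python
# Sketch: sample GPU 0 utilization once per second via pynvml (nvidia-ml-py).
# This is an assumption about the measurement method, not necessarily the
# exact tool used for the numbers above.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
for _ in range(10):  # ten one-second samples
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU: {util.gpu}% | memory: {util.memory}%")
    time.sleep(1)
pynvml.nvmlShutdown()
```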
Below are my questions:
1. What is the expected performance with this server and GPU?
2. It seems like the GPU is not fully utilized. Is this normal, or is there a way to improve it? (See the sketch below.)
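Regarding the second question, my current understanding is that GGML inference stays on the CPU unless layers are explicitly offloaded to the GPU. Here is a minimal sketch of what I believe the fix looks like, assuming llama-cpp-python as the loader (the model path and layer count are hypothetical):

```python
# Minimal sketch, assuming llama-cpp-python as the loader. n_gpu_layers
# controls how many transformer layers are offloaded to the GPU; the default
# of 0 keeps inference on the CPU, which would match the low GPU usage above.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin",  # hypothetical local path
    n_gpu_layers=35,  # 0 = CPU only; raise until VRAM becomes the limit
)

start = time.perf_counter()
out = llm("Summarize quantization in one paragraph.", max_tokens=300)
elapsed = time.perf_counter() - start
tokens = out["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```

My understanding is that the package also needs to be built with cuBLAS support (e.g. `CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python`) for `n_gpu_layers` to take effect; is that correct?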
Thank you for your time.