Name and Version
```
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
version: 1941 (ce111d3)
built with Ubuntu clang version 18.1.3 (1ubuntu1) for x86_64-pc-linux-gnu
```
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 5070 Ti
Models
No response
Problem description & steps to reproduce
With some models I see a speed regression when combining `-fa 1` with `-ctk q8_0`.
Examples:
```
llama-bench -m ../models/LGAI-EXAONE_EXAONE-4.0-1.2B-Q8_0.gguf -fa 0,1 -ctk q8_0
```

| model | size | params | backend | ngl | type_k | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| exaone4 1.2B Q8_0 | 1.27 GiB | 1.28 B | CUDA | 99 | q8_0 | 0 | pp512 | 18650.87 ± 98.91 |
| exaone4 1.2B Q8_0 | 1.27 GiB | 1.28 B | CUDA | 99 | q8_0 | 0 | tg128 | 302.96 ± 0.43 |
| exaone4 1.2B Q8_0 | 1.27 GiB | 1.28 B | CUDA | 99 | q8_0 | 1 | pp512 | 1039.32 ± 78.99 |
| exaone4 1.2B Q8_0 | 1.27 GiB | 1.28 B | CUDA | 99 | q8_0 | 1 | tg128 | 111.49 ± 11.46 |
```
llama-bench -m ../models/gemma-3n-E4B-it-UD-Q4_K_XL.gguf -fa 0,1 -ctk q8_0
```

| model | size | params | backend | ngl | type_k | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E4B Q4_K - Medium | 5.01 GiB | 6.87 B | CUDA | 99 | q8_0 | 0 | pp512 | 4810.79 ± 23.46 |
| gemma3n E4B Q4_K - Medium | 5.01 GiB | 6.87 B | CUDA | 99 | q8_0 | 0 | tg128 | 99.83 ± 0.77 |
| gemma3n E4B Q4_K - Medium | 5.01 GiB | 6.87 B | CUDA | 99 | q8_0 | 1 | pp512 | 1235.63 ± 7.37 |
| gemma3n E4B Q4_K - Medium | 5.01 GiB | 6.87 B | CUDA | 99 | q8_0 | 1 | tg128 | 56.76 ± 0.24 |
Other models are not affected:
```
llama-bench -m ../models/SmolLM3-Q4_K_M.gguf -fa 0,1 -ctk q8_0
```

| model | size | params | backend | ngl | type_k | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | CUDA | 99 | q8_0 | 0 | pp512 | 11547.97 ± 35.34 |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | CUDA | 99 | q8_0 | 0 | tg128 | 234.68 ± 0.22 |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | CUDA | 99 | q8_0 | 1 | pp512 | 12815.32 ± 8.10 |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | CUDA | 99 | q8_0 | 1 | tg128 | 243.48 ± 0.27 |
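To put the regression in perspective, here is a small Python sketch that computes the `fa=1` / `fa=0` throughput ratio per test, using only the mean t/s values from the tables above (the model/test labels in the dict are just shorthand for the rows):

```python
# Mean throughput (t/s) pairs from the llama-bench tables above: (fa=0, fa=1).
results = {
    "exaone4 1.2B pp512": (18650.87, 1039.32),
    "exaone4 1.2B tg128": (302.96, 111.49),
    "gemma3n E4B pp512": (4810.79, 1235.63),
    "gemma3n E4B tg128": (99.83, 56.76),
    "smollm3 3B pp512": (11547.97, 12815.32),
    "smollm3 3B tg128": (234.68, 243.48),
}

for name, (fa_off, fa_on) in results.items():
    ratio = fa_on / fa_off
    # ratio < 1 means enabling flash attention slows this test down
    print(f"{name}: fa=1 runs at {ratio:.1%} of fa=0 throughput")
```

The numbers make the pattern stark: exaone4 prompt processing drops to roughly 6% of its non-FA speed and gemma3n token generation to roughly 57%, while smollm3 actually gets slightly faster with `-fa 1`.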
First Bad Commit
No response
Relevant log output
n/a