Name and Version
```
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
version: 1941 (ce111d3)
built with Ubuntu clang version 18.1.3 (1ubuntu1) for x86_64-pc-linux-gnu
```
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 5070 Ti
Models
No response
Problem description & steps to reproduce
With some models I see a speed regression when combining `-fa 1` with `-ctk q8_0`.
Examples:
```
llama-bench -m ../models/LGAI-EXAONE_EXAONE-4.0-1.2B-Q8_0.gguf -fa 0,1 -ctk q8_0
```

| model | size | params | backend | ngl | type_k | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| exaone4 1.2B Q8_0 | 1.27 GiB | 1.28 B | CUDA | 99 | q8_0 | 0 | pp512 | 18650.87 ± 98.91 |
| exaone4 1.2B Q8_0 | 1.27 GiB | 1.28 B | CUDA | 99 | q8_0 | 0 | tg128 | 302.96 ± 0.43 |
| exaone4 1.2B Q8_0 | 1.27 GiB | 1.28 B | CUDA | 99 | q8_0 | 1 | pp512 | 1039.32 ± 78.99 |
| exaone4 1.2B Q8_0 | 1.27 GiB | 1.28 B | CUDA | 99 | q8_0 | 1 | tg128 | 111.49 ± 11.46 |
```
llama-bench -m ../models/gemma-3n-E4B-it-UD-Q4_K_XL.gguf -fa 0,1 -ctk q8_0
```

| model | size | params | backend | ngl | type_k | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E4B Q4_K - Medium | 5.01 GiB | 6.87 B | CUDA | 99 | q8_0 | 0 | pp512 | 4810.79 ± 23.46 |
| gemma3n E4B Q4_K - Medium | 5.01 GiB | 6.87 B | CUDA | 99 | q8_0 | 0 | tg128 | 99.83 ± 0.77 |
| gemma3n E4B Q4_K - Medium | 5.01 GiB | 6.87 B | CUDA | 99 | q8_0 | 1 | pp512 | 1235.63 ± 7.37 |
| gemma3n E4B Q4_K - Medium | 5.01 GiB | 6.87 B | CUDA | 99 | q8_0 | 1 | tg128 | 56.76 ± 0.24 |
Other models are not affected:
```
llama-bench -m ../models/SmolLM3-Q4_K_M.gguf -fa 0,1 -ctk q8_0
```

| model | size | params | backend | ngl | type_k | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | CUDA | 99 | q8_0 | 0 | pp512 | 11547.97 ± 35.34 |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | CUDA | 99 | q8_0 | 0 | tg128 | 234.68 ± 0.22 |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | CUDA | 99 | q8_0 | 1 | pp512 | 12815.32 ± 8.10 |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | CUDA | 99 | q8_0 | 1 | tg128 | 243.48 ± 0.27 |
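To put the regression in perspective, here is a small Python sketch that computes the `fa=1` / `fa=0` throughput ratio per test, using only the mean t/s values from the tables above (the model/test labels in the dict are just shorthand for the rows):

```python
# Mean throughput (t/s) pairs from the llama-bench tables above: (fa=0, fa=1).
results = {
    "exaone4 1.2B pp512": (18650.87, 1039.32),
    "exaone4 1.2B tg128": (302.96, 111.49),
    "gemma3n E4B pp512": (4810.79, 1235.63),
    "gemma3n E4B tg128": (99.83, 56.76),
    "smollm3 3B pp512": (11547.97, 12815.32),
    "smollm3 3B tg128": (234.68, 243.48),
}

for name, (fa_off, fa_on) in results.items():
    ratio = fa_on / fa_off
    # ratio < 1 means enabling flash attention slows this test down
    print(f"{name}: fa=1 runs at {ratio:.1%} of fa=0 throughput")
```

The numbers make the pattern stark: exaone4 prompt processing drops to roughly 6% of its non-FA speed and gemma3n token generation to roughly 57%, while smollm3 actually gets slightly faster with `-fa 1`.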
First Bad Commit
No response
Relevant log output
n/a