Add Conv2d for CPU #14388

Open · am17an wants to merge 7 commits into master from add_conv2d_cpu2

Conversation

am17an (Collaborator) commented Jun 26, 2025

Follow-up to #14320: added a tiled im2col + GEMM approach for the kernels, based on @Acly's excellent suggestion. It is inspired by his branch, which does the CHWN convolution; this one does WHCN, which is a bit different. It only works for F32 for now, but gives sizeable speedups. I will work on the F16 version as well, but am putting this up for review.
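
Roughly, each tile of output pixels is unpacked into a small scratch buffer and multiplied against the kernel matrix, and the caller walks the output tile by tile. Below is an illustrative sketch of that structure (hypothetical names, plain row-major single-image layouts, dilation omitted; the actual kernel handles WHCN, F32/F16 and is vectorized/threaded):

```c
// Illustrative sketch of the tiled im2col + GEMM idea. Names, layouts and the
// tile size are simplified for clarity; this is not the code in the PR.
#include <stdint.h>

#define TILE_ROWS 64  // output pixels unpacked per tile (illustrative)

// Compute `rows` output pixels starting at flattened index `row0`.
// `scratch` must hold rows * KW*KH*IC floats.
static void conv2d_tile(const float * src,     // input,  [IC][IH][IW]
                        const float * kernel,  // kernel, [KW*KH*IC][OC]
                        float * dst,           // output, [OH*OW][OC]
                        int64_t row0, int64_t rows,
                        int64_t KW, int64_t KH, int64_t IC, int64_t OC,
                        int64_t IW, int64_t IH, int64_t OW,
                        int s0, int s1, int p0, int p1,
                        float * scratch) {
    const int64_t K = KW*KH*IC;

    // 1) im2col, but only for this tile of output pixels
    for (int64_t r = 0; r < rows; ++r) {
        const int64_t ox = (row0 + r) % OW;
        const int64_t oy = (row0 + r) / OW;
        for (int64_t ic = 0; ic < IC; ++ic) {
            for (int64_t ky = 0; ky < KH; ++ky) {
                for (int64_t kx = 0; kx < KW; ++kx) {
                    const int64_t ix  = ox*s0 + kx - p0;
                    const int64_t iy  = oy*s1 + ky - p1;
                    const int64_t col = (ic*KH + ky)*KW + kx;
                    scratch[r*K + col] =
                        (ix < 0 || iy < 0 || ix >= IW || iy >= IH)
                            ? 0.0f : src[(ic*IH + iy)*IW + ix];
                }
            }
        }
    }

    // 2) small GEMM on the tile: dst_tile[rows][OC] = scratch[rows][K] * kernel[K][OC]
    for (int64_t r = 0; r < rows; ++r) {
        for (int64_t oc = 0; oc < OC; ++oc) {
            float sum = 0.0f;
            for (int64_t k = 0; k < K; ++k) {
                sum += scratch[r*K + k] * kernel[k*OC + oc];
            }
            dst[(row0 + r)*OC + oc] = sum;
        }
    }
}

// Driver: walk the output in fixed-size tiles so the scratch buffer stays small.
// (In the real code the tile size comes from a work-size budget and the tiles
// are distributed across threads.)
static void conv2d_sketch(const float * src, const float * kernel, float * dst,
                          int64_t KW, int64_t KH, int64_t IC, int64_t OC,
                          int64_t IW, int64_t IH, int64_t OW, int64_t OH,
                          int s0, int s1, int p0, int p1,
                          float * scratch /* TILE_ROWS * KW*KH*IC floats */) {
    const int64_t total = OW*OH;
    for (int64_t row0 = 0; row0 < total; row0 += TILE_ROWS) {
        const int64_t rows = total - row0 < TILE_ROWS ? total - row0 : TILE_ROWS;
        conv2d_tile(src, kernel, dst, row0, rows,
                    KW, KH, IC, OC, IW, IH, OW, s0, s1, p0, p1, scratch);
    }
}
```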

Performance
| Input Size | Kernel | Config | IM2COL (ms) | SIMD (ms) | Speedup | Match | Type |
|---|---|---|---|---|---|---|---|
| 224x224x3 | 7x7x3→64 | s2 p3 | 11.988 | 8.594 | 1.39x | SIMD | F32 |
| 56x56x64 | 1x1x64→64 | s1 p0 | 1.491 | 1.149 | 1.30x | SIMD | F32 |
| 56x56x64 | 3x3x64→64 | s1 p1 | 6.683 | 3.417 | 1.96x | SIMD | F32 |
| 56x56x64 | 1x1x64→256 | s1 p0 | 4.056 | 1.672 | 2.43x | SIMD | F32 |
| 28x28x256 | 1x1x256→128 | s1 p0 | 1.162 | 0.865 | 1.34x | SIMD | F32 |
| 28x28x128 | 3x3x128→128 | s1 p1 | 3.376 | 1.403 | 2.41x | SIMD | F32 |
| 28x28x128 | 1x1x128→512 | s1 p0 | 2.010 | 1.407 | 1.43x | SIMD | F32 |
| 14x14x512 | 1x1x512→256 | s1 p0 | 0.699 | 0.482 | 1.45x | SIMD | F32 |
| 14x14x256 | 3x3x256→256 | s1 p1 | 2.260 | 1.043 | 2.17x | SIMD | F32 |
| 14x14x256 | 1x1x256→1024 | s1 p0 | 1.347 | 1.099 | 1.23x | SIMD | F32 |
| 7x7x1024 | 1x1x1024→512 | s1 p0 | 0.654 | 0.369 | 1.77x | SIMD | F32 |
| 7x7x512 | 3x3x512→512 | s1 p1 | 2.564 | 1.044 | 2.46x | SIMD | F32 |
| 7x7x512 | 1x1x512→2048 | s1 p0 | 1.612 | 0.855 | 1.89x | SIMD | F32 |
| 224x224x3 | 3x3x3→64 | s1 p1 | 23.499 | 25.072 | 0.97x | SIMD | F32 |
| 224x224x16 | 3x3x16→64 | s1 p1 | 34.263 | 24.667 | 1.39x | SIMD | F32 |
| 112x112x64 | 3x3x64→128 | s1 p1 | 30.443 | 15.720 | 1.94x | SIMD | F32 |
| 112x112x16 | 3x3x16→16 | s1 p1 | 7.382 | 1.543 | 4.78x | SIMD | F32 |
| 56x56x128 | 3x3x128→256 | s1 p1 | 18.871 | 8.376 | 2.25x | SIMD | F32 |
| 56x56x256 | 3x3x256→256 | s1 p1 | 34.579 | 16.293 | 2.12x | SIMD | F32 |
| 28x28x256 | 3x3x256→512 | s1 p1 | 13.815 | 9.819 | 1.41x | SIMD | F32 |
| 28x28x512 | 3x3x512→512 | s1 p1 | 26.577 | 15.285 | 1.74x | SIMD | F32 |
| 112x112x32 | 3x3x32→32 | s1 p1 | 12.815 | 4.811 | 2.66x | SIMD | F32 |
| 112x112x32 | 1x1x32→64 | s1 p0 | 4.261 | 2.128 | 2.00x | SIMD | F32 |
| 56x56x64 | 1x1x64→128 | s1 p0 | 2.268 | 1.243 | 1.82x | SIMD | F32 |
| 28x28x128 | 1x1x128→256 | s1 p0 | 1.241 | 0.969 | 1.28x | SIMD | F32 |
| 224x224x3 | 3x3x3→32 | s2 p1 | 3.636 | 2.232 | 1.63x | SIMD | F32 |
| 112x112x16 | 3x3x16→96 | s1 p1 | 11.007 | 4.500 | 2.45x | SIMD | F32 |
| 112x112x24 | 1x1x24→14 | s1 p0 | 1.897 | 0.997 | 1.90x | SIMD | F32 |
| 56x56x40 | 1x1x40→24 | s1 p0 | 0.676 | 0.287 | 2.36x | SIMD | F32 |
| 56x56x96 | 7x7x96→96 | s1 p3 | 52.281 | 20.812 | 2.51x | SIMD | F32 |
| 28x28x192 | 7x7x192→192 | s1 p3 | 28.765 | 12.747 | 2.26x | SIMD | F32 |
| 14x14x384 | 7x7x384→384 | s1 p3 | 20.449 | 11.509 | 1.78x | SIMD | F32 |
| 224x224x3 | 3x3x3→64 | s1 p1 | 110.537 | 78.612 | 1.41x | SIMD | F32 |
| 56x56x64 | 3x3x64→64 | s1 p1 | 26.656 | 8.599 | 3.10x | SIMD | F32 |
| 28x28x128 | 1x1x128→64 | s1 p0 | 3.096 | 0.831 | 3.73x | SIMD | F32 |
| 512x512x3 | 3x3x3→1 | s1 p0 | 22.431 | 8.468 | 2.65x | SIMD | F32 |
| 256x512x3 | 3x3x3→9 | s1 p0 | 16.887 | 9.996 | 1.69x | SIMD | F32 |
| 896x896x1 | 3x3x1→1 | s1 p0 | 32.143 | 11.903 | 2.70x | SIMD | F32 |
| 224x224x3 | 3x3x3→768 | s1 p1 | 218.972 | 201.482 | 1.09x | SIMD | F32 |

EDIT: added the F16 version as well.

am17an requested a review from ggerganov on June 26, 2025 10:03
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 26, 2025
am17an requested reviews from slaren and CISC on June 26, 2025 13:36
am17an force-pushed the add_conv2d_cpu2 branch from 556da82 to 3afb657 on June 26, 2025 13:50
am17an changed the title from "Add conv2d cpu2" to "Add conv2d - cpu version" on Jun 26, 2025
am17an changed the title from "Add conv2d - cpu version" to "Add Conv2d for CPU" on Jun 26, 2025
am17an force-pushed the add_conv2d_cpu2 branch 2 times, most recently from dd75e85 to 29b85f1 on June 26, 2025 15:36
am17an (Collaborator, Author) commented Jun 28, 2025

This now works for F16 operations as well; the speedup is smaller. The additional memory use is consistent with GGML_IM2COL_WORK_SIZE.

| Input Size | Kernel | Config | IM2COL (ms) | SIMD (ms) | Speedup | Type |
|---|---|---|---|---|---|---|
| 112x112x32 | 2x2x32→64 | s1 p0 | 6.472 | 6.797 | 0.95x | F16 |
| 56x56x64 | 1x1x64→128 | s1 p0 | 2.061 | 1.792 | 1.15x | F16 |
| 112x112x32 | 3x3x32→32 | s1 p1 | 8.819 | 6.706 | 1.32x | F16 |
| 512x512x3 | 7x7x3→64 | s2 p3 | 473.110 | 382.644 | 1.24x | F16 |
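
For intuition, the per-tile scratch scales with the kernel footprint, so the number of rows unpacked per tile can be derived from the work-size budget roughly like this (illustrative helper only; the actual constant and formula are in the PR's CPU conv2d code):

```c
// Illustrative only: how a fixed work-size budget bounds the rows unpacked per tile.
#include <stdint.h>

static int64_t tile_rows_for_budget(int64_t work_size_bytes,
                                    int64_t KW, int64_t KH, int64_t IC,
                                    int64_t type_size) {   // 4 for F32, 2 for F16
    const int64_t row_bytes = KW*KH*IC*type_size;          // one unpacked output pixel
    const int64_t rows      = work_size_bytes / row_bytes;
    return rows > 0 ? rows : 1;                            // always make progress
}
```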

Tested further by changing all ggml_conv2d calls to ggml_conv2d_direct in tools/mtmd/clip.cpp and running the mtmd tests:

+ set +x
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
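
For reference, the clip.cpp change amounts to swapping the graph-building call. The lines below are illustrative only (the tensor and argument names are made up, using the ggml_conv_2d / ggml_conv_2d_direct spellings from ggml.h) and assume the direct variant takes the same stride/padding/dilation arguments as ggml_conv_2d:

```c
// hypothetical tensors: kernel_w = conv weights, inp = input image tensor
// before: im2col + matmul path
cur = ggml_conv_2d       (ctx0, kernel_w, inp, s0, s1, p0, p1, d0, d1);
// after:  direct tiled im2col + GEMM path from this PR
cur = ggml_conv_2d_direct(ctx0, kernel_w, inp, s0, s1, p0, p1, d0, d1);
```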

am17an force-pushed the add_conv2d_cpu2 branch from 8b5c0a4 to 447e7b4 on June 28, 2025 09:57
rmatif (Contributor) commented Jun 28, 2025

Just tested it with stable-diffusion.cpp. I couldn't notice any significant difference in inference speed, but the memory savings during the VAE phase are huge, about 2.6x less memory usage: 640 MB vs 1664 MB for 512x512, and 2560 MB vs 6656 MB for 1024x1024.

am17an (Collaborator, Author) commented Jun 28, 2025

@rmatif thank you for testing!
Correct me if I'm wrong, but I think conv2d is not the most time-consuming step in stable diffusion. If so, it probably won't improve inference speed noticeably there, but it might for some convolution-heavy networks.

Acly (Contributor) commented Jun 28, 2025

Looks great!

One thing to consider is to reuse/merge with the existing im2col code for the stand-alone OP_IM2COL, so there aren't 2 implementations doing essentially the same thing. However, when I had a brief go at this, it turned out to require quite a lot of boilerplate code and made things somewhat less readable, so it might not be worth it.

am17an (Collaborator, Author) commented Jun 29, 2025

> One thing to consider is to reuse/merge with the existing im2col code for the stand-alone OP_IM2COL, so there aren't 2 implementations doing essentially the same thing. However, when I had a brief go at this, it turned out to require quite a lot of boilerplate code and made things somewhat less readable, so it might not be worth it.

Yes, I also thought of doing that and came to the same conclusion; it would just replace those for loops with a lot of boilerplate.

rmatif (Contributor) commented Jun 29, 2025

> @rmatif thank you for testing! Correct me if I'm wrong, but I think conv2d is not the most time-consuming step in stable diffusion. If so, it probably won't improve inference speed noticeably there, but it might for some convolution-heavy networks.

You're correct, at high resolution the attention blocks dominate compute time. I guess the speedup is not big enough to show up at low resolution, or even in the VAE phase where we have fewer attention blocks.

@@ -984,7 +985,7 @@ static const char * GGML_OP_NAME[GGML_OP_COUNT] = {
     "OPT_STEP_ADAMW",
 };
 
-static_assert(GGML_OP_COUNT == 84, "GGML_OP_COUNT != 84");
+static_assert(GGML_OP_COUNT == 85, "GGML_OP_COUNT != 85");
Collaborator

Heads up, even though GitHub gives no indication of it (since it's identical), this now conflicts with master.

am17an (Collaborator, Author) replied Jun 29, 2025

I see! I ran into the same issue on a prior rebase of this PR. I will rebase once this is ready to merge.

@@ -1081,7 +1083,7 @@ static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
     "adamw(x)",
 };
 
-static_assert(GGML_OP_COUNT == 84, "GGML_OP_COUNT != 84");
+static_assert(GGML_OP_COUNT == 85, "GGML_OP_COUNT != 85");
Collaborator

This too.

Labels: ggml (changes relating to the ggml tensor library for machine learning)

4 participants