Add Conv2d for CPU #14388

Open · am17an wants to merge 7 commits into master from add_conv2d_cpu2

Conversation

am17an (Collaborator) commented Jun 26, 2025

Follow-up to #14320: added a tiled im2col + GEMM approach for the kernels, based on @Acly's excellent suggestion. It is inspired by his branch, which does the CHWN convolution; this one does WHCN, which is a bit different. It only works for F32 for now, but gives sizeable speedups. I will work on the F16 version as well, but am putting this up for review.
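
Roughly, each tile of output pixels is unpacked into a small scratch buffer and multiplied against the kernel matrix, and the caller walks the output tile by tile. Below is an illustrative sketch of that structure (hypothetical names, plain row-major single-image layouts, dilation omitted; the actual kernel handles WHCN, F32/F16 and is vectorized/threaded):

```c
// Illustrative sketch of the tiled im2col + GEMM idea. Names, layouts and the
// tile size are simplified for clarity; this is not the code in the PR.
#include <stdint.h>

#define TILE_ROWS 64  // output pixels unpacked per tile (illustrative)

// Compute `rows` output pixels starting at flattened index `row0`.
// `scratch` must hold rows * KW*KH*IC floats.
static void conv2d_tile(const float * src,     // input,  [IC][IH][IW]
                        const float * kernel,  // kernel, [KW*KH*IC][OC]
                        float * dst,           // output, [OH*OW][OC]
                        int64_t row0, int64_t rows,
                        int64_t KW, int64_t KH, int64_t IC, int64_t OC,
                        int64_t IW, int64_t IH, int64_t OW,
                        int s0, int s1, int p0, int p1,
                        float * scratch) {
    const int64_t K = KW*KH*IC;

    // 1) im2col, but only for this tile of output pixels
    for (int64_t r = 0; r < rows; ++r) {
        const int64_t ox = (row0 + r) % OW;
        const int64_t oy = (row0 + r) / OW;
        for (int64_t ic = 0; ic < IC; ++ic) {
            for (int64_t ky = 0; ky < KH; ++ky) {
                for (int64_t kx = 0; kx < KW; ++kx) {
                    const int64_t ix  = ox*s0 + kx - p0;
                    const int64_t iy  = oy*s1 + ky - p1;
                    const int64_t col = (ic*KH + ky)*KW + kx;
                    scratch[r*K + col] =
                        (ix < 0 || iy < 0 || ix >= IW || iy >= IH)
                            ? 0.0f : src[(ic*IH + iy)*IW + ix];
                }
            }
        }
    }

    // 2) small GEMM on the tile: dst_tile[rows][OC] = scratch[rows][K] * kernel[K][OC]
    for (int64_t r = 0; r < rows; ++r) {
        for (int64_t oc = 0; oc < OC; ++oc) {
            float sum = 0.0f;
            for (int64_t k = 0; k < K; ++k) {
                sum += scratch[r*K + k] * kernel[k*OC + oc];
            }
            dst[(row0 + r)*OC + oc] = sum;
        }
    }
}

// Driver: walk the output in fixed-size tiles so the scratch buffer stays small.
// (In the real code the tile size comes from a work-size budget and the tiles
// are distributed across threads.)
static void conv2d_sketch(const float * src, const float * kernel, float * dst,
                          int64_t KW, int64_t KH, int64_t IC, int64_t OC,
                          int64_t IW, int64_t IH, int64_t OW, int64_t OH,
                          int s0, int s1, int p0, int p1,
                          float * scratch /* TILE_ROWS * KW*KH*IC floats */) {
    const int64_t total = OW*OH;
    for (int64_t row0 = 0; row0 < total; row0 += TILE_ROWS) {
        const int64_t rows = total - row0 < TILE_ROWS ? total - row0 : TILE_ROWS;
        conv2d_tile(src, kernel, dst, row0, rows,
                    KW, KH, IC, OC, IW, IH, OW, s0, s1, p0, p1, scratch);
    }
}
```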

Performance
| Input Size | Kernel | Config | IM2COL (ms) | SIMD (ms) | Speedup | Match | Type |
|---|---|---|---|---|---|---|---|
| 224x224x3 | 7x7x3→64 | s2 p3 | 11.988 | 8.594 | 1.39x | SIMD | F32 |
| 56x56x64 | 1x1x64→64 | s1 p0 | 1.491 | 1.149 | 1.30x | SIMD | F32 |
| 56x56x64 | 3x3x64→64 | s1 p1 | 6.683 | 3.417 | 1.96x | SIMD | F32 |
| 56x56x64 | 1x1x64→256 | s1 p0 | 4.056 | 1.672 | 2.43x | SIMD | F32 |
| 28x28x256 | 1x1x256→128 | s1 p0 | 1.162 | 0.865 | 1.34x | SIMD | F32 |
| 28x28x128 | 3x3x128→128 | s1 p1 | 3.376 | 1.403 | 2.41x | SIMD | F32 |
| 28x28x128 | 1x1x128→512 | s1 p0 | 2.010 | 1.407 | 1.43x | SIMD | F32 |
| 14x14x512 | 1x1x512→256 | s1 p0 | 0.699 | 0.482 | 1.45x | SIMD | F32 |
| 14x14x256 | 3x3x256→256 | s1 p1 | 2.260 | 1.043 | 2.17x | SIMD | F32 |
| 14x14x256 | 1x1x256→1024 | s1 p0 | 1.347 | 1.099 | 1.23x | SIMD | F32 |
| 7x7x1024 | 1x1x1024→512 | s1 p0 | 0.654 | 0.369 | 1.77x | SIMD | F32 |
| 7x7x512 | 3x3x512→512 | s1 p1 | 2.564 | 1.044 | 2.46x | SIMD | F32 |
| 7x7x512 | 1x1x512→2048 | s1 p0 | 1.612 | 0.855 | 1.89x | SIMD | F32 |
| 224x224x3 | 3x3x3→64 | s1 p1 | 23.499 | 25.072 | 0.97x | SIMD | F32 |
| 224x224x16 | 3x3x16→64 | s1 p1 | 34.263 | 24.667 | 1.39x | SIMD | F32 |
| 112x112x64 | 3x3x64→128 | s1 p1 | 30.443 | 15.720 | 1.94x | SIMD | F32 |
| 112x112x16 | 3x3x16→16 | s1 p1 | 7.382 | 1.543 | 4.78x | SIMD | F32 |
| 56x56x128 | 3x3x128→256 | s1 p1 | 18.871 | 8.376 | 2.25x | SIMD | F32 |
| 56x56x256 | 3x3x256→256 | s1 p1 | 34.579 | 16.293 | 2.12x | SIMD | F32 |
| 28x28x256 | 3x3x256→512 | s1 p1 | 13.815 | 9.819 | 1.41x | SIMD | F32 |
| 28x28x512 | 3x3x512→512 | s1 p1 | 26.577 | 15.285 | 1.74x | SIMD | F32 |
| 112x112x32 | 3x3x32→32 | s1 p1 | 12.815 | 4.811 | 2.66x | SIMD | F32 |
| 112x112x32 | 1x1x32→64 | s1 p0 | 4.261 | 2.128 | 2.00x | SIMD | F32 |
| 56x56x64 | 1x1x64→128 | s1 p0 | 2.268 | 1.243 | 1.82x | SIMD | F32 |
| 28x28x128 | 1x1x128→256 | s1 p0 | 1.241 | 0.969 | 1.28x | SIMD | F32 |
| 224x224x3 | 3x3x3→32 | s2 p1 | 3.636 | 2.232 | 1.63x | SIMD | F32 |
| 112x112x16 | 3x3x16→96 | s1 p1 | 11.007 | 4.500 | 2.45x | SIMD | F32 |
| 112x112x24 | 1x1x24→14 | s1 p0 | 1.897 | 0.997 | 1.90x | SIMD | F32 |
| 56x56x40 | 1x1x40→24 | s1 p0 | 0.676 | 0.287 | 2.36x | SIMD | F32 |
| 56x56x96 | 7x7x96→96 | s1 p3 | 52.281 | 20.812 | 2.51x | SIMD | F32 |
| 28x28x192 | 7x7x192→192 | s1 p3 | 28.765 | 12.747 | 2.26x | SIMD | F32 |
| 14x14x384 | 7x7x384→384 | s1 p3 | 20.449 | 11.509 | 1.78x | SIMD | F32 |
| 224x224x3 | 3x3x3→64 | s1 p1 | 110.537 | 78.612 | 1.41x | SIMD | F32 |
| 56x56x64 | 3x3x64→64 | s1 p1 | 26.656 | 8.599 | 3.10x | SIMD | F32 |
| 28x28x128 | 1x1x128→64 | s1 p0 | 3.096 | 0.831 | 3.73x | SIMD | F32 |
| 512x512x3 | 3x3x3→1 | s1 p0 | 22.431 | 8.468 | 2.65x | SIMD | F32 |
| 256x512x3 | 3x3x3→9 | s1 p0 | 16.887 | 9.996 | 1.69x | SIMD | F32 |
| 896x896x1 | 3x3x1→1 | s1 p0 | 32.143 | 11.903 | 2.70x | SIMD | F32 |
| 224x224x3 | 3x3x3→768 | s1 p1 | 218.972 | 201.482 | 1.09x | SIMD | F32 |

EDIT: added the F16 version as well.

am17an requested a review from ggerganov on June 26, 2025 10:03
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 26, 2025
am17an requested reviews from slaren and CISC on June 26, 2025 13:36
am17an force-pushed the add_conv2d_cpu2 branch from 556da82 to 3afb657 on June 26, 2025 13:50
am17an changed the title from "Add conv2d cpu2" to "Add conv2d - cpu version" on Jun 26, 2025
am17an changed the title from "Add conv2d - cpu version" to "Add Conv2d for CPU" on Jun 26, 2025
am17an force-pushed the add_conv2d_cpu2 branch 2 times, most recently from dd75e85 to 29b85f1 on June 26, 2025 15:36
am17an (Collaborator, Author) commented Jun 28, 2025

This now works for F16 operations as well; the speedup is smaller. The additional memory use is consistent with GGML_IM2COL_WORK_SIZE.

| Input Size | Kernel | Config | IM2COL (ms) | SIMD (ms) | Speedup | Type |
|---|---|---|---|---|---|---|
| 112x112x32 | 2x2x32→64 | s1 p0 | 6.472 | 6.797 | 0.95x | F16 |
| 56x56x64 | 1x1x64→128 | s1 p0 | 2.061 | 1.792 | 1.15x | F16 |
| 112x112x32 | 3x3x32→32 | s1 p1 | 8.819 | 6.706 | 1.32x | F16 |
| 512x512x3 | 7x7x3→64 | s2 p3 | 473.110 | 382.644 | 1.24x | F16 |
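
For intuition, the per-tile scratch scales with the kernel footprint, so the number of rows unpacked per tile can be derived from the work-size budget roughly like this (illustrative helper only; the actual constant and formula are in the PR's CPU conv2d code):

```c
// Illustrative only: how a fixed work-size budget bounds the rows unpacked per tile.
#include <stdint.h>

static int64_t tile_rows_for_budget(int64_t work_size_bytes,
                                    int64_t KW, int64_t KH, int64_t IC,
                                    int64_t type_size) {   // 4 for F32, 2 for F16
    const int64_t row_bytes = KW*KH*IC*type_size;          // one unpacked output pixel
    const int64_t rows      = work_size_bytes / row_bytes;
    return rows > 0 ? rows : 1;                            // always make progress
}
```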

Tested further by changing all ggml_conv2d calls to ggml_conv2d_direct in tools/mtmd/clip.cpp and running the mtmd tests:

+ set +x
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
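
For reference, the clip.cpp change amounts to swapping the graph-building call. The lines below are illustrative only (the tensor and argument names are made up, using the ggml_conv_2d / ggml_conv_2d_direct spellings from ggml.h) and assume the direct variant takes the same stride/padding/dilation arguments as ggml_conv_2d:

```c
// hypothetical tensors: kernel_w = conv weights, inp = input image tensor
// before: im2col + matmul path
cur = ggml_conv_2d       (ctx0, kernel_w, inp, s0, s1, p0, p1, d0, d1);
// after:  direct tiled im2col + GEMM path from this PR
cur = ggml_conv_2d_direct(ctx0, kernel_w, inp, s0, s1, p0, p1, d0, d1);
```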

am17an force-pushed the add_conv2d_cpu2 branch from 8b5c0a4 to 447e7b4 on June 28, 2025 09:57
rmatif (Contributor) commented Jun 28, 2025

Just tested it with stable-diffusion.cpp. I couldn't notice any significant difference in inference speed, but the memory savings during the VAE phase are huge, about 2.6x less memory usage: 640 MB vs 1664 MB for 512x512, and 2560 MB vs 6656 MB for 1024x1024.

am17an (Collaborator, Author) commented Jun 28, 2025

@rmatif thank you for testing!
Correct me if I'm wrong, but I think conv2d is not the most time-consuming step in stable diffusion. If so, it probably won't improve inference speed noticeably there, but it might for some convolution-heavy networks.

Acly (Contributor) commented Jun 28, 2025

Looks great!

One thing to consider is to reuse/merge with the existing im2col code for the stand-alone OP_IM2COL, so there aren't 2 implementations doing essentially the same thing. However, when I had a brief go at this, it turned out to require quite a lot of boilerplate code and made things somewhat less readable, so it might not be worth it.

am17an (Collaborator, Author) commented Jun 29, 2025

> One thing to consider is to reuse/merge with the existing im2col code for the stand-alone OP_IM2COL, so there aren't 2 implementations doing essentially the same thing. However, when I had a brief go at this, it turned out to require quite a lot of boilerplate code and made things somewhat less readable, so it might not be worth it.

Yes, I also thought of doing that and came to the same conclusion; it would just replace those for loops with a lot of boilerplate.

rmatif (Contributor) commented Jun 29, 2025

> @rmatif thank you for testing! Correct me if I'm wrong, but I think conv2d is not the most time-consuming step in stable diffusion. If so, it probably won't improve inference speed noticeably there, but it might for some convolution-heavy networks.

You're correct, at high resolution the attention blocks dominate compute time. I guess the speedup is not big enough to show up at low resolution, or even in the VAE phase where we have fewer attention blocks.

@@ -984,7 +985,7 @@ static const char * GGML_OP_NAME[GGML_OP_COUNT] = {
     "OPT_STEP_ADAMW",
 };
 
-static_assert(GGML_OP_COUNT == 84, "GGML_OP_COUNT != 84");
+static_assert(GGML_OP_COUNT == 85, "GGML_OP_COUNT != 85");
Collaborator

Heads up, even though GitHub gives no indication of it (since it's identical), this now conflicts with master.

am17an (Collaborator, Author) replied Jun 29, 2025

I see! I ran into the same issue on a prior rebase of this PR. I will rebase once this is ready to merge.

@@ -1081,7 +1083,7 @@ static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
     "adamw(x)",
 };
 
-static_assert(GGML_OP_COUNT == 84, "GGML_OP_COUNT != 84");
+static_assert(GGML_OP_COUNT == 85, "GGML_OP_COUNT != 85");
Collaborator

This too.

Labels: ggml (changes relating to the ggml tensor library for machine learning)

4 participants