
OpenCL: add conv2d kernel #14403


Open · wants to merge 3 commits into master

Conversation

@rmatif (Contributor) commented Jun 26, 2025

Following up on #14316 and #14388, this PR adds a direct conv2d kernel for OpenCL. To maximize performance, the kernel uses a mixed-precision approach: data is staged in local memory as FP16 to save bandwidth, and the core operations are vectorized with float4 for higher throughput.
Because of this, a comparison with the indirect conv2d implementation is not done at identical precision, so it is not entirely fair. Since this kernel is mainly designed for Adreno GPUs, I thought we could sacrifice some accuracy in exchange for maximum performance, as memory bandwidth is a significant bottleneck on these devices. As a result, some tests fail by a small margin due to the precision differences; I hope that is still acceptable!
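
To illustrate the idea, here is a minimal, simplified OpenCL C sketch of the mixed-precision pattern (FP16 staging in local memory, float4 accumulation). It is not the actual kernel added by this PR; the kernel name, TILE_K, and the toy reduction it computes are only for illustration, and it assumes cl_khr_fp16 support and K being a multiple of TILE_K.

```c
// Toy reduction standing in for the per-output accumulation of a direct conv2d:
//   out[g] = dot(src[g*K .. g*K+K), weights[0 .. K))
// Weights are staged in local memory as fp16 to halve local-memory traffic;
// the inner loop accumulates in float4.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable   // assumed available (Adreno supports it)

#define TILE_K 64   // illustrative tile size; assumes K % TILE_K == 0 and TILE_K % 4 == 0

__kernel void mixed_precision_dot(__global const float *src,
                                  __global const float *weights,
                                  __global float       *out,
                                  const int             K) {
    __local half w_tile[TILE_K];                 // fp16 staging buffer

    const int gid = (int) get_global_id(0);
    const int lid = (int) get_local_id(0);
    const int lsz = (int) get_local_size(0);

    float4 acc = (float4)(0.0f);

    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        // cooperatively stage one tile of weights into local memory as fp16
        for (int i = lid; i < TILE_K; i += lsz) {
            vstore_half(weights[k0 + i], i, w_tile);
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // vectorized accumulation: half is converted back to float on load
        for (int i = 0; i < TILE_K; i += 4) {
            float4 s = vload4(0, src + gid * K + k0 + i);
            float4 w = vload_half4(i / 4, w_tile);
            acc = mad(s, w, acc);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    out[gid] = acc.x + acc.y + acc.z + acc.w;
}
```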

I am opening this PR to gather feedback and to see whether this performance/accuracy trade-off is acceptable.

Performance:

| GPU | Direct (GFLOPS) | Indirect (GFLOPS) | Speedup |
|---|---|---|---|
| Adreno 830 | 520.74 | 38.02 | 13.70x |
| Adreno 750 | 385.77 | 27.28 | 14.14x |
| Adreno 740 | 211.38 | 25.12 | 8.42x |
| Adreno 730 | 158.83 | 19.34 | 8.21x |

@lhez @max-krasnyansky

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jun 26, 2025
@etasnadi (Contributor) commented

It seems that your kernel is the OpenCL vectorized version of the Vulkan kernel I proposed, but I do not see this kind of performance improvement on Vulkan over the indirect implementation. You might want to disable vectorized access to see what causes the improvement.

@rmatif (Contributor, Author) commented Jun 27, 2025

> It seems that your kernel is the OpenCL vectorized version of the Vulkan kernel I proposed, but I do not see this kind of performance improvement on Vulkan over the indirect implementation. You might want to disable vectorized access to see what causes the improvement.

I have taken inspiration from your CUDA implementation (thanks for it!), so it's a pretty similar approach.

After disabling vectorization, the scalar kernel achieves 182.18 GFLOPS on the Adreno 830. I think the significant speedup over the indirect implementation is mainly due to the current OpenCL backend being unoptimized, rather than to any specific feature of the new kernel. The im2col kernel has poor memory access patterns and performs worse than the CPU implementation, while the subsequent mul_mat operation falls back to a generic f16 kernel that is not well optimized; only the q4_0 kernels show good performance at the moment.
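
For context on what the indirect path does: conv2d is lowered to an im2col expansion followed by an ordinary mul_mat. A naive sketch of that lowering in OpenCL C (not ggml's actual im2col kernel; layout and parameter names are simplified for illustration) looks roughly like the following, and the scattered per-pixel gathers are the kind of access pattern that ends up bandwidth-bound on these GPUs.

```c
// src: [C, H, W]          single image, channel-major layout (illustrative)
// dst: [OH*OW, C*KH*KW]   one row per output pixel, ready for a plain matmul
__kernel void im2col_naive(__global const float *src,
                           __global float       *dst,
                           const int C,  const int H,  const int W,
                           const int KH, const int KW,
                           const int OH, const int OW,
                           const int stride, const int pad) {
    const int op = (int) get_global_id(0);       // output pixel index
    if (op >= OH * OW) return;

    const int oy = op / OW;
    const int ox = op % OW;

    // each output pixel gathers its KHxKW receptive field from every channel
    for (int c = 0; c < C; ++c) {
        for (int ky = 0; ky < KH; ++ky) {
            for (int kx = 0; kx < KW; ++kx) {
                const int iy = oy * stride + ky - pad;
                const int ix = ox * stride + kx - pad;
                const float v = (iy < 0 || iy >= H || ix < 0 || ix >= W)
                                    ? 0.0f
                                    : src[(c * H + iy) * W + ix];
                dst[op * (C * KH * KW) + (c * KH + ky) * KW + kx] = v;
            }
        }
    }
}
```

The conv result is then the [OH*OW, C*KH*KW] expanded matrix multiplied by the [C*KH*KW, OC] weight matrix, i.e. an ordinary mul_mat over an expanded copy of the input.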

@etasnadi (Contributor) commented

> > It seems that your kernel is the OpenCL vectorized version of the Vulkan kernel I proposed, but I do not see this kind of performance improvement on Vulkan over the indirect implementation. You might want to disable vectorized access to see what causes the improvement.
>
> I have taken inspiration from your CUDA implementation (thanks for it!), so it's a pretty similar approach.
>
> After disabling vectorization, the scalar kernel achieves 182.18 GFLOPS on the Adreno 830. I think the significant speedup over the indirect implementation is mainly due to the current OpenCL backend being unoptimized, rather than to any specific feature of the new kernel. The im2col kernel has poor memory access patterns and performs worse than the CPU implementation, while the subsequent mul_mat operation falls back to a generic f16 kernel that is not well optimized; only the q4_0 kernels show good performance at the moment.

It's good to know -- that could be the reason. I also observed this on Vulkan: the direct kernel is faster because the mul_mat kernel is not well enough optimized (at least not for my device), while the direct kernel is more or less better optimized for my device.

I also ported the direct kernel to CUDA and found that the indirect im2col + cuBLAS-based mul_mat is ~33% faster than my direct kernel on Turing (the cuBLAS matmul is very highly optimized). I find this promising because there are lots of opportunities for optimization in the direct kernel (eliminating bank conflicts, warp-tiling, double buffering, faster computation of the offsets), so the direct kernel could become on par with the highly optimized indirect path in performance while not wasting lots of memory the way im2col does.
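
As a concrete example of the first item, here is a minimal sketch of bank-conflict avoidance via padding of the local/shared tile. It is written in OpenCL C for consistency with the rest of this PR (the same trick applies to a CUDA shared-memory tile) and uses the textbook tiled transpose rather than code from either conv kernel.

```c
#define TILE 32   // illustrative tile size; launch with a TILE x TILE work-group

__kernel void transpose_padded(__global const float *src,
                               __global float       *dst,
                               const int n) {               // n x n matrix
    // +1 column of padding: without it, the transposed reads below would
    // stride by exactly TILE floats and all hit the same memory bank
    __local float tile[TILE][TILE + 1];

    const int gx = (int) get_global_id(0);
    const int gy = (int) get_global_id(1);
    const int lx = (int) get_local_id(0);
    const int ly = (int) get_local_id(1);

    if (gx < n && gy < n) {
        tile[ly][lx] = src[gy * n + gx];   // coalesced row-wise load
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // transposed store: consecutive work-items read tile[lx][ly] with lx
    // varying, i.e. a stride of TILE+1 floats, which spreads them across banks
    const int tx = (int) get_group_id(1) * TILE + lx;
    const int ty = (int) get_group_id(0) * TILE + ly;
    if (tx < n && ty < n) {
        dst[ty * n + tx] = tile[lx][ly];
    }
}
```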
