[OpenCL] Optimized Single-Precision GEMM Kernel #3122

Open
wants to merge 1 commit into
base: main

Contributor

@djeong20 djeong20 commented Apr 17, 2025

This pull request adds an optimized single-precision general matrix multiplication (SGEMM) kernel for OpenCL. In the profiling results below, it reduces the average GPU execution time from 3848.6 ms to 68.5 ms on PC (x86) and from 19600.8 ms to 534.1 ms on Android (aarch64).

Self-evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Result

PC (x86)

| engine | avg |
| --- | --- |
| CPU | 108.05 ms |
| GPU (prev) | 3848.6 ms |
| GPU | 68.5 ms |

Android (aarch64)

| engine | avg |
| --- | --- |
| CPU | 293.75 ms |
| GPU (prev) | 19600.8 ms |
| GPU | 534.1 ms |

Note

The timings above were profiled with the following matrix dimensions:

  • M: 1024
  • K: 3072
  • N: 3072
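To put the timings in perspective, the short script below derives speedups and an approximate throughput from the reported numbers. It is only a sketch: the `2 * M * K * N` operation count is the conventional estimate for a GEMM (one multiply and one add per inner-product term), and treating A as M x K and B as K x N is an assumption, since the PR does not state the operand shapes. The helper names are made up for illustration.

```python
# Matrix dimensions from the profiling note.
M, K, N = 1024, 3072, 3072

# Conventional GEMM operation count (assumption: one multiply + one add per term).
FLOP_COUNT = 2 * M * K * N

# Average timings in milliseconds, copied from the result tables.
TIMINGS_MS = {
    "PC (x86)": {"CPU": 108.05, "GPU (prev)": 3848.6, "GPU": 68.5},
    "Android (aarch64)": {"CPU": 293.75, "GPU (prev)": 19600.8, "GPU": 534.1},
}


def gflops(elapsed_ms):
    """Approximate throughput in GFLOP/s for one GEMM of the profiled size."""
    return FLOP_COUNT / (elapsed_ms * 1e-3) / 1e9


def speedup(before_ms, after_ms):
    """How many times faster 'after' is compared with 'before'."""
    return before_ms / after_ms


for platform, t in TIMINGS_MS.items():
    print(platform)
    print(f"  new GPU vs. previous GPU: {speedup(t['GPU (prev)'], t['GPU']):.1f}x")
    print(f"  new GPU vs. CPU:          {speedup(t['CPU'], t['GPU']):.2f}x")
    print(f"  new GPU throughput:       {gflops(t['GPU']):.1f} GFLOP/s")
```

A ratio below 1 in the "vs. CPU" line means the CPU path is still faster on that platform, which is the case for the Android numbers reported here.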

Member

@skykongkong8 skykongkong8 left a comment

So this kernel computes and stores each 16x16 output tile (the micro-kernel result) from 16x16 sub-blocks of A and B?
Looks good to me overall, but I have some ideas to test further 🤔

@djeong20
Contributor Author

Yes, this utilizes local memory for matrices A and B. Please feel free to share ideas!
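The tiling scheme discussed in this exchange can be sketched on the CPU as follows. This is not the kernel from this PR: it is a plain-Python emulation in which each 16x16 output tile plays the role of one work-group and the `a_tile` / `b_tile` lists stand in for local memory. The function names, the row-major layout, and the edge handling for sizes that are not multiples of 16 are all assumptions made for illustration, and Python floats are double precision rather than single precision.

```python
TILE = 16  # tile edge from the review discussion; everything else is illustrative


def sgemm_tiled(a, b, m, k, n, tile=TILE):
    """Return C = A x B for row-major flat lists; A is m x k, B is k x n."""
    c = [0.0] * (m * n)
    for row0 in range(0, m, tile):        # each (row0, col0) pair ~ one work-group
        for col0 in range(0, n, tile):
            rows = min(tile, m - row0)    # clip tiles at the matrix edge
            cols = min(tile, n - col0)
            acc = [[0.0] * cols for _ in range(rows)]
            for k0 in range(0, k, tile):
                depth = min(tile, k - k0)
                # Stage one tile of A and one tile of B (stand-in for local memory).
                a_tile = [a[(row0 + i) * k + k0:(row0 + i) * k + k0 + depth]
                          for i in range(rows)]
                b_tile = [b[(k0 + p) * n + col0:(k0 + p) * n + col0 + cols]
                          for p in range(depth)]
                # An OpenCL kernel would call barrier(CLK_LOCAL_MEM_FENCE) here
                # so that all work-items see the fully loaded tiles.
                for i in range(rows):
                    for j in range(cols):
                        s = acc[i][j]
                        for p in range(depth):
                            s += a_tile[i][p] * b_tile[p][j]
                        acc[i][j] = s
            # Write the finished output tile back to C.
            for i in range(rows):
                for j in range(cols):
                    c[(row0 + i) * n + col0 + j] = acc[i][j]
    return c


def sgemm_naive(a, b, m, k, n):
    """Reference triple loop used only to check the tiled version."""
    return [sum(a[i * k + p] * b[p * n + j] for p in range(k))
            for i in range(m) for j in range(n)]
```

The point of staging tiles is reuse: each element loaded into a tile is read 16 times from fast memory instead of being fetched from global memory for every multiply, which is the usual motivation for local-memory GEMM kernels.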

This pull request adds a highly optimized single-precision General Matrix Multiplication (GEMM) kernel developed for OpenCL. The enhancements in this kernel aim to improve computational efficiency and performance for matrix operations, reducing execution time and enhancing throughput.

**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test:   [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
@djeong20 djeong20 force-pushed the opencl/optimize/sgemm/v1 branch from 1453754 to 5515495 Compare April 23, 2025 00:51