[OpenCL] Optimized Single-Precision GEMM Kernel #3122

djeong20 · 2025-04-17T02:19:03Z

This pull request adds an optimized single-precision General Matrix Multiplication (SGEMM) kernel for OpenCL. The kernel aims to reduce execution time and increase throughput for matrix operations.

Self-evaluation:

Build test: [X]Passed [ ]Failed [ ]Skipped
Run test: [X]Passed [ ]Failed [ ]Skipped

Result

PC (x86)

| engine | avg time |
| --- | --- |
| CPU | 108.05 ms |
| GPU (prev) | 3848.6 ms |
| GPU | 68.5 ms |

Android (aarch64)

| engine | avg time |
| --- | --- |
| CPU | 293.75 ms |
| GPU (prev) | 19600.8 ms |
| GPU | 534.1 ms |

Note

These profiling results were measured with the following matrix dimensions:

M: 1024
K: 3072
N: 3072
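To put the timings in context, the arithmetic can be sketched in C as below. This is only an illustration: the helper names `gemm_flops`, `gflops` and `speedup` are hypothetical and not part of this PR, the FLOP count uses the conventional 2·M·K·N estimate, and `gflops` is meaningful only under the assumption that each "avg" figure is the time of a single GEMM at this size.

```c
/* Hypothetical helpers for reading the profiling tables; not code from the PR. */

/* Conventional FLOP estimate for C = A * B with A: MxK and B: KxN
 * (one multiply and one add per inner-product term). */
double gemm_flops(double m, double k, double n) {
  return 2.0 * m * k * n;
}

/* Throughput in GFLOP/s, assuming `ms` is the time of one GEMM call. */
double gflops(double flops, double ms) {
  return flops / (ms * 1e-3) / 1e9;
}

/* How many times faster the new time is than the old one. */
double speedup(double old_ms, double new_ms) {
  return old_ms / new_ms;
}
```

For the sizes above, `gemm_flops(1024, 3072, 3072)` is 19,327,352,832. Using the table values, `speedup(3848.6, 68.5)` is about 56 on PC and `speedup(19600.8, 534.1)` is about 37 on Android, both relative to the previous GPU kernel.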

skykongkong8

So this kernel computes and stores a 16x16 micro-kernel result from (16x16) x (16x16) sub-blocks of A and B?
Looks good to me overall, but I have some ideas to test further 🤔

djeong20 · 2025-04-21T06:20:35Z

> So this kernel computes and stores a 16x16 micro-kernel result from (16x16) x (16x16) sub-blocks of A and B? Looks good to me overall, but I have some ideas to test further 🤔

Yes, this utilizes local memory for matrices A and B. Please feel free to share ideas!
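The kernel source is not shown in this conversation, so the following is not the PR's implementation. It is a minimal host-side C sketch of the tiling idea discussed here, under stated assumptions: row-major storage, a 16x16 tile, and plain arrays `tileA` and `tileB` standing in for OpenCL `__local` buffers shared by one work-group. The function names `sgemm_naive` and `sgemm_tiled` are hypothetical.

```c
#include <stddef.h>

#define TILE 16 /* tile edge, matching the 16x16 blocks discussed above */

/* Reference: C = A * B, row-major, A is MxK, B is KxN, C is MxN. */
void sgemm_naive(size_t M, size_t N, size_t K,
                 const float *A, const float *B, float *C) {
  for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (size_t k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = acc;
    }
}

/* Tiled variant. Each (i0, j0) iteration plays the role of one work-group;
 * tileA/tileB emulate the local-memory staging of A and B (an assumption). */
void sgemm_tiled(size_t M, size_t N, size_t K,
                 const float *A, const float *B, float *C) {
  float tileA[TILE][TILE], tileB[TILE][TILE], acc[TILE][TILE];
  for (size_t i0 = 0; i0 < M; i0 += TILE)
    for (size_t j0 = 0; j0 < N; j0 += TILE) {
      for (size_t i = 0; i < TILE; ++i)
        for (size_t j = 0; j < TILE; ++j)
          acc[i][j] = 0.0f;
      for (size_t k0 = 0; k0 < K; k0 += TILE) {
        /* Load step: copy one block of A and one of B, zero-padding edges. */
        for (size_t r = 0; r < TILE; ++r)
          for (size_t c = 0; c < TILE; ++c) {
            tileA[r][c] = (i0 + r < M && k0 + c < K)
                              ? A[(i0 + r) * K + (k0 + c)] : 0.0f;
            tileB[r][c] = (k0 + r < K && j0 + c < N)
                              ? B[(k0 + r) * N + (j0 + c)] : 0.0f;
          }
        /* Compute step: (16x16) x (16x16) product accumulated into acc. */
        for (size_t i = 0; i < TILE; ++i)
          for (size_t k = 0; k < TILE; ++k) {
            const float a = tileA[i][k];
            for (size_t j = 0; j < TILE; ++j)
              acc[i][j] += a * tileB[k][j];
          }
      }
      /* Store step: write only the in-bounds part of the output tile. */
      for (size_t i = 0; i < TILE && i0 + i < M; ++i)
        for (size_t j = 0; j < TILE && j0 + j < N; ++j)
          C[(i0 + i) * N + (j0 + j)] = acc[i][j];
    }
}
```

The point of the staging is data reuse: each element loaded into a tile is read 16 times during the compute step, so the number of reads from the large A and B arrays drops by roughly the tile edge compared with `sgemm_naive`. In an actual OpenCL kernel the load and compute steps would be separated by work-group barriers, which this sequential sketch does not need.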

Signed-off-by: Donghyeon Jeong <[email protected]>

djeong20 requested review from myungjoo, jijoongmoon, again4you, jaeyun-jung, leemgs, wooksong, gichan-jang, anyj0527, lhs8928, songgot, jihochu, DonghakPark, SeoHyungjun, baek2sm, skykongkong8, EunjuYang, dkjung and haehun as code owners April 17, 2025 02:19

github-actions bot added the Need Review label Apr 17, 2025

skykongkong8 approved these changes Apr 17, 2025


djeong20 force-pushed the opencl/optimize/sgemm/v1 branch from 257ce75 to 1453754 Compare April 17, 2025 07:24

skykongkong8 mentioned this pull request Apr 18, 2025

[Wait for #3122][OpenCL] Optimized Half-Precision GEMM Kernel #3123

Open

djeong20 force-pushed the opencl/optimize/sgemm/v1 branch from 1453754 to 5515495 Compare April 23, 2025 00:51
