[ cpu_backend ] Add q4_0_4_8 GEMM + multithreading acceleration + q6_K bstp implementation + cblas removal option #3350

skykongkong8 · 2025-07-22T07:23:00Z

This PR includes:

- Optimized q4_0 GEMM kernel for ARM (previously an unoptimized for-loop fallback kernel was used)
- Repacking function to support the q4_0_4_8 GEMM kernel, with refactoring accordingly
- bs threadpool (bstp) version of q6_K GEMM (fine-grained)

Self evaluation:

Build test: [X]Passed [ ]Failed [ ]Skipped
Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 [email protected]

- ggml_gemm_q4_0_4x8_q8_0
- ggml_gemv_q4_0_4x8_q8_0
- ggml_repack_q4_0_to_q4_0_4_bl
- By changing adaptable function params and non-static functions

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- Implement both OpenMP and bstp versions of q4_0_4x8_q8_0 GEMM and GEMV
- _FP16 activation flow will be supported in the future, so the function is added as a function template.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- The SIMD-optimized q4_0 GEMM kernel is q4_0_4_8, since a NEON register is 128 bits wide, not 256 bits like AVX2 or SVE

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- ARM: the repack_q4_0 function uses the __ggml_repack_q4_0_to_q4_0_4 kernel
- x86: the repack_q4_0 function uses the __ggml_repack_q4_0_to_q4_0_8 kernel
- Fix unittest accordingly

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
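The repacking step above can be pictured as interleaving four q4_0 row blocks into one wider block so that a 128-bit kernel can load data for four rows at once. The following is only a minimal sketch under stated assumptions: the struct and function names (`BlockQ4_0`, `BlockQ4_0x4`, `interleave4`) are hypothetical, the fp16 scale is kept as raw bits, the 8-byte chunk size is taken from the `4x8` in the kernel name, and any sign or bias adjustment of the nibbles that a real repack kernel may perform is left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int QK4_0 = 32; // weights per q4_0 block (two 4-bit weights per byte)

// Hypothetical mirror of a single q4_0 block.
struct BlockQ4_0 {
  uint16_t d;            // fp16 scale, stored as raw bits in this sketch
  uint8_t qs[QK4_0 / 2]; // 16 bytes of packed nibbles
};

// Hypothetical 4-row interleaved block.
struct BlockQ4_0x4 {
  uint16_t d[4];             // one scale per source row
  uint8_t qs[4 * QK4_0 / 2]; // 64 bytes, interleaved in fixed-size chunks
};

// Interleave the quant bytes of four row blocks in chunks of `chunk` bytes:
// row0[0..7], row1[0..7], row2[0..7], row3[0..7], row0[8..15], ...
// `chunk` must divide QK4_0 / 2.
BlockQ4_0x4 interleave4(const std::array<BlockQ4_0, 4> &in, int chunk = 8) {
  BlockQ4_0x4 out{};
  for (int r = 0; r < 4; ++r) {
    out.d[r] = in[r].d;
  }
  const int n_chunks = 4 * (QK4_0 / 2) / chunk;
  for (int i = 0; i < n_chunks; ++i) {
    const int src_row = i % 4;           // which of the four rows
    const int src_off = (i / 4) * chunk; // byte offset inside that row
    std::memcpy(out.qs + i * chunk, in[src_row].qs + src_off, chunk);
  }
  return out;
}
```

With this layout, one 64-byte interleaved block holds the same 32 columns for four consecutive rows, which is what lets a 4-row kernel accumulate four outputs per pass.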

- This impl is fine-grained multithreading.
- Should compare with coarse-grained multithreading version later on.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
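To illustrate what "fine-grained" can mean here, the sketch below splits the output columns of a plain float GEMM into many small tiles and lets the runtime hand tiles to threads dynamically; a coarse-grained variant would instead give each thread one large contiguous slice. This is an assumption-laden illustration, not the PR's q6_K kernel: it uses an OpenMP pragma rather than the bs threadpool to stay dependency-free, and `gemm_fine_grained` and `tile_n` are hypothetical names. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// C (M x N) = A (M x K) * B^T, where B is stored row-major as N x K.
// Work is split into column tiles of width tile_n; each tile is one task.
void gemm_fine_grained(const std::vector<float> &A, const std::vector<float> &B,
                       std::vector<float> &C, int M, int N, int K,
                       int tile_n = 4) {
  const int n_tiles = (N + tile_n - 1) / tile_n;
  // Dynamic scheduling: idle threads pick up the next unprocessed tile.
#pragma omp parallel for schedule(dynamic)
  for (int t = 0; t < n_tiles; ++t) {
    const int n0 = t * tile_n;
    const int n1 = std::min(N, n0 + tile_n);
    for (int m = 0; m < M; ++m) {
      for (int n = n0; n < n1; ++n) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
          acc += A[m * K + k] * B[n * K + k];
        }
        C[m * N + n] = acc; // each output element is written by exactly one task
      }
    }
  }
}
```

Smaller tiles balance load better when rows or columns have uneven cost, at the price of more scheduling overhead, which is the trade-off the comparison with a coarse-grained version would measure.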

- Change GGML_FP16_TO_FP32 -> GGML_COMPUTE_FP16_TO_FP32
- This patch resolves zero-value issues after lm head.
- Remember: this patch should eventually be removed, since this bug occurs because the ggml_init function is not properly called at model runtime.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
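One plausible reading of this workaround, not confirmed by the PR itself, is that the original macro goes through a lookup table that is only filled during initialization, so skipping initialization yields zeros, while the replacement computes the conversion directly. The sketch below reproduces that failure mode in isolation. All names (`g_table`, `init_table`, `lookup_fp16_to_fp32`, `compute_fp16_to_fp32`) are hypothetical stand-ins and are not ggml's identifiers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Direct software conversion of IEEE 754 half-precision bits to float.
inline float compute_fp16_to_fp32(uint16_t h) {
  const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  const uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t man = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0) {
    if (man == 0) {
      bits = sign; // signed zero
    } else {
      // Subnormal half: shift the mantissa up until the implicit bit appears.
      int shift = 0;
      while ((man & 0x400u) == 0) {
        man <<= 1;
        ++shift;
      }
      man &= 0x3FFu;
      bits = sign | (static_cast<uint32_t>(113 - shift) << 23) | (man << 13);
    }
  } else if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (man << 13); // infinity or NaN
  } else {
    bits = sign | ((exp + 112u) << 23) | (man << 13); // rebias 15 -> 127
  }
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Table-backed variant: zero-initialized until init_table() is called.
static float g_table[1 << 16];

inline void init_table() {
  for (uint32_t i = 0; i < (1u << 16); ++i) {
    g_table[i] = compute_fp16_to_fp32(static_cast<uint16_t>(i));
  }
}

inline float lookup_fp16_to_fp32(uint16_t h) { return g_table[h]; }
```

Under this assumption, calling the lookup before `init_table()` returns 0.0f for every input, which would match the zero outputs described above, and the direct computation is correct regardless of initialization order.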

- In specific circumstances, it is better to build without the cblas option.
- (trivial) Add an option for ggml_interface.h to choose whether to include ggml-related files.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- q8_0 quant/dequant functions for comparing f16 and f32 quantization loss

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- quantize_q8_0
- dequantize_row_q8_0
- gemm_q4_0<float> and gemm_q4_0<_FP16>

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
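A round trip through q8_0 quantization and dequantization is one way to measure the quantization loss mentioned in these commits. The sketch below is a reference-style illustration under assumptions: a block holds 32 values with one scale equal to the block's absolute maximum divided by 127, the scale is kept as a `float` here instead of fp16 for brevity, and the names `BlockQ8_0`, `quantize_q8_0_ref`, `dequantize_q8_0_ref`, and `max_abs_error` are hypothetical rather than the functions added in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK8_0 = 32; // values per q8_0 block

struct BlockQ8_0 {
  float d;          // per-block scale (float in this sketch)
  int8_t qs[QK8_0]; // quantized values in [-127, 127]
};

// Quantize a row whose length is a multiple of QK8_0.
std::vector<BlockQ8_0> quantize_q8_0_ref(const std::vector<float> &x) {
  assert(x.size() % QK8_0 == 0);
  std::vector<BlockQ8_0> out(x.size() / QK8_0);
  for (std::size_t b = 0; b < out.size(); ++b) {
    const float *src = x.data() + b * QK8_0;
    float amax = 0.0f; // largest magnitude in the block
    for (int i = 0; i < QK8_0; ++i) {
      amax = std::max(amax, std::fabs(src[i]));
    }
    const float d = amax / 127.0f;
    const float inv_d = (d != 0.0f) ? 1.0f / d : 0.0f;
    out[b].d = d;
    for (int i = 0; i < QK8_0; ++i) {
      out[b].qs[i] = static_cast<int8_t>(std::lround(src[i] * inv_d));
    }
  }
  return out;
}

// Reconstruct floats from quantized blocks.
std::vector<float> dequantize_q8_0_ref(const std::vector<BlockQ8_0> &q) {
  std::vector<float> y(q.size() * QK8_0);
  for (std::size_t b = 0; b < q.size(); ++b) {
    for (int i = 0; i < QK8_0; ++i) {
      y[b * QK8_0 + i] = q[b].d * static_cast<float>(q[b].qs[i]);
    }
  }
  return y;
}

// Largest element-wise reconstruction error.
float max_abs_error(const std::vector<float> &a, const std::vector<float> &b) {
  float err = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    err = std::max(err, std::fabs(a[i] - b[i]));
  }
  return err;
}
```

The per-element error is bounded by roughly half the block scale, so comparing `max_abs_error` for inputs that start as f32 versus inputs first rounded to f16 would give the kind of loss comparison the commit describes.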

skykongkong8 added 5 commits July 22, 2025 10:02

skykongkong8 requested review from myungjoo, jijoongmoon, again4you, jaeyun-jung, leemgs, wooksong, gichan-jang, anyj0527, lhs8928, songgot, jihochu, DonghakPark, SeoHyungjun, baek2sm, djeong20, EunjuYang, dkjung and haehun as code owners July 22, 2025 07:23

skykongkong8 changed the title ~~Pr/ggml/arm/q4048 gemm~~ [ cpu_backend ] Add q4_0_4_8 GEMM + multithreading acceleration + q6_K bstp implemenation Jul 22, 2025

skykongkong8 mentioned this pull request Jul 22, 2025

cblas: num-thread config code clean #3342

Open

skykongkong8 changed the title ~~[ cpu_backend ] Add q4_0_4_8 GEMM + multithreading acceleration + q6_K bstp implemenation~~ [ cpu_backend ] Add q4_0_4_8 GEMM + multithreading acceleration + q6_K bstp implemenation + cblas removal option Jul 22, 2025

github-actions bot added the Need Review label Jul 22, 2025

skykongkong8 force-pushed the pr/ggml/arm/q4048GEMM branch from b4fbe30 to d0c54da Compare July 23, 2025 02:08

[ spec/bugfix ] Close unclosed if line in spec

25036f3

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

skykongkong8 force-pushed the pr/ggml/arm/q4048GEMM branch 2 times, most recently from 458c387 to f06709c Compare July 23, 2025 04:32

skykongkong8 added 2 commits July 23, 2025 13:37

[ ggml ] Implement half-precision activation q4_0 GEMM

49ed45e

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

[ ggml ] Add additional quantization functions

d90732a

- q8_0 quant/dequant functions for comparing f16 and f32 quantization loss

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

skykongkong8 force-pushed the pr/ggml/arm/q4048GEMM branch from f06709c to 17f5eb1 Compare July 23, 2025 08:07

skykongkong8 force-pushed the pr/ggml/arm/q4048GEMM branch from 17f5eb1 to 358c9f7 Compare July 23, 2025 08:10
