[ cpu_backend ] Add q4_0_4_8 GEMM + multithreading acceleration + q6_K bstp implementation + cblas removal option #3350

Open
wants to merge 11 commits into
base: main

Conversation

skykongkong8
Member

This PR includes:

  1. An optimized q4_0 GEMM kernel for ARM; previously, ARM used an unoptimized for-loop fallback kernel
  2. A repacking function to support the q4_0_4_8 GEMM kernel, with related refactoring
  3. A bs threadpool (bstp) version of q6_K GEMM (fine-grained multithreading)

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 [email protected]

- ggml_gemm_q4_0_4x8_q8_0
- ggml_gemv_q4_0_4x8_q8_0
- ggml_repack_q4_0_to_q4_0_4_bl
- These are exposed by making their function parameters adaptable and the functions non-static
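
As a rough illustration of what repacking for a 4x8 kernel involves, the sketch below interleaves 8-byte chunks from four q4_0 blocks (one block from each of four consecutive rows) into a single combined block. The struct layouts, the constants, and the name `repack_q4_0_to_4x8_sketch` are assumptions made for this example and are not taken from this PR; the actual kernel may also transform the packed nibbles, which this sketch does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed layout for illustration only: 32 4-bit values per block,
// stored as 16 bytes, plus a 16-bit (half-precision) scale.
constexpr int kQK4_0 = 32;
constexpr int kRows = 4;  // rows interleaved per group (the "4" in 4x8)
constexpr int kChunk = 8; // bytes per interleaved chunk (the "8" in 4x8)

struct BlockQ4 {
  std::uint16_t d;              // scale, raw fp16 bits
  std::uint8_t qs[kQK4_0 / 2];  // packed nibbles
};

struct BlockQ4x4 {
  std::uint16_t d[kRows];               // one scale per source row
  std::uint8_t qs[kRows * kQK4_0 / 2];  // interleaved nibbles
};

// Hypothetical helper: `in` holds one block from each of N consecutive rows
// (same column position), with N a multiple of 4.
std::vector<BlockQ4x4> repack_q4_0_to_4x8_sketch(const std::vector<BlockQ4> &in) {
  assert(in.size() % kRows == 0);
  std::vector<BlockQ4x4> out(in.size() / kRows);
  constexpr int kChunksPerGroup = kRows * (kQK4_0 / 2) / kChunk; // 8 chunks

  for (std::size_t g = 0; g < out.size(); ++g) {
    const BlockQ4 *src = &in[g * kRows];
    BlockQ4x4 &dst = out[g];
    for (int r = 0; r < kRows; ++r)
      dst.d[r] = src[r].d;
    // Chunk i comes from row (i % 4), at byte offset (i / 4) * 8, so the
    // destination reads row0, row1, row2, row3, row0, row1, ...
    for (int i = 0; i < kChunksPerGroup; ++i) {
      const int row = i % kRows;
      const int off = (i / kRows) * kChunk;
      std::memcpy(dst.qs + i * kChunk, src[row].qs + off, kChunk);
    }
  }
  return out;
}
```

With this layout, one contiguous load brings in matching 8-byte slices from all four rows, which is the property an interleaved GEMM/GEMV kernel would rely on.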

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Implement both OpenMP and bstp versions of q4_0_4x8_q8_0 GEMM and GEMV
- _FP16 activation flow will be supported in the future, so the function is added as a function template.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- On ARM, the SIMD-optimized q4_0 GEMM kernel is q4_0_4x8, since NEON registers are 128 bits wide, not 256 bits as with AVX2 or 256-bit SVE

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- ARM: the repack_q4_0 function uses the __ggml_repack_q4_0_to_q4_0_4 kernel
- x86: the repack_q4_0 function uses the __ggml_repack_q4_0_to_q4_0_8 kernel
- Fix unit tests accordingly

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- This implementation uses fine-grained multithreading.
- It should be compared with a coarse-grained multithreading version later on.
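
To make the fine-grained versus coarse-grained distinction concrete, here is a minimal sketch that uses plain `std::thread` instead of a thread pool library. The helper names `parallel_for_chunks` and `gemv_parallel` and the `grain` parameter are hypothetical and do not come from this PR. A small `grain` hands out many small row ranges (fine-grained), while a large `grain` gives each thread a few big ranges (coarse-grained).

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Runs body(begin, end) over [0, n) in chunks of `grain` items.
// Threads claim the next chunk from a shared atomic counter.
template <typename Body>
void parallel_for_chunks(std::size_t n, std::size_t grain,
                         unsigned num_threads, Body body) {
  assert(grain > 0 && num_threads > 0);
  std::atomic<std::size_t> next{0};
  auto worker = [&]() {
    for (;;) {
      const std::size_t begin = next.fetch_add(grain);
      if (begin >= n)
        break;
      body(begin, std::min(begin + grain, n));
    }
  };
  std::vector<std::thread> threads;
  for (unsigned t = 1; t < num_threads; ++t)
    threads.emplace_back(worker);
  worker(); // the calling thread also does work
  for (auto &th : threads)
    th.join();
}

// y = W * x for a row-major (rows x cols) matrix W, parallelized over rows.
void gemv_parallel(const std::vector<float> &W, const std::vector<float> &x,
                   std::vector<float> &y, std::size_t grain,
                   unsigned num_threads) {
  const std::size_t cols = x.size();
  const std::size_t rows = y.size();
  assert(W.size() == rows * cols);
  parallel_for_chunks(rows, grain, num_threads,
                      [&](std::size_t begin, std::size_t end) {
                        for (std::size_t r = begin; r < end; ++r) {
                          float acc = 0.0f;
                          for (std::size_t c = 0; c < cols; ++c)
                            acc += W[r * cols + c] * x[c];
                          y[r] = acc; // each row is written by one thread only
                        }
                      });
}
```

Calling `gemv_parallel(W, x, y, 1, 4)` dispatches one row at a time to four threads, while `gemv_parallel(W, x, y, rows / 4, 4)` approximates a coarse split. Both produce the same result, so the comparison is purely about scheduling overhead and load balance.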

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 changed the title Pr/ggml/arm/q4048 gemm [ cpu_backend ] Add q4_0_4_8 GEMM + multithreading acceleration + q6_K bstp implemenation Jul 22, 2025
- Change GGML_FP16_TO_FP32 -> GGML_COMPUTE_FP16_TO_FP32
- This patch resolves zero-value issues after lm head.
- Remember: this patch should eventually be removed, since the bug occurs because the ggml_init function is not properly called at model runtime.
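
The failure mode described above can be illustrated with a small sketch. It assumes that the table-based conversion reads from a lookup table that is only filled during initialization, so a lookup before initialization returns zero, while a direct computation does not depend on any setup. All names here (`g_fp16_table`, `init_fp16_table`, `fp16_to_fp32_lookup`, `fp16_to_fp32_compute`) are hypothetical stand-ins, not the ggml macros themselves.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Zero-initialized until init_fp16_table() runs (stand-in for an init step).
static float g_fp16_table[1 << 16];

// Converts IEEE 754 binary16 bits to float by direct computation.
float fp16_to_fp32_compute(std::uint16_t h) {
  const float sign = (h & 0x8000u) ? -1.0f : 1.0f;
  const int exp = (h >> 10) & 0x1F;
  const int mant = h & 0x3FF;
  if (exp == 0) // zero or subnormal: mant * 2^-24
    return sign * std::ldexp(static_cast<float>(mant), -24);
  if (exp == 0x1F) // infinity or NaN
    return mant == 0 ? sign * std::numeric_limits<float>::infinity()
                     : std::numeric_limits<float>::quiet_NaN();
  // normal: (1 + mant/1024) * 2^(exp-15) == (1024 + mant) * 2^(exp-25)
  return sign * std::ldexp(static_cast<float>(1024 + mant), exp - 25);
}

// Fills the lookup table; skipping this call leaves every entry at zero.
void init_fp16_table() {
  for (std::uint32_t i = 0; i < (1u << 16); ++i)
    g_fp16_table[i] = fp16_to_fp32_compute(static_cast<std::uint16_t>(i));
}

// Table-based conversion: fast, but only valid after init_fp16_table().
float fp16_to_fp32_lookup(std::uint16_t h) { return g_fp16_table[h]; }
```

Here `fp16_to_fp32_lookup(0x3C00)` yields zero before `init_fp16_table()` is called and 1.0 afterwards, whereas `fp16_to_fp32_compute(0x3C00)` is 1.0 in both cases. This is why switching to the computed conversion can mask a missing initialization call.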

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 changed the title [ cpu_backend ] Add q4_0_4_8 GEMM + multithreading acceleration + q6_K bstp implemenation [ cpu_backend ] Add q4_0_4_8 GEMM + multithreading acceleration + q6_K bstp implemenation + cblas removal option Jul 22, 2025
- In specific circumstances, it is better to build without the cblas option.
- (trivial) Add an option for ggml_interface.h to choose whether to include ggml-related files.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 force-pushed the pr/ggml/arm/q4048GEMM branch from b4fbe30 to d0c54da Compare July 23, 2025 02:08
**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 force-pushed the pr/ggml/arm/q4048GEMM branch 2 times, most recently from 458c387 to f06709c Compare July 23, 2025 04:32
**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Add q8_0 quant/dequant functions to compare quantization loss between f16 and f32
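
A quantization-loss comparison of this kind boils down to a quantize/dequantize round trip followed by an error measurement. The sketch below assumes the common q8_0 scheme of 32 values per block with a single scale `d = max|x| / 127`. The scale is stored as `float` here for simplicity, and the names `quantize_q8_0_sketch`, `dequantize_q8_0_sketch`, and `max_abs_error` are hypothetical, not the functions added in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kQK8_0 = 32; // values per block (assumed)

struct BlockQ8 {
  float d;                 // per-block scale (float here, for simplicity)
  std::int8_t qs[kQK8_0];  // quantized values in [-127, 127]
};

std::vector<BlockQ8> quantize_q8_0_sketch(const std::vector<float> &x) {
  assert(x.size() % kQK8_0 == 0);
  std::vector<BlockQ8> blocks(x.size() / kQK8_0);
  for (std::size_t b = 0; b < blocks.size(); ++b) {
    const float *src = &x[b * kQK8_0];
    float amax = 0.0f; // largest magnitude in the block
    for (int i = 0; i < kQK8_0; ++i)
      amax = std::max(amax, std::fabs(src[i]));
    const float d = amax / 127.0f;
    const float inv_d = (d != 0.0f) ? 1.0f / d : 0.0f;
    blocks[b].d = d;
    for (int i = 0; i < kQK8_0; ++i)
      blocks[b].qs[i] = static_cast<std::int8_t>(std::lround(src[i] * inv_d));
  }
  return blocks;
}

std::vector<float> dequantize_q8_0_sketch(const std::vector<BlockQ8> &blocks) {
  std::vector<float> y(blocks.size() * kQK8_0);
  for (std::size_t b = 0; b < blocks.size(); ++b)
    for (int i = 0; i < kQK8_0; ++i)
      y[b * kQK8_0 + i] = blocks[b].d * static_cast<float>(blocks[b].qs[i]);
  return y;
}

// Largest element-wise absolute difference between two equally sized vectors.
float max_abs_error(const std::vector<float> &a, const std::vector<float> &b) {
  assert(a.size() == b.size());
  float err = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i)
    err = std::max(err, std::fabs(a[i] - b[i]));
  return err;
}
```

The round-trip error per element is bounded by roughly half the block scale, so running the same measurement on f32 activations and on activations first rounded to f16 gives a direct view of the extra loss from the lower-precision input.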

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 force-pushed the pr/ggml/arm/q4048GEMM branch from f06709c to 17f5eb1 Compare July 23, 2025 08:07
- quantize_q8_0
- dequantize_row_q8_0
- gemm_q4_0<float> and gemm_q4_0<_FP16>

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 force-pushed the pr/ggml/arm/q4048GEMM branch from 17f5eb1 to 358c9f7 Compare July 23, 2025 08:10
1 participant