
cpu: aarch64: KleidiAI int4 and fp32 kernels integration via BRGeMM oneDNN API #2832

Draft: Radu2k wants to merge 2 commits into main from feature/aarch64-kleidi-integration
Conversation

@Radu2k (Contributor) commented on Mar 6, 2025

Description

This pull request introduces and enables Arm® KleidiAI™ microkernels on AArch64 through the BRGeMM oneDNN API. It consists of two commits that add the new functionality.

Specifically:
cpu: aarch64: enable BRGeMM through oneDNN API for AArch64

  • AArch64 BRGeMM oneDNN API Enablement: Enables the BRGeMM oneDNN API route on AArch64 so that the newly introduced kernels can be leveraged on AArch64 hardware for improved int4/fp32 matrix-multiplication performance while preserving compatibility.

cpu: aarch64: integrate KleidiAI through oneDNN API

  • Integration of KleidiAI matmuls: Provides access through the BRGeMM interface to KleidiAI int4 channelwise, int4 groupwise, and fp32 kernels. Both full matrix multiplication and tile-based execution are supported via a vector of (m_idx, n_idx) pairs, where m_idx and n_idx are tile indices along M and N respectively, given pre-packed SRC (LHS) of shape MxK and WEI (RHS) of shape KxN (see the sketch after this list).
  • Integration of KleidiAI packing functions: Expands the oneDNN API Transform functionality, allowing fused int4 quantisation + packing for SRC (LHS) and int4/fp32 WEI (RHS) packing.
  • Documentation and benchdnn updates: Reflects the new KleidiAI integration and enables benchdnn testing of the fp32 KleidiAI kernels.
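
For illustration, below is a minimal sketch (not code from this PR) of the two execution modes described above, modelled on the example included in the PR. The brgemm_kai object, its configuration, and the exact execute() signature are assumed from the experimental dnnl::ukernel API and may differ.

#include <utility>
#include <vector>
#include "oneapi/dnnl/dnnl_ukernel.hpp"

using dnnl::memory;

// Assumes an already configured and generated ukernel object (brgemm_kai)
// and pre-packed A/B buffers, as in the PR's example. The (-1, -1) versus
// (m_idx, n_idx) offset semantics are the ones proposed by this PR.
void run_modes(dnnl::ukernel::brgemm &brgemm_kai, const void *A_ptr,
        const void *B_packed_ptr, void *C_ptr) {
    // Mode 1: full MxN output in a single call, signalled by (-1, -1).
    std::vector<std::pair<memory::dim, memory::dim>> full_offsets {{-1, -1}};
    brgemm_kai.execute(A_ptr, B_packed_ptr, full_offsets, C_ptr,
            /*scratchpad=*/nullptr);

    // Mode 2: tile-based execution; each (m_idx, n_idx) pair computes one
    // (m_step x n_step) tile of the output.
    std::vector<std::pair<memory::dim, memory::dim>> tile_offsets {
            {0, 0}, {0, 1}, {1, 0}, {1, 1}};
    brgemm_kai.execute(A_ptr, B_packed_ptr, tile_offsets, C_ptr,
            /*scratchpad=*/nullptr);
}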

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Failing tests:
The following tests FAILED:
168 - test_graph_unit_dnnl_large_partition_cpu (Failed)
180 - test_graph_unit_dnnl_sdp_decomp_cpu (Failed)
191 - test_benchdnn_modeC_binary_ci_cpu (Subprocess aborted)
192 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Subprocess aborted)
200 - test_benchdnn_modeC_graph_ci_cpu (Subprocess aborted)

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

@github-actions bot added labels on Mar 6, 2025: documentation, platform:cpu-aarch64, component:api, component:tests, component:build, component:examples
Comment on lines +315 to +319
// const bool has_zero_points = !utils::everyone_is(
// brgemm_broadcast_t::none, zp_type_a, zp_type_b, zp_type_c);
// return dt_c != dt_d || with_eltwise || with_binary || with_scales
// || with_bias || with_sum || req_s8s8_compensation
// || has_zero_points || with_dst_scales;
Contributor:
remove?

@Radu2k (Author):
Could do, but I have left them there for future guidance until we remove the TODO on line 313.

@Radu2k force-pushed the feature/aarch64-kleidi-integration branch from 81b2d96 to 23d06e8 on March 6, 2025, 21:05
@mgouicem (Contributor) left a comment:

A few general comments:

  • Ideally, the ukernel programming model should be backend independent, and implementation details like kai should not leak (neither in the API nor in dedicated examples). We may make exceptions for the API, but this would require understanding the usability tradeoffs (maybe an RFC discussion would help?).
  • For post-operations, let's use the post-op mechanism and not introduce new attributes (bias). This has clearer semantics with respect to the order of execution.
  • I am not sure I understand how the bias attribute works: it seems to be applied as part of brgemm but is passed to transform. Could you clarify?


/// Packing for KleidiAI kernels
kai_pack_int4 = dnnl_pack_type_kai_pack_int4,
kai_pack_f32 = dnnl_pack_type_kai_pack_f32,
Contributor:
So far the contract for the oneDNN ukernel API is to only manipulate data in a layout transparent to the user (e.g. pack32 for bfdot, pack64 for *mmla, ...). This allows:

  • users to compute on this data directly with their own custom routines;
  • users to write their own custom transform routines that play nicely with the ukernels.

If we break layout transparency, we severely impact the flexibility of this API (API users will have to do spurious transforms/copies to write their custom fused operations).

Can we express these kai packings in a transparent and generic way?
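
For reference, this is roughly what the existing layout-transparent flow looks like from the user side; the transform constructor arguments below are recalled from the experimental dnnl::ukernel example and may differ between oneDNN versions.

#include "oneapi/dnnl/dnnl_ukernel.hpp"

using namespace dnnl;

// Sketch only: pack B into a documented, ISA-driven layout (e.g. pack32 for
// bfdot-based kernels). Because the layout is transparent, users can also
// produce or consume this buffer with their own custom routines.
void pack_b_transparent(const float *B, float *B_packed, memory::dim K,
        memory::dim N, memory::dim ldb) {
    ukernel::transform pack_B(/*K=*/K, /*N=*/N,
            /*in_pack_type=*/ukernel::pack_type::no_trans,
            /*in_ld=*/N, /*out_ld=*/ldb,
            /*in_dt=*/memory::data_type::f32,
            /*out_dt=*/memory::data_type::f32);
    pack_B.generate();
    pack_B.execute(B, B_packed);
}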

"could not query a packing size from a KleidiAI ukernel "
"object");
return size;
}
Contributor:
Could you clarify why the transform output_size cannot be inferred from shapes and pack_type?
Ideally, users should be able to write their own custom transform routines when a scheme is not supported by the oneDNN ukernel API (e.g. for new quantization methods).

@Radu2k (Author):
KleidiAI kernels depend on the information encoded within the packed tensors in order to leverage memory locality during kernel execution. The design paradigm behind this packing is that kernel execution produces the desired result without extra steps.

@Radu2k (Author) commented on Mar 7, 2025:
Adding to this for clarification: the size is kernel dependent. The int4 channelwise kernel requires a different layout for storing the bias, zero points and scales than the int4 groupwise kernel. In the example, no quantisation information is needed since we do f32:f32:f32, but the kernel is able to fuse the bias-addition binary post-op by encoding the bias within packed B.

@dzarukin (Contributor) commented on Mar 7, 2025:

My personal preference would be to expose a scratchpad instead and fit all extra memory requirements into it. This way users are not confused about the size of the output data and will be able to provide enough bytes for the extra info.

Edit: after a talk with Mourad, it seems a scratchpad doesn't solve anything; the actual packed bias output is needed...

dnnl_status_t status = dnnl_ukernel_attr_params_set_bias(get(), bias);
if (status != dnnl_success)
error::wrap_c_api(status, "could not set B bias argument");
}
Contributor:

Using a binary post-op would be preferable:

  • it is easier for the user to know when it is applied with respect to other post-ops;
  • only a single API mechanism is needed to fuse a broadcast addition after the brgemm operation (see the sketch below).
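
A hedged sketch of this suggestion: the post_ops builder below uses the standard oneDNN attribute API; how the resulting post-ops and their arguments are attached to the ukernel brgemm object is not shown and is assumed to follow the experimental ukernel API.

#include "oneapi/dnnl/dnnl.hpp"

// Express the bias as a broadcast binary_add post-op instead of a dedicated
// bias attribute encoded into packed B.
dnnl::post_ops make_bias_post_op(dnnl::memory::dim N) {
    // 1xN f32 bias tensor, broadcast over the M dimension of the output.
    dnnl::memory::desc bias_md({1, N}, dnnl::memory::data_type::f32,
            dnnl::memory::format_tag::ab);
    dnnl::post_ops po;
    po.append_binary(dnnl::algorithm::binary_add, bias_md);
    return po; // later attached to the brgemm ukernel via its post-ops setter
}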

std::vector<std::pair<memory::dim, memory::dim>> A_B_offsets(1);
A_B_offsets[0] = std::make_pair(-1, -1);

brgemm_kai.execute(A_ptr, B_packed_ptr, A_B_offsets, C_ptr, nullptr);
Contributor:

Here you are adding the bias to B during packing but not to brgemm.
I would expect the result to be
C = A * (B + bias).
However, the reference computes C = (A * B) + bias.
Could you clarify the semantics of the bias attribute in pack_B?

@Radu2k (Author):

The bias is not applied to B when the packing is done; it is encoded in the packed B to be used in execute.
The packed result would be:
--------------------------------------------------------------------
|<(float)bias>|<additional_info(e.g.:zp)+weights to produce output>|
--------------------------------------------------------------------
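
For clarity, the semantics under discussion in plain reference form: the bias, although encoded into packed B by the transform step, is applied to the output, i.e. C = (A * B) + bias rather than C = A * (B + bias). This is only an illustrative reference loop, not the KleidiAI code path.

// Reference semantics: C = (A * B) + bias, with the bias broadcast along M.
void matmul_bias_ref(const float *A, const float *B, const float *bias,
        float *C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc + bias[n];
        }
}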

@@ -148,7 +148,7 @@ void brgemm_kernel_execute_postops(const brgemm_kernel_t *brg_kernel, int bs,
     (*brg_kernel)(&brgemm_p);
 }
 
-status_t brgemm_desc_init(brgemm_t *brg, cpu_isa_t isa,
+status_t brgemm_desc_init(brgemm_desc_t *brg, cpu_isa_t isa,
Contributor:

It would be nice to promote this style change to main before everything else; this would decrease the delta quite significantly.

@Radu2k (Author):

Agree. We can create a separate PR for the first commit so we get this through.

    - Expose oneDNN KleidiAI kernels via BRGeMM API
    - Enable tiling through the A, B offsets parameter
        - pass "(-1, -1)" as the offset for the full matrix (MxN output)
        - pass a "vector of (m_idx, n_idx)" pairs for per-ukernel execution,
          where one execution computes an (m_step x n_step) tile
    - Update documentation to validate the integration
    - Add functionality to benchdnn to execute F32 KleidiAI kernels
      via the BRGeMM API
@Radu2k force-pushed the feature/aarch64-kleidi-integration branch from 23d06e8 to 9afc7df on March 7, 2025, 17:20
@dzarukin (Contributor) commented:
Hi team, please check out this PR: #2862
If you could rebase on top of it, put the implementation underneath in the same way, and provide feedback, that would be great.

The change targets easier maintainability.
