
cpu: aarch64: KleidiAI int4 and fp32 kernels integration via BRGeMM oneDNN API #2832

Draft: Radu2k wants to merge 2 commits into main from feature/aarch64-kleidi-integration
Conversation

@Radu2k (Contributor) commented on Mar 6, 2025

Description

This pull request introduces and enables Arm® KleidiAI™ microkernels on AArch64 through the BRGeMM oneDNN API. It consists of two commits that add the new functionality.

Specifically:
cpu: aarch64: enable BRGeMM through oneDNN API for AArch64

  • AArch64 BRGeMM oneDNN API Enablement: Enables the BRGeMM oneDNN API route on AArch64 so that the newly introduced kernels can be leveraged on AArch64 hardware for improved int4/fp32 matrix-multiplication performance while preserving compatibility.

cpu: aarch64: integrate KleidiAI through oneDNN API

  • Integration of KleidiAI matmuls: Provides access through the BRGeMM interface to KleidiAI int4 channelwise, int4 groupwise, and fp32 kernels. Both full matrix multiplication and tile-based execution are supported via a vector of (m_idx, n_idx) pairs, where m_idx and n_idx are tile indices along M and N respectively, given pre-packed SRC (LHS) of shape MxK and WEI (RHS) of shape KxN (see the sketch after this list).
  • Integration of KleidiAI packing functions: Expands the oneDNN API Transform functionality, allowing fused int4 quantisation + packing for SRC (LHS) and int4/fp32 WEI (RHS) packing.
  • Documentation and benchdnn updates: Reflects the new KleidiAI integration and enables benchdnn testing of the fp32 KleidiAI kernels.
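
For illustration, below is a minimal sketch (not code from this PR) of the two execution modes described above, modelled on the example included in the PR. The brgemm_kai object, its configuration, and the exact execute() signature are assumed from the experimental dnnl::ukernel API and may differ.

#include <utility>
#include <vector>
#include "oneapi/dnnl/dnnl_ukernel.hpp"

using dnnl::memory;

// Assumes an already configured and generated ukernel object (brgemm_kai)
// and pre-packed A/B buffers, as in the PR's example. The (-1, -1) versus
// (m_idx, n_idx) offset semantics are the ones proposed by this PR.
void run_modes(dnnl::ukernel::brgemm &brgemm_kai, const void *A_ptr,
        const void *B_packed_ptr, void *C_ptr) {
    // Mode 1: full MxN output in a single call, signalled by (-1, -1).
    std::vector<std::pair<memory::dim, memory::dim>> full_offsets {{-1, -1}};
    brgemm_kai.execute(A_ptr, B_packed_ptr, full_offsets, C_ptr,
            /*scratchpad=*/nullptr);

    // Mode 2: tile-based execution; each (m_idx, n_idx) pair computes one
    // (m_step x n_step) tile of the output.
    std::vector<std::pair<memory::dim, memory::dim>> tile_offsets {
            {0, 0}, {0, 1}, {1, 0}, {1, 1}};
    brgemm_kai.execute(A_ptr, B_packed_ptr, tile_offsets, C_ptr,
            /*scratchpad=*/nullptr);
}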

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Failing tests:
The following tests FAILED:
168 - test_graph_unit_dnnl_large_partition_cpu (Failed)
180 - test_graph_unit_dnnl_sdp_decomp_cpu (Failed)
191 - test_benchdnn_modeC_binary_ci_cpu (Subprocess aborted)
192 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Subprocess aborted)
200 - test_benchdnn_modeC_graph_ci_cpu (Subprocess aborted)

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

@github-actions bot added labels on Mar 6, 2025: documentation, platform:cpu-aarch64, component:api, component:tests, component:build, component:examples
Comment on lines +315 to +319
// const bool has_zero_points = !utils::everyone_is(
// brgemm_broadcast_t::none, zp_type_a, zp_type_b, zp_type_c);
// return dt_c != dt_d || with_eltwise || with_binary || with_scales
// || with_bias || with_sum || req_s8s8_compensation
// || has_zero_points || with_dst_scales;
Contributor:
remove?

@Radu2k (Author):
Could do, but I have left them there for future guidance until we remove the TODO on line 313.

@Radu2k force-pushed the feature/aarch64-kleidi-integration branch from 81b2d96 to 23d06e8 on March 6, 2025, 21:05
@mgouicem (Contributor) left a comment:

A few general comments:

  • Ideally, the ukernel programming model should be backend independent, and implementation details like kai should not leak (neither in the API nor in dedicated examples). We may make exceptions for the API, but this would require understanding the usability tradeoffs (maybe an RFC discussion would help?).
  • For post-operations, let's use the post-op mechanism and not introduce new attributes (bias). This has clearer semantics with respect to the order of execution.
  • I am not sure I understand how the bias attribute works: it seems to be applied as part of brgemm but is passed to transform. Could you clarify?


/// Packing for KleidiAI kernels
kai_pack_int4 = dnnl_pack_type_kai_pack_int4,
kai_pack_f32 = dnnl_pack_type_kai_pack_f32,
Contributor:
So far the contract for the oneDNN ukernel API is to only manipulate data in a layout transparent to the user (e.g. pack32 for bfdot, pack64 for *mmla, ...). This allows:

  • users to compute on this data directly with their own custom routines;
  • users to write their own custom transform routines that play nicely with the ukernels.

If we break layout transparency, we severely impact the flexibility of this API (API users will have to do spurious transforms/copies to write their custom fused operations).

Can we express these kai packings in a transparent and generic way?
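
For reference, this is roughly what the existing layout-transparent flow looks like from the user side; the transform constructor arguments below are recalled from the experimental dnnl::ukernel example and may differ between oneDNN versions.

#include "oneapi/dnnl/dnnl_ukernel.hpp"

using namespace dnnl;

// Sketch only: pack B into a documented, ISA-driven layout (e.g. pack32 for
// bfdot-based kernels). Because the layout is transparent, users can also
// produce or consume this buffer with their own custom routines.
void pack_b_transparent(const float *B, float *B_packed, memory::dim K,
        memory::dim N, memory::dim ldb) {
    ukernel::transform pack_B(/*K=*/K, /*N=*/N,
            /*in_pack_type=*/ukernel::pack_type::no_trans,
            /*in_ld=*/N, /*out_ld=*/ldb,
            /*in_dt=*/memory::data_type::f32,
            /*out_dt=*/memory::data_type::f32);
    pack_B.generate();
    pack_B.execute(B, B_packed);
}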

"could not query a packing size from a KleidiAI ukernel "
"object");
return size;
}
Contributor:
Could you clarify why the transform output_size cannot be inferred from shapes and pack_type?
Ideally, users should be able to write their own custom transform routines when a scheme is not supported by the oneDNN ukernel API (e.g. for new quantization methods).

@Radu2k (Author):
KleidiAI kernels depend on the information encoded within the packed tensors in order to leverage memory locality during kernel execution. The design paradigm behind this packing is that kernel execution produces the desired result without extra steps.

@Radu2k (Author) commented on Mar 7, 2025:
Adding to this for clarification: the size is kernel dependent. The int4 channelwise kernel requires a different layout for storing the bias, zero points and scales than the int4 groupwise kernel. In the example, no quantisation information is needed since we do f32:f32:f32, but the kernel is able to fuse the bias-addition binary post-op by encoding the bias within packed B.

@dzarukin (Contributor) commented on Mar 7, 2025:

My personal preference would be to expose a scratchpad instead and fit all extra memory requirements into it. This way users are not confused about the size of the output data and will be able to provide enough bytes for the extra info.

Edit: after a talk with Mourad, it seems a scratchpad doesn't solve anything; the actual packed bias output is needed...

dnnl_status_t status = dnnl_ukernel_attr_params_set_bias(get(), bias);
if (status != dnnl_success)
error::wrap_c_api(status, "could not set B bias argument");
}
Contributor:

Using a binary post-op would be preferable:

  • it is easier for the user to know when it is applied with respect to other post-ops;
  • only a single API mechanism is needed to fuse a broadcast addition after the brgemm operation (see the sketch below).
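
A hedged sketch of this suggestion: the post_ops builder below uses the standard oneDNN attribute API; how the resulting post-ops and their arguments are attached to the ukernel brgemm object is not shown and is assumed to follow the experimental ukernel API.

#include "oneapi/dnnl/dnnl.hpp"

// Express the bias as a broadcast binary_add post-op instead of a dedicated
// bias attribute encoded into packed B.
dnnl::post_ops make_bias_post_op(dnnl::memory::dim N) {
    // 1xN f32 bias tensor, broadcast over the M dimension of the output.
    dnnl::memory::desc bias_md({1, N}, dnnl::memory::data_type::f32,
            dnnl::memory::format_tag::ab);
    dnnl::post_ops po;
    po.append_binary(dnnl::algorithm::binary_add, bias_md);
    return po; // later attached to the brgemm ukernel via its post-ops setter
}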

std::vector<std::pair<memory::dim, memory::dim>> A_B_offsets(1);
A_B_offsets[0] = std::make_pair(-1, -1);

brgemm_kai.execute(A_ptr, B_packed_ptr, A_B_offsets, C_ptr, nullptr);
Contributor:

Here you are adding the bias to B during packing but not to brgemm.
I would expect the result to be
C = A * (B + bias).
However, the reference computes C = (A * B) + bias.
Could you clarify the semantics of the bias attribute in pack_B?

@Radu2k (Author):

The bias is not applied to B when the packing is done; it is encoded in the packed B to be used in execute.
The packed result would be:
--------------------------------------------------------------------
|<(float)bias>|<additional_info(e.g.:zp)+weights to produce output>|
--------------------------------------------------------------------
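
For clarity, the semantics under discussion in plain reference form: the bias, although encoded into packed B by the transform step, is applied to the output, i.e. C = (A * B) + bias rather than C = A * (B + bias). This is only an illustrative reference loop, not the KleidiAI code path.

// Reference semantics: C = (A * B) + bias, with the bias broadcast along M.
void matmul_bias_ref(const float *A, const float *B, const float *bias,
        float *C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc + bias[n];
        }
}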

@@ -148,7 +148,7 @@ void brgemm_kernel_execute_postops(const brgemm_kernel_t *brg_kernel, int bs,
     (*brg_kernel)(&brgemm_p);
 }
 
-status_t brgemm_desc_init(brgemm_t *brg, cpu_isa_t isa,
+status_t brgemm_desc_init(brgemm_desc_t *brg, cpu_isa_t isa,
Contributor:

It would be nice to promote this style change to main before everything else; this would decrease the delta quite significantly.

@Radu2k (Author):

Agree. We can create a separate PR for the first commit so we get this through.

    - Expose oneDNN KleidiAI kernels via BRGeMM API
    - Enable tiling through the A, B offsets parameter
        - pass "(-1, -1)" as the offset for the full matrix (MxN output)
        - pass a "vector of (m_idx, n_idx)" pairs for per-ukernel execution,
          where one execution computes an (m_step x n_step) tile
    - Update documentation to validate the integration
    - Add functionality to benchdnn to execute F32 KleidiAI kernels
      via the BRGeMM API
@Radu2k force-pushed the feature/aarch64-kleidi-integration branch from 23d06e8 to 9afc7df on March 7, 2025, 17:20
@dzarukin (Contributor) commented:
Hi team, please check out this PR: #2862
If you could rebase on top of it, put the implementation underneath in the same way, and provide feedback, that would be great.

The change targets easier maintainability.
