cpu: aarch64: KleidiAI int4 and fp32 kernels integration via BRGeMM oneDNN API #2832
base: main
Conversation
```cpp
// const bool has_zero_points = !utils::everyone_is(
//         brgemm_broadcast_t::none, zp_type_a, zp_type_b, zp_type_c);
// return dt_c != dt_d || with_eltwise || with_binary || with_scales
//         || with_bias || with_sum || req_s8s8_compensation
//         || has_zero_points || with_dst_scales;
```
remove?
Could do, but I have left them there as guidance for the future; they will go away when the TODO on line 313 is removed.
A few general comments:
- Ideally, the ukernel programming model should be backend independent, and implementation details like kai should not leak (neither into the API nor into dedicated examples). We may make exceptions for the API, but that would require understanding the usability trade-offs (maybe an RFC discussion would help?).
- For post-operations, let's use the post-op mechanism and not introduce new attributes (bias). This has clearer semantics w.r.t. order of execution.
- I am not sure I understand how the bias attribute works: it seems to be applied as part of brgemm but is passed to transform. Could you clarify?
```cpp
/// Packing for KleidiAI kernels
kai_pack_int4 = dnnl_pack_type_kai_pack_int4,
kai_pack_f32 = dnnl_pack_type_kai_pack_f32,
```
So far the contract for the oneDNN ukernel API is to only manipulate data in a layout transparent to the user (e.g. pack32 for bfdot, pack64 for *mmla, ...). This allows:
- users to compute on this data directly with their own custom routines;
- users to write their own custom transform routines that play nicely with ukernels.
If we break layout transparency, we severely impact the flexibility of this API (API users will have to do spurious transforms/copies to write their custom fused operations).
Can we express those kai packings in a transparent and generic way?
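For reference, a minimal sketch of the layout-transparent flow this contract implies, assuming the experimental `dnnl::ukernel` C++ API (the static `get_B_pack_type` query and the `transform` constructor follow oneDNN v3.6-style naming and may differ across versions):

```cpp
#include "oneapi/dnnl/dnnl_ukernel.hpp"
using namespace dnnl;
using namespace dnnl::ukernel;

// Pack B only if the ukernel asks for it; the layout stays generic.
void maybe_pack_B(memory::dim K, memory::dim N, memory::dim ldb,
        const float *B, float *B_packed) {
    // Query the generic pack type the brgemm ukernel expects for B
    // (e.g. pack_type::pack32 for bfdot-style kernels); nothing
    // backend-specific leaks to the user.
    pack_type bpt = brgemm::get_B_pack_type(
            memory::data_type::f32, memory::data_type::f32);
    if (bpt == pack_type::no_trans) return; // plain row-major is fine

    // Users can rely on oneDNN's transform routine...
    transform pack(K, N, /*in_pack_type=*/pack_type::no_trans,
            /*in_ld=*/ldb, /*out_ld=*/N, memory::data_type::f32,
            memory::data_type::f32);
    pack.generate();
    pack.execute(B, B_packed);
    // ...or, because pack32/pack64 layouts are documented, write their
    // own packing code and feed its output to the ukernel directly.
}
```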
"could not query a packing size from a KleidiAI ukernel " | ||
"object"); | ||
return size; | ||
} |
Could you clarify why the transform output_size cannot be inferred from the shapes and the pack_type?
Ideally, users should be able to write their own custom transform routines when a case is not supported by the oneDNN ukernel API (e.g. for new quantization methods).
KleidiAI kernels depend on the information encoded within packed tensors in order to leverage memory locality during kernel execution. The design paradigm behind this packing is that kernel execution outputs the desired result without extra steps.
Adding on to this for clarification: the size is kernel dependent. The int4 channelwise kernel requires a different layout for storing the bias, zero points, and scales than the int4 groupwise kernel. In the example, no quantisation information is needed since we do f32:f32:f32, but the kernel is able to fuse the bias-addition binary post-op by encoding the bias within packed B.
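To illustrate why the size is kernel dependent and cannot be derived from the shapes alone, here is a purely hypothetical footprint computation; the layouts and constants are made up for illustration and are not the actual KleidiAI formats:

```cpp
#include <cstddef>

// Hypothetical packed-B footprint for a channelwise int4 kernel: each
// output channel stores its fused bias and scale next to the weights,
// so the size depends on the kernel's metadata layout, not just K x N.
size_t packed_B_size_channelwise_int4(size_t K, size_t N) {
    const size_t bias_bytes = sizeof(float); // fused f32 bias per channel
    const size_t scale_bytes = sizeof(float); // per-channel f32 scale
    const size_t weight_bytes = (K + 1) / 2; // two int4 values per byte
    return N * (bias_bytes + scale_bytes + weight_bytes);
}

// A groupwise kernel would instead store one scale per group of
// group_size elements along K, giving a different (larger) size.
size_t packed_B_size_groupwise_int4(size_t K, size_t N, size_t group_size) {
    const size_t n_groups = (K + group_size - 1) / group_size;
    return N * (sizeof(float) /*bias*/ + n_groups * sizeof(float) /*scales*/
            + (K + 1) / 2 /*weights*/);
}
```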
My personal preference would be to expose a scratchpad instead and fit all extra memory requirements into it. This way users are not confused about the size of the output data and will be able to provide enough bytes for the extra info.
Edit: after a talk with Mourad, it seems a scratchpad doesn't solve anything; the actual packed bias output is needed...
```cpp
dnnl_status_t status = dnnl_ukernel_attr_params_set_bias(get(), bias);
if (status != dnnl_success)
    error::wrap_c_api(status, "could not set B bias argument");
}
```
Using a binary post-op would be preferable:
- it is easier for the user to know when it is applied w.r.t. other post-ops;
- there is only a single API mechanism to fuse a broadcasted addition after the brgemm operation.
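A minimal sketch of what this could look like, assuming the experimental `dnnl::ukernel` C++ API (oneDNN v3.6-style naming; the exact signatures of `set_post_ops`, `attr_params`, and the post-op `execute` overload may differ):

```cpp
#include <cstdint>
#include <vector>
#include "oneapi/dnnl/dnnl_ukernel.hpp"
using namespace dnnl;
using namespace dnnl::ukernel;

// Compute C = A * B_packed, then D = C + bias via a binary-add post-op.
void brgemm_with_bias_postop(memory::dim M, memory::dim N, memory::dim K,
        memory::dim lda, memory::dim ldb, memory::dim ldc, memory::dim ldd,
        const float *A_ptr, const float *B_packed_ptr, const float *bias_ptr,
        float *C_ptr, float *D_ptr) {
    // Bias is a 1xN row broadcast over the M rows of the destination.
    post_ops po;
    memory::desc bias_md(
            {1, N}, memory::data_type::f32, memory::format_tag::ab);
    po.append_binary(algorithm::binary_add, bias_md);

    brgemm brg(M, N, K, /*batch_size=*/1, lda, ldb, ldc,
            memory::data_type::f32, memory::data_type::f32,
            memory::data_type::f32);
    brg.set_post_ops(ldd, memory::data_type::f32, po);
    brg.finalize();
    brg.generate();

    std::vector<uint8_t> scratchpad(brg.get_scratchpad_size());

    // The bias pointer is bound at execution time as a post-op argument,
    // so its point of application w.r.t. other post-ops is unambiguous.
    attr_params params;
    const void *bin_args[] = {bias_ptr};
    params.set_post_ops_args(bin_args);

    // Single batch element with zero offsets (mainline API semantics).
    std::vector<std::pair<memory::dim, memory::dim>> A_B_offsets(1);
    A_B_offsets[0] = std::make_pair(0, 0);
    brg.execute(A_ptr, B_packed_ptr, A_B_offsets, C_ptr, D_ptr,
            scratchpad.data(), params);
}
```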
std::vector<std::pair<memory::dim, memory::dim>> A_B_offsets(1); | ||
A_B_offsets[0] = std::make_pair(-1, -1); | ||
|
||
brgemm_kai.execute(A_ptr, B_packed_ptr, A_B_offsets, C_ptr, nullptr); |
Here you are adding bias to B during packing but not to brgemm.
I would expect the results to be
C = A * (B + bias).
However the reference computes C = (A * B) + bias.
Could you clarify the semantics of the bias attribute in pack_B?
The bias is not applied to B when the packing is done; rather, it is encoded in the packed B to be used during execute.
The packed result would be:
--------------------------------------------------------------------
|<(float)bias>|<additional_info(e.g.:zp)+weights to produce output>|
--------------------------------------------------------------------
"could not query a packing size from a KleidiAI ukernel " | ||
"object"); | ||
return size; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My personal preference would be to expose scratchpad instead and fit all extra memory requirements into it. This way users are not confused about the size of the output data and will be able to provide enough bytes for extra info.
Edit: after a talk with Mourad, it seems scratchpad doesn't solve anything, actual packed bias output is needed...
```diff
@@ -148,7 +148,7 @@ void brgemm_kernel_execute_postops(const brgemm_kernel_t *brg_kernel, int bs,
     (*brg_kernel)(&brgemm_p);
 }

-status_t brgemm_desc_init(brgemm_t *brg, cpu_isa_t isa,
+status_t brgemm_desc_init(brgemm_desc_t *brg, cpu_isa_t isa,
```
It would be nice to promote this style change to main before everything else - it would decrease the delta quite significantly.
Agree. We can create a separate PR for the first commit so we get this through.
- Expose oneDNN KleidiAI kernels via BRGeMM API
- Enable tiling through the A,B offsets parameter:
  - pass "(-1, -1)" as the offset for the full matrix (MxN output)
  - pass a "vector of m_idx, n_idx" for one ukernel execution, where one execution computes (m_step x n_step)
- Update documentation to validate integration
- Add functionality to benchdnn to execute F32 Kleidi kernels via the BRGeMM API
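A hedged sketch of the tiling contract these bullets describe, assuming m_idx/n_idx are tile indices, M and N divide evenly by m_step and n_step, and the brgemm_kai object and pointers are set up as in the example earlier in the thread:

```cpp
std::vector<std::pair<memory::dim, memory::dim>> offsets(1);

// Full-matrix execution: (-1, -1) requests the whole MxN output at once.
offsets[0] = std::make_pair(-1, -1);
brgemm_kai.execute(A_ptr, B_packed_ptr, offsets, C_ptr, nullptr);

// Tiled execution: one ukernel call per (m_step x n_step) block of C.
for (memory::dim m_idx = 0; m_idx < M / m_step; ++m_idx)
    for (memory::dim n_idx = 0; n_idx < N / n_step; ++n_idx) {
        offsets[0] = std::make_pair(m_idx, n_idx);
        brgemm_kai.execute(A_ptr, B_packed_ptr, offsets, C_ptr, nullptr);
    }
```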
Hi team, please check out this PR: #2862. The change targets easier maintainability.
Description
This pull request introduces and enables Arm® KleidiAI™ microkernels on AArch64 through the BRGeMM oneDNN API, consisting of two commits that add the new functionality.
Specifically:
cpu: aarch64: enable BRGeMM through oneDNN API for AArch64
cpu: aarch64: integrate KleidiAI through oneDNN API
Checklist
General
Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?

Failing tests:
The following tests FAILED:
168 - test_graph_unit_dnnl_large_partition_cpu (Failed)
180 - test_graph_unit_dnnl_sdp_decomp_cpu (Failed)
191 - test_benchdnn_modeC_binary_ci_cpu (Subprocess aborted)
192 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Subprocess aborted)
200 - test_benchdnn_modeC_graph_ci_cpu (Subprocess aborted)
Performance improvements
New features