[MLAS] Add an NHWC implementation of convolution to avoid transposes #26834
Conversation
…transposes

* Modification to the CPU EP to specify channels_last when the data format is NHWC
* Added a FusedNhwcConv kernel
* Implementation of the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
* Added unit tests

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@microsoft-github-policy-service agree company="Arm"
Feedback appreciated, as this PR makes quite a lot of changes to the codebase well outside of the normal KleidiAI scope.
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Hi @orlmon01, I imagine that avoiding transposes also improves performance.
Hiya, I have some numbers somewhere from a MobileNet model I was using for testing that I'll add in a bit, once I find / regenerate them. :)
MobileNet model without the current patch:
Same model with changes:
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
* Update to the internal_testing_tests helper macros for file expansion so it works on other platforms
* Fix for failing ConvDepthwiseFloat test: allow a small tolerance when running on different hardware
* Fix for failing TestSaveAndLoadOrtModel test: make sure the model being saved / loaded uses a writeable location
* Fix for undeclared identifier linker error
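The ConvDepthwiseFloat fix above comes down to tolerance-based comparison. As a hypothetical Python sketch (not the actual test code; `all_close` and the tolerance values are invented for illustration), exact float equality can fail across hardware with different FMA/rounding behaviour, while a small relative + absolute tolerance still passes:

```python
# Hypothetical sketch of tolerance-based comparison, in the spirit of the
# ConvDepthwiseFloat fix: bitwise float equality is fragile across
# hardware, so compare with a small relative + absolute tolerance.
import math

def all_close(expected, actual, rel_tol=1e-5, abs_tol=1e-6):
    return all(math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
               for e, a in zip(expected, actual))

reference = [0.1 + 0.2, 1.0 / 3.0]
hardware  = [0.30000001, 0.33333334]   # slightly different rounding
assert reference[0] != hardware[0]      # bitwise equality fails
assert all_close(reference, hardware)   # tolerance check passes
```

The real test presumably applies an equivalent epsilon to the conv outputs; the exact tolerances there are its own choice.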
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Pull request overview
This PR adds an NHWC (channels-last) implementation of convolution operations to avoid costly transpose operations in the CPU execution provider. The implementation includes KleidiAI-specific optimizations and a fallback path for NHWC convolutions.
Changes:
- Added NhwcFusedConv kernel for float32 convolutions in NHWC layout (KleidiAI-guarded)
- Implemented NHWC fast path and fallback path with explicit NHWC↔NCHW conversions in MLAS
- Extended test infrastructure to resolve paths dynamically and filter NHWC transformers in existing tests
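As a quick refresher on why channels-last matters here (an illustrative Python sketch, not ORT code; the helper names are invented), the same conv input element lives at different flat offsets in NCHW vs NHWC buffers, and a kernel that indexes the NHWC buffer directly needs no Transpose nodes around it:

```python
# Hypothetical sketch: flat-buffer addressing of element (n, c, h, w)
# in NCHW vs NHWC (channels-last) layouts. A kernel that indexes NHWC
# directly avoids inserting Transpose nodes around the Conv.

def nchw_index(n, c, h, w, C, H, W):
    # offset of element (n, c, h, w) in a flat NCHW buffer
    return ((n * C + c) * H + h) * W + w

def nhwc_index(n, c, h, w, C, H, W):
    # offset of the same element in a flat NHWC buffer
    return ((n * H + h) * W + w) * C + c

# A 1x2x2x2 tensor stored NCHW, then re-laid-out as NHWC
N, C, H, W = 1, 2, 2, 2
nchw = [c * 100 + h * 10 + w for c in range(C) for h in range(H) for w in range(W)]
nhwc = [0] * len(nchw)
for c in range(C):
    for h in range(H):
        for w in range(W):
            nhwc[nhwc_index(0, c, h, w, C, H, W)] = nchw[nchw_index(0, c, h, w, C, H, W)]
print(nchw)  # [0, 1, 10, 11, 100, 101, 110, 111]
print(nhwc)  # [0, 100, 1, 101, 10, 110, 11, 111]
```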
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cpu/nn/conv.h | Added channels_last_ flag to Conv kernel |
| onnxruntime/core/providers/cpu/nn/conv.cc | Implemented NHWC convolution logic with fast path and fallback |
| onnxruntime/core/optimizer/nhwc_transformer.cc | Added KleidiAI filter and FusedConv sum input handling |
| onnxruntime/core/mlas/lib/convolve.cpp | Added ChannelsLast parameter to MlasConvPrepare |
| onnxruntime/core/mlas/inc/mlas.h | Added ChannelsLast field to MLAS_CONV_PARAMETERS |
| onnxruntime/contrib_ops/cpu/fused_conv.cc | Registered NhwcFusedConv kernel |
| onnxruntime/test/optimizer/nhwc_transformer_test.cc | Added depthwise convolution test case |
| onnxruntime/test/optimizer/fuse_initializers_transformer_test.cc | Filtered NhwcTransformer in tests |
| onnxruntime/test/optimizer/conv_add_act_test.cc | Updated to handle both FusedConv variants |
| onnxruntime/test/internal_testing_ep/internal_testing_tests.cc | Added path resolution utilities |
| onnxruntime/test/framework/ort_model_only_test.cc | Added path resolution with diagnostic output |
| onnxruntime/core/util/math_cpu.cc | Added Im2col instantiation for float NHWC |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Updated to support channels-last input |
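The math_cpu.cc row above mentions an Im2col instantiation for float NHWC. A toy pure-Python sketch (hypothetical: stride 1, no padding, single image, invented function name) of what an NHWC im2col produces — one patch row per output pixel, with channels contiguous, ready for a GEMM-based conv:

```python
# Hypothetical sketch of im2col over a flat NHWC buffer (N == 1,
# stride 1, no padding): each output row gathers a kh x kw patch with
# the channel values of each pixel kept contiguous.

def im2col_nhwc(x, H, W, C, kh, kw):
    rows = []
    for oh in range(H - kh + 1):
        for ow in range(W - kw + 1):
            patch = []
            for ih in range(oh, oh + kh):
                for iw in range(ow, ow + kw):
                    base = (ih * W + iw) * C  # NHWC: channels innermost
                    patch.extend(x[base:base + C])
            rows.append(patch)
    return rows

# 3x3 image, 1 channel, 2x2 kernel -> four 4-element patch rows
x = [1, 2, 3,
     4, 5, 6,
     7, 8, 9]
print(im2col_nhwc(x, 3, 3, 1, 2, 2))
# [[1, 2, 4, 5], [2, 3, 5, 6], [4, 5, 7, 8], [5, 6, 8, 9]]
```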
* NhwcFusedConv should now be available in minimal builds as the resolver bytes contain it Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
* RHS hashing now uses the full tensor to ensure uniqueness * LHS no longer uses hashing as it's unnecessary Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
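The hashing change in this commit can be illustrated with a hypothetical sketch (Python, invented names; the real cache key logic lives in C++): keying a cache by pointer or a partial hash can alias two different filters, whereas hashing the full tensor contents is purely content-based:

```python
# Hypothetical sketch: a content-based cache key over the full tensor,
# in the spirit of "RHS hashing now uses the full tensor". Hashing every
# element (with a separator to avoid concatenation ambiguity) makes the
# key unique up to hash collisions, unlike a pointer or partial hash.
import hashlib

def tensor_key(values):
    h = hashlib.sha256()
    for v in values:
        h.update((repr(v) + ";").encode())
    return h.hexdigest()

a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 4.0]                  # differs only in the last element
assert tensor_key(a) != tensor_key(b)
assert tensor_key(a) == tensor_key(list(a))  # content, not identity
```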
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
* The change replaces the unsafe long-lived thread_local RHS cache with a kernel-owned packed-filter cache for NHWC float conv when the filter and optional bias are constant initializers. The packed RHS is built once per kernel instance and then reused safely for that kernel’s lifetime, which avoids pointer/hash aliasing issues without pulling in the larger ORT prepack machinery. Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
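A minimal sketch of the kernel-owned cache pattern described above (hypothetical Python, not the actual MLAS code; class and method names are invented): pack the constant filter once on first compute and reuse it for the kernel instance's lifetime:

```python
# Hypothetical sketch of a kernel-owned packed-filter cache: the packed
# RHS is built lazily on first use and then reused for this kernel
# instance's lifetime, so no shared thread_local cache (and no
# pointer/hash aliasing between kernels) is involved.
class NhwcConvKernel:
    def __init__(self, filter_weights):
        self._filter = filter_weights
        self._packed = None            # built lazily, owned by this kernel

    def _pack(self, weights):
        # stand-in for the real RHS packing routine
        return tuple(weights)

    def compute(self, x):
        if self._packed is None:       # pack once per kernel instance
            self._packed = self._pack(self._filter)
        # stand-in for the conv itself: dot product with the packed filter
        return sum(a * b for a, b in zip(x, self._packed))

k = NhwcConvKernel([1.0, 2.0, 3.0])
print(k.compute([1.0, 1.0, 1.0]))  # 6.0 (filter packed on first call)
print(k.compute([0.0, 1.0, 0.0]))  # 2.0 (packed filter reused)
```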
This reverts commit 39edc59.
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
…mationRequiredOps test Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
…solverCanResolveNhwcFusedConv and have it check load time instead of the saved model Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
…r needed Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
…d tests Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Remove extraneous whitespace
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
…transposes

* Modification to the CPU EP to specify channels_last when the data format is NHWC
* Added a FusedNhwcConv kernel
* Implementation of the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
* Added unit tests
* Rebased
* Docs regenerated

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
This reverts commit 7257c60.
Description
Currently ONNX Runtime supports NCHW as the default data layout. For optimisations and kernels that operate better in NHWC layout, or where the data is already in NHWC layout in the first place, Transpose nodes are added around the layers. This patch seeks to eliminate them for convolutions where they would cause a performance decrease.
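Concretely, the pattern being eliminated is the Transpose sandwich around each Conv. A toy Python sketch (hypothetical; nested lists stand in for tensors) of the two layout conversions whose round trip is pure overhead when the data is already NHWC:

```python
# Hypothetical sketch of the graph pattern this patch removes: with an
# NCHW-only Conv kernel, NHWC data needs a Transpose before and after
# the Conv; a channels_last Conv consumes the NHWC buffer directly.

def nhwc_to_nchw(t):  # t[n][h][w][c] -> out[n][c][h][w]
    return [[[[t[n][h][w][c] for w in range(len(t[0][0]))]
              for h in range(len(t[0]))]
             for c in range(len(t[0][0][0]))]
            for n in range(len(t))]

def nchw_to_nhwc(t):  # inverse transpose, t[n][c][h][w] -> out[n][h][w][c]
    return [[[[t[n][c][h][w] for c in range(len(t[0]))]
              for w in range(len(t[0][0][0]))]
             for h in range(len(t[0][0]))]
            for n in range(len(t))]

x = [[[[1, 2], [3, 4]],      # n=0: H=2, W=2, C=2 (NHWC)
      [[5, 6], [7, 8]]]]
# The old pattern costs two extra data movements around every Conv,
# and their round trip is the identity, i.e. wasted work.
assert nchw_to_nhwc(nhwc_to_nchw(x)) == x
```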
Motivation and Context
KleidiAI-specific implementation of this feature. Only supports convolutions; depthwise support to follow. Currently a little strict with the filters as a result.