
Single matmul runtime segfault due to K dimension out of bound access #378

yifeizh2 opened this issue Oct 14, 2024 · 2 comments

@yifeizh2
Contributor

During the tuning phase, we observed an invalid config as follows:

module attributes {dlti.target_system_spec = #dlti.target_system_spec<"CPU" : #dlti.target_device_spec<#dlti.dl_entry<"L1_cache_size_in_bytes", 49152 : ui32>, #dlti.dl_entry<"L2_cache_size_in_bytes", 2097152 : ui64>, #dlti.dl_entry<"L3_cache_size_in_bytes", 110100480 : ui64>, #dlti.dl_entry<"num_threads", 56 : i32>, #dlti.dl_entry<"max_vector_width", 512 : i64>>>} {
  func.func @entry(%arg0: tensor<128x11008xbf16>, %arg1: tensor<11008x4096xbf16>) -> tensor<128x4096xbf16> attributes {llvm.emit_c_interface} {
    %cst = arith.constant 0.000000e+00 : bf16
    %0 = tensor.empty() : tensor<128x4096xbf16>
    %1 = linalg.fill ins(%cst : bf16) outs(%0 : tensor<128x4096xbf16>) -> tensor<128x4096xbf16>
    %2 = linalg.matmul {KBlock = 4096 : i32, KThreads = 2 : i32, MBlock = 32 : i32, MThreads = 1 : i32, NBlock = 32 : i32, NThreads = 28 : i32, cast = #linalg.type_fn<cast_signed>, innermostKBlock = 32 : i32, innermostMBlock = 32 : i32, innermostNBlock = 32 : i32} ins(%arg0, %arg1 : tensor<128x11008xbf16>, tensor<11008x4096xbf16>) outs(%1 : tensor<128x4096xbf16>) -> tensor<128x4096xbf16>
    return %2 : tensor<128x4096xbf16>
  }
}

With this config, the existing tiling logic does not correctly handle the boundary of the K dimension and generates code like:

          %19 = scf.for %arg10 = %c0 to %c172 step %c128 iter_args(%arg11 = %extracted_slice_8) -> (tensor<32x32xf32>) {
            %21 = affine.apply affine_map<(d0) -> (d0 * 32)>(%arg10)
            %extracted_slice_10 = tensor.extract_slice %extracted_slice_4[0, %21] [32, 4096] [1, 1] : tensor<32x5504xbf16> to tensor<32x4096xbf16>

which causes an out-of-bounds access at runtime: at the second iteration the slice starts at column 4096 and still spans 4096 columns, past the 5504 columns of the source tensor.
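For reference, a minimal Python sketch replaying the arithmetic of the generated loop (the variable names mirror the attributes in the IR above; the rest is illustrative, not code from the pass):

K = 11008
KThreads = 2
KBlock = 4096
innermostKBlock = 32

k_per_thread = K // KThreads             # 5504 columns per thread (tensor<32x5504xbf16>)
ub = k_per_thread // innermostKBlock     # 172, the loop upper bound %c172
step = KBlock // innermostKBlock         # 128, the loop step %c128

for iv in range(0, ub, step):
    col_offset = iv * innermostKBlock    # affine_map<(d0) -> (d0 * 32)>
    col_end = col_offset + KBlock        # the extracted slice is always [32, 4096]
    print(iv, col_offset, col_end, "OOB" if col_end > k_per_thread else "ok")

# iv = 0   reads columns [0, 4096)    -> ok
# iv = 128 reads columns [4096, 8192) -> out of bounds, only 5504 columns exist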

@yifeizh2 yifeizh2 self-assigned this Oct 14, 2024
@yifeizh2
Contributor Author

Synced with @zhczhong: even if we improve the tail-handling logic for the K dimension, it will produce dynamic shapes in brgemm, which cannot be lowered further. So we need to further restrict the tuning space.
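A hedged sketch of the kind of divisibility constraint that could prune such configs from the tuning space (the field names mirror the attributes in the IR above; the concrete check is an assumption, not the actual tuner code):

def config_keeps_static_k_tiles(K, cfg):
    # Reject configs whose K tiling would produce a tail tile,
    # since a tail tile turns into a dynamic shape in brgemm.
    if K % cfg["KThreads"] != 0:
        return False
    k_per_thread = K // cfg["KThreads"]
    return (k_per_thread % cfg["KBlock"] == 0
            and cfg["KBlock"] % cfg["innermostKBlock"] == 0)

# The config from the report fails: 11008 / 2 = 5504 is not a multiple of 4096.
print(config_keeps_static_k_tiles(11008, {"KThreads": 2, "KBlock": 4096, "innermostKBlock": 32}))  # False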

@kurapov-peter kurapov-peter added the bug Something isn't working label Oct 14, 2024
@ciyongch
Contributor

Adding all the restrictions needed to eliminate the possible "dynamic shapes" fed into brgemm would avoid the lowering issue above, but the surviving configs will give sub-optimal performance.
Since the "dynamic shape" here is mainly introduced by non-divisible dims during the tiling stage, and is actually a limited set of compile-time-known shapes, we may need to support this scenario to improve overall performance, as sketched below.
