[IterativeTilingAndFusionPass] Wrap linalg.ops in a loop even if the shape is smaller than min tiling size #332

Open
dchigarev opened this issue Sep 10, 2024 · 1 comment


When the shape of a linalg operation is smaller than or equal to the minimal tile size (which is 32), the operation is left untouched. That is a problem because our GPU pipeline expects a for-loop (which will later describe a launch grid) after the IterativeTilingAndFusion pass. If there is no loop, the pipeline breaks.

For stability, I would expect such operations to be wrapped in a single-iteration for-loop so that the pipeline keeps working even in these corner cases:

func.func @linalg_matmul(%arg0: tensor<32x32xf16>, %arg1: tensor<32x32xf16>,
                         %arg2: tensor<32x32xf16>) -> tensor<32x32xf16> {
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<32x32xf16>, tensor<32x32xf16>)
                     outs(%arg2 : tensor<32x32xf16>) -> tensor<32x32xf16>
  return %0 : tensor<32x32xf16>
}

// Expected output (a tiling for loop consisting of one iteration):
func.func @linalg_matmul(%arg0: tensor<32x32xf16>, %arg1: tensor<32x32xf16>, %arg2: tensor<32x32xf16>) -> tensor<32x32xf16> {
  %0 = scf.forall (%arg3, %arg4) = (0, 0) to (32, 32) step (32, 32) shared_outs(%arg5 = %arg2) -> (tensor<32x32xf16>) {
    %extracted_slice = tensor.extract_slice %arg0[%arg3, 0] [32, 32] [1, 1] : tensor<32x32xf16> to tensor<32x32xf16>
    %extracted_slice_0 = tensor.extract_slice %arg1[0, %arg4] [32, 32] [1, 1] : tensor<32x32xf16> to tensor<32x32xf16>
    %extracted_slice_1 = tensor.extract_slice %arg5[%arg3, %arg4] [32, 32] [1, 1] : tensor<32x32xf16> to tensor<32x32xf16>
    %1 = linalg.matmul ins(%extracted_slice, %extracted_slice_0 : tensor<32x32xf16>, tensor<32x32xf16>) outs(%extracted_slice_1 : tensor<32x32xf16>) -> tensor<32x32xf16>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %1 into %arg5[%arg3, %arg4] [32, 32] [1, 1] : tensor<32x32xf16> into tensor<32x32xf16>
    }
  }
  return %0 : tensor<32x32xf16>
}
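
To make the "one iteration" claim concrete, here is a minimal sketch of the loop arithmetic for a single dimension of the scf.forall above. The helper names `tripCount` and `tileOffsets` are hypothetical and are not taken from the pass; the sketch only assumes a loop that starts at 0 with a positive step.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the pass): number of iterations of a
// loop running from 0 (inclusive) to `extent` (exclusive) with stride `step`.
int64_t tripCount(int64_t extent, int64_t step) {
  assert(extent >= 0 && step > 0 && "expects non-negative extent, positive step");
  return (extent + step - 1) / step; // ceiling division
}

// Hypothetical helper: every value the induction variable takes, i.e. the
// offsets that would be passed to tensor.extract_slice for this dimension.
std::vector<int64_t> tileOffsets(int64_t extent, int64_t step) {
  assert(extent >= 0 && step > 0 && "expects non-negative extent, positive step");
  std::vector<int64_t> offsets;
  for (int64_t iv = 0; iv < extent; iv += step)
    offsets.push_back(iv);
  return offsets;
}
```

For the 32x32 matmul with step 32, `tripCount(32, 32)` is 1 and `tileOffsets(32, 32)` contains only 0 in each dimension, so the loop body runs once over the whole tensor.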

P.S. This is not critical, as in real-life scenarios we are unlikely to encounter ops with such small shapes.

Yun-Fly commented Sep 24, 2024

Hi @dchigarev, sorry for the late reply.

Let me explain a bit more using your example:

func.func @linalg_matmul(%arg0: tensor<32x32xf16>, %arg1: tensor<32x32xf16>,
                         %arg2: tensor<32x32xf16>) -> tensor<32x32xf16> {
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<32x32xf16>, tensor<32x32xf16>)
                     outs(%arg2 : tensor<32x32xf16>) -> tensor<32x32xf16>
  return %0 : tensor<32x32xf16>
}

First of all, you can comment out lines 711-712 in IterativeTilingAndFusion.cpp to get your expected output:

// if (*tripCount == ts)
//       break;

The reason I added this limitation is that, without it, tiling would generate a non-constant offset, i.e. %arg3 in %extracted_slice = tensor.extract_slice %arg0[%arg3, 0], which in fact always equals the constant value 0. Some tilingInterface implementations, such as packOp or unPackOp, expect this offset to be constant. Otherwise, fusion may break.
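
As an illustration of why the offset is "constant in fact", the sketch below models a slice offset as either statically known or dynamic, and recovers the static value when a loop dimension runs exactly once. This is not the pass's implementation and the thread does not propose it as a fix; `Offset`, `LoopDim` and `staticOffsetIfSingleIteration` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical model: an offset holds a value when it is statically known
// and is empty when it is only available as a runtime induction variable.
using Offset = std::optional<int64_t>;

// Hypothetical description of one loop dimension: [lowerBound, upperBound)
// traversed with stride `step`.
struct LoopDim {
  int64_t lowerBound;
  int64_t upperBound;
  int64_t step;
};

// If the dimension executes exactly once, the induction variable can only
// ever equal the lower bound, so the offset is a known constant. With more
// than one iteration (or none) no single static offset exists.
Offset staticOffsetIfSingleIteration(const LoopDim &dim) {
  assert(dim.step > 0 && "expects a positive step");
  const int64_t span = dim.upperBound - dim.lowerBound;
  const int64_t trips = span <= 0 ? 0 : (span + dim.step - 1) / dim.step;
  if (trips == 1)
    return dim.lowerBound;
  return std::nullopt;
}
```

For the loop dimension `0 to 32 step 32` from the example, the function returns 0, matching the observation that %arg3 is always 0 even though it appears in the IR as a loop variable rather than a literal.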

BTW, for lowering a single linalgOp in your context, I think this limitation is unnecessary :)
