Enable 2D block IO for tensor of pointers when the tensor in memory is contiguous #3482

Open
wants to merge 1 commit into base: main

Conversation

@chengjunlu (Contributor) commented Feb 21, 2025

General

This is an alternative way to lower tt.load with a tensor of pointers to 2D block IO.
It depends on the analysis results from ModuleAxisInfoAnalysis for the pointers and masks.

Background

Both the block pointer type (e.g. !tt.ptr<tensor<64x32xf16>>) and the tensor-of-pointers type (e.g. tensor<64x32x!tt.ptr<f16>>) are used to describe a tensor resident in global memory.
The difference is that a tensor of pointers carries more "entropy" than a block pointer: it can describe a tensor whose elements are scattered arbitrarily in global memory, and of course it can also describe the same structured distribution that a block pointer does.

There is already an optimization pass that raises a tensor of pointers to a block pointer, but it handles only some cases and has limitations.
The approach in this PR supports more cases, with fewer limitations, for lowering tt.load with a tensor of pointers to 2D block IO.
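
To make the distinction concrete, below is a minimal C++ sketch of how a lowering pattern can tell the two forms apart. It assumes the standard MLIR/Triton type utilities (triton::isTensorPointerType, triton::PointerType) and is only an illustration, not code from this PR.

#include "mlir/IR/BuiltinTypes.h"
#include "triton/Dialect/Triton/IR/Types.h"

// Block pointer, e.g. !tt.ptr<tensor<64x32xf16>>: the shape and strides live
// in the pointer itself, so the memory layout is structured by construction.
static bool isBlockPointer(mlir::Type ptrTy) {
  return mlir::triton::isTensorPointerType(ptrTy);
}

// Tensor of pointers, e.g. tensor<64x32x!tt.ptr<f16>>: one address per
// element, so the elements may be scattered arbitrarily in global memory.
static bool isTensorOfPointers(mlir::Type ptrTy) {
  auto tensorTy = mlir::dyn_cast<mlir::RankedTensorType>(ptrTy);
  return tensorTy &&
         mlir::isa<mlir::triton::PointerType>(tensorTy.getElementType());
}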

Idea

ModuleAxisInfoAnalysis computes per-axis information, such as contiguity, divisibility, and constancy, for the pointer and mask tensor values.
In general, we can use the 2D block IO lowering as long as:

  1. The contiguity of the pointers is a multiple of the threadsPerWarp size in the same dimension D.
  2. The constancy of the masks is a multiple of the threadsPerWarp size in the same dimension D.

We start with the case of tt.load with the DotOp and DPAS layouts, and will generalize the lowering code to more cases with LinearLayout in the future.
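
As a rough illustration of the two conditions above, the hedged C++ sketch below shows how the contiguity/constancy results from ModuleAxisInfoAnalysis could gate the 2D block IO path. The helper name canUse2DBlockIO and its parameters are illustrative placeholders, not the exact code in this PR.

#include "mlir/IR/Value.h"
#include "triton/Analysis/AxisInfo.h"

// Sketch: decide whether a tt.load with a tensor of pointers may use the 2D
// block IO lowering, based on the AxisInfo along the contiguous dimension D.
static bool canUse2DBlockIO(mlir::triton::ModuleAxisInfoAnalysis &axisAnalysis,
                            mlir::Value ptr, mlir::Value mask, unsigned dim,
                            unsigned threadsPerWarp) {
  mlir::triton::AxisInfo *ptrInfo = axisAnalysis.getAxisInfo(ptr);
  if (!ptrInfo)
    return false;
  // 1. Pointers: the contiguity along dim D must be a multiple of
  //    threadsPerWarp, so each warp reads one contiguous segment of memory.
  if (ptrInfo->getContiguity(dim) % threadsPerWarp != 0)
    return false;
  // 2. Masks (if present): the constancy along dim D must be a multiple of
  //    threadsPerWarp, so the whole warp sees a uniform predicate.
  if (mask) {
    mlir::triton::AxisInfo *maskInfo = axisAnalysis.getAxisInfo(mask);
    if (!maskInfo || maskInfo->getConstancy(dim) % threadsPerWarp != 0)
      return false;
  }
  return true;
}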

@chengjunlu (Contributor Author):

I will add more comprehensive LIT test cases.

@chengjunlu chengjunlu marked this pull request as draft February 21, 2025 07:14
@chengjunlu chengjunlu force-pushed the chengjun/tensorptr_blockio branch from 6e6f4c7 to 787dd41 on February 24, 2025 12:48
@chengjunlu chengjunlu marked this pull request as ready for review February 24, 2025 12:49
const bool memoryRowMajor = (memoryLayoutInfo == "row_major");

auto getOpIdx = [&]() -> DpasEncodingAttr::OpIdx {
if (hasDpasLayout) {
Contributor:
Since you have dpasLayout above, you can probably remove this branch. Then again, will we eventually be able to sync this function with the code in the rewrite tensor pointer load pass?

Contributor Author:
I will remove the DPAS branch first, for simplicity.

The idea is that we can finally unify the tt.load lowering to block IO in one pattern class. The plan is to do it in several steps:

  1. Unify the tt.load lowering to block IO for both tensor of pointers and block pointer.
  2. Support DPAS and DotOp with the DPAS layout.
  3. Support general cases with LinearLayout: different tensor shapes and different layouts of the returned value.

@LiyangLingIntel (Contributor):

Triggered https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/13561125937 to check the performance improvement.

@etiotto (Contributor) commented Mar 3, 2025

Have you tested what happens to a workload that uses masked loads (that are loop variant) and the mask switches from true to false in the middle of the loop iterations (or vice versa)?

Note: I have already enabled (by default) a pass that versions the loop for commonly masked loads. Given that it is in Triton, do we need these changes?

@chengjunlu (Contributor Author) commented Mar 3, 2025

Have you tested what happens to a workload that uses masked loads (that are loop variant)?

I used the matmul kernel in tutorial 09 to test the masked loads, which were not supported by the raise-block-pointer pass at that time, so we could only use the tensor of pointers. The first version reaches only about 30%~40% of the block pointer matmul's performance in the benchmark.

The masked loads with other values are lowered to if-else branches, which cannot be optimized by IGC.

It works functionally, but more effort is needed to improve the performance, such as integrating the versioning pass and some other optimizations.

@etiotto (Contributor) commented Mar 3, 2025

Have you tested what happens to a workload that uses masked loads (that are loop variant)?

I used the matmul kernel in tutorial 09 to test the masked loads, which were not supported by the raise-block-pointer pass at that time, so we could only use the tensor of pointers. The first version reaches only about 30%~40% of the block pointer matmul's performance in the benchmark.

The masked loads with other values are lowered to if-else branches, which cannot be optimized by IGC.

It works functionally, but more effort is needed to improve the performance, such as integrating the versioning pass and some other optimizations.

OK, this is promising. The loop versioning pass that landed last week (#3516) should version the loop, and the versioned loop will then contain unmasked tt.load operations, so IGC will not need to deal with branches in the loop. BTW, I targeted the loop versioning pass at tutorial 03, so at the moment I am not sure whether it will version the loop in tutorial 09 (but the pass can be extended).

return success();
}

private:
Contributor:
remove


LogicalResult
matchAndRewrite(triton::LoadOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {
Contributor:
override -> final

Comment on lines +515 to +518
auto encoding = tensorType.getEncoding();
const bool hasDpasLayout = isa<DpasEncodingAttr>(encoding);
if (!hasDpasLayout && !hasDotDpasEncoding(tensorType))
return failure();
Contributor:
Suggested change
auto encoding = tensorType.getEncoding();
const bool hasDpasLayout = isa<DpasEncodingAttr>(encoding);
if (!hasDpasLayout && !hasDotDpasEncoding(tensorType))
return failure();
const bool hasDpasLayout = hasDpasEncoding(tensorType);
if (!hasDpasLayout && !hasDotDpasEncoding(tensorType))
return failure();
auto encoding = cast<DpasEncodingAttr>(tensorType.getEncoding());

Comment on lines +528 to +535
auto getOpIdx = [&]() -> DpasEncodingAttr::OpIdx {
if (hasDpasLayout) {
return DpasEncodingAttr::OpIdx::OperandC;
} else {
auto dotLayout = getDotEncoding(tensorType).value();
return static_cast<DpasEncodingAttr::OpIdx>(dotLayout.getOpIdx());
}
};
Contributor:
Suggested change
auto getOpIdx = [&]() -> DpasEncodingAttr::OpIdx {
if (hasDpasLayout) {
return DpasEncodingAttr::OpIdx::OperandC;
} else {
auto dotLayout = getDotEncoding(tensorType).value();
return static_cast<DpasEncodingAttr::OpIdx>(dotLayout.getOpIdx());
}
};
auto getOpIdx = [&]() -> DpasEncodingAttr::OpIdx {
if (hasDpasLayout)
return DpasEncodingAttr::OpIdx::OperandC;
assert(hasDotDpasEncoding(tensorType) && "Expecting dot layout");
auto dotLayout = getDotEncoding(tensorType).value();
return static_cast<DpasEncodingAttr::OpIdx>(dotLayout.getOpIdx());
};

return static_cast<DpasEncodingAttr::OpIdx>(dotLayout.getOpIdx());
}
};
auto opIdx = getOpIdx();
Contributor:
auto -> DpasEncodingAttr::OpIdx

SmallVector<Value> ptrElems, maskElems, otherElems;
// Get the LLVM values for pointers
ptrElems = unpackLLElements(loc, llPtr, rewriter);
assert(ptrElems.size() == numElems);
Contributor:
Add assert message

Comment on lines +819 to +824
auto offsetOuter =
outer * repOuterStride +
rep * dpasInstShape[dimOuter] * numOperandsOuterDimPerLoad;
auto offsetInner = inner * dpasInstShape[dimInner];
auto offsetM = (isOperandA ? offsetOuter : offsetInner);
auto offsetN = (isOperandA ? offsetInner : offsetOuter);
Contributor:
Suggested change
auto offsetOuter =
outer * repOuterStride +
rep * dpasInstShape[dimOuter] * numOperandsOuterDimPerLoad;
auto offsetInner = inner * dpasInstShape[dimInner];
auto offsetM = (isOperandA ? offsetOuter : offsetInner);
auto offsetN = (isOperandA ? offsetInner : offsetOuter);
unsigned offsetOuter =
outer * repOuterStride +
rep * dpasInstShape[dimOuter] * numOperandsOuterDimPerLoad;
unsigned offsetInner = inner * dpasInstShape[dimInner];
unsigned offsetM = (isOperandA ? offsetOuter : offsetInner);
unsigned offsetN = (isOperandA ? offsetInner : offsetOuter);

pred = targetInfo.shuffleIdx(rewriter, loc, pred, 0);
Value other_ = b.undef(load2DGenXType);
if (others.size()) {
auto vecTy = vec_ty(eltTy, numValuesPerLoad * packedElemsNum);
Contributor:
Suggested change
auto vecTy = vec_ty(eltTy, numValuesPerLoad * packedElemsNum);
VectorType vecTy = vec_ty(eltTy, numValuesPerLoad * packedElemsNum);

Comment on lines +843 to +846
auto N = packedCol +
col * threadsPerWarp * numColPerPackedValue +
vblk * tileWidth + offsetN;
auto M = i + offsetM;
Contributor:
Suggested change
auto N = packedCol +
col * threadsPerWarp * numColPerPackedValue +
vblk * tileWidth + offsetN;
auto M = i + offsetM;
unsigned N = packedCol +
col * threadsPerWarp * numColPerPackedValue +
vblk * tileWidth + offsetN;
unsigned M = i + offsetM;

// RUN: triton-opt %s -split-input-file --intel-allocate-shared-memory --convert-triton-intel-gpu-to-llvm | FileCheck %s --implicit-check-not=llvm.inline_asm

// CHECK: llvm.func spir_funccc @_Z41intel_sub_group_2d_block_read_16b_8r16x2cPU3AS1viiiDv2_iPt
#mma = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [2, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
Contributor:
Rather long tests, can they be reduced and simplified please?

@etiotto (Contributor) left a comment:
I left several inline comments that should be addressed prior to merging this PR. The approach LGTM.

Successfully merging this pull request may close these issues.

[Performance] Enable 2D Block IO lowering for tt.load with tensor of pointer