[DRAFT][s4xbf16] JoinOp vectorization patch (Rebased) #25

ggengnv · 2025-04-09T20:39:11Z

This is one of the two patches needed to improve int4xbf16 GEMM perf.

This is needed because JoinOp by default interleaves the two input matrices element by element. In the case of bf16, this means Triton extracts the two bf16 values from each 32-bit register and re-inserts them into a new register. This results in many mov instructions before the MMA. On certain shapes, this can mean a ~10% perf penalty.
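To make the cost concrete, here is a minimal Python model of the situation described above. It is only a sketch, not Triton's actual lowering: the helper names `pack2`, `unpack2`, and `join_elementwise` are hypothetical, and it assumes the first element sits in the low 16 bits of the register.

```python
def pack2(lo: int, hi: int) -> int:
    """Pack two 16-bit values into one 32-bit word (assumed: lo in the low half)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack2(reg: int) -> tuple[int, int]:
    """Split a 32-bit word back into its (low, high) 16-bit halves."""
    return reg & 0xFFFF, (reg >> 16) & 0xFFFF


def join_elementwise(reg_a: int, reg_b: int) -> tuple[int, int]:
    """Element-wise interleave of two packed registers.

    Every output register mixes one half from reg_a with one half from
    reg_b, so both inputs must be unpacked and repacked.
    """
    a0, a1 = unpack2(reg_a)
    b0, b1 = unpack2(reg_b)
    return pack2(a0, b0), pack2(a1, b1)


# Placeholder 16-bit patterns standing in for bf16 values.
reg_a = pack2(0xA0, 0xA1)
reg_b = pack2(0xB0, 0xB1)
out = join_elementwise(reg_a, reg_b)
```

Neither output register equals either input register in this model, which is the extract-and-reinsert work that shows up as extra mov instructions.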

This PR addresses the above by situationally "vectorizing" the interleaving: it joins two elements at a time instead of one. This avoids the need to extract values out of registers. It also requires modifying the inline_asm logic that runs before the join so that it produces the correct layout.
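The difference between the two interleaving modes can be sketched on plain Python lists. This is an illustration under assumptions, not the code in this patch: `interleave` and its `group` parameter are hypothetical names, and the pre-permuted inputs `a_prime` and `b_prime` show only one possible reading of "produce the correct layout".

```python
def interleave(a, b, group=1):
    """Interleave a and b, taking `group` consecutive elements at a time."""
    assert len(a) == len(b) and len(a) % group == 0
    out = []
    for i in range(0, len(a), group):
        out.extend(a[i:i + group])
        out.extend(b[i:i + group])
    return out


a = ["a0", "a1", "a2", "a3"]
b = ["b0", "b1", "b2", "b3"]

# Default join: one element at a time.
per_element = interleave(a, b, group=1)

# "Vectorized" join: two elements at a time, so each pair stays together.
per_pair = interleave(a, b, group=2)

# Assumption: if the consumer still expects the per-element order, the
# producer before the join has to emit its values pre-permuted.
a_prime = ["a0", "b0", "a2", "b2"]
b_prime = ["a1", "b1", "a3", "b3"]
same_as_default = interleave(a_prime, b_prime, group=2)
```

With `group=2`, each pair of 16-bit values that shares a register moves as a unit, so the join itself no longer has to split registers. The reordering is pushed into the step that produces the inputs.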

cc @gflegar @loislo

Updating LLVM in order to pull in the following change:

- llvm/llvm-project#128566

For context, crash reproduction generation in MLIR will run the `PassManager`'s passes in a child thread. The above PR fixes crashes for when passes such as `add_di_scope` add `DistinctAttr` to the IR and their storage is then accessed later once the child thread joins. Pulling this in improves QoL for out-of-tree projects and makes the pass manager more robust to the use of `DistinctAttr`.

This pin update has also introduced the deprecation of a `llvm::TargetMachine::createTargetMachine` overload. I've updated the callsites to use the non-deprecated overloads.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `this PR only updates the LLVM pin, so CI is sufficient`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

abulavin and others added 2 commits April 3, 2025 16:37

Add heuristic for join to interleave every 2 elts

db4111f

ggengnv mentioned this pull request Apr 9, 2025

[DRAFT][s4xbf16] JoinOp vectorization patch #22

Closed

vwbaker force-pushed the llvm-head-staging branch from c629b06 to 017162e Compare April 22, 2025 14:35

gflegar force-pushed the llvm-head-staging branch from a4f5b2f to fe66e41 Compare May 6, 2025 13:10
