v0.1.3
What's Changed
- [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
- Add cpu jit with backend ctypes by @xs-keju in #154
- [Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
- [Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
- [Enhancement] Improve CUDA path detection by @xwhzz in #157
- [Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases by @LeiWang1999 in #163
- [Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
- [Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
- [Release] Bump version to v0.1.2.post1 by @LeiWang1999 in #166
- Use SS-GEMM for PV in mla by @YouJiacheng in #165
- [Example] Implement tilelang native sparse attention varlen example by @LeiWang1999 in #170
- [Bugfix] Implement boundary check for the buffer shape with dynamic symbolic by @LeiWang1999 in #173
- [AutoTune] Enable config-performance trace by @LeiWang1999 in #174
- [Feat] Append Pass Context and TMA lowering configuration option by @LeiWang1999 in #175
- [Feat] Introduce new caching mechanism for compiled kernels by @LeiWang1999 in #176
- [Refactor] Enhance GPU Kernel Launch with Environment Thread Creation by @LeiWang1999 in #178
- [Bugfix] Improve Thread Variable Handling in Layout Inference by @LeiWang1999 in #179
- [Examples] Implement NSA Backward kernels by @LeiWang1999 in #180
- [Enhancement] Optimize CMake build process with dynamic job count calculation by @LeiWang1999 in #183
- [Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation by @LeiWang1999 in #185
- [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug by @chengyupku in #188
- [Dev] Add the failed nvcc command to the exception message by @penguin-wwy in #189
- [Bugfix] Fix `T.copy` for scalar datatypes by @LeiWang1999 in #190
- [Enhancement] Simplify GEMM example with direct kernel compilation by @LeiWang1999 in #191
- [Bugfix] Make quickstart work properly on cu118 by @penguin-wwy in #193
- [Language] Support clamp in language by @hyx1999 in #192
- [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter by @chengyupku in #194
- [Feature] Add TMA Store Synchronization Support by @chengyupku in #195
- Update expired example code. by @66RING in #196
- [CMake] Add CUDA Major Version Detection for Conditional Compilation by @chengyupku in #197
- [Feature] Support Async Pipeline inference within if scope by @LeiWang1999 in #198
- [Dev] Add new example for FlashAttention with pipelined execution by @chengyupku in #200
- [Enhancement] Enhancing the handling of conditional statements in the pipeline by @LeiWang1999 in #201
- [Feature] Upgrade cutlass version and support fp8 T.gemm by @zqh-wz in #202
- [Docker] Update Dockerfiles to specify exact version of libstdcxx-ng by @LeiWang1999 in #203
- [Dev] Add GQA backward example by @chengyupku in #205
- [LICENSE] Typo fix in LICENSE by @LeiWang1999 in #208
- [Enhancement] Allow mma fallback when wgmma is not supported by @LeiWang1999 in #206
- [Examples] Expand tuning configurations for FlashAttention example by @chenghuaWang in #204
- [Enhancement] Avoid tvm ffi handling when out_idx is specified by @LeiWang1999 in #209
- [Fix] Fix K // block_K to T.ceildiv(K,block_K) and add tests by @hyx1999 in #210
- [Dev] Implement IfStmtBinding and MergeIfStmt transformations by @chengyupku in #211
- [Language] Introduce `T.reshape` and `T.view` by @LeiWang1999 in #212
- [Enhancement] Improve device handling in Cython kernel adapter by @LeiWang1999 in #220
- [Enhancement] Update format script to support force compare with upstream by @LeiWang1999 in #221
- [Refactor] Introduce KernelParam integration across modules by @LeiWang1999 in #223
- [Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper by @zqh-wz in #224
- [Refactor] Update kernel compilation and profiling in examples by @chengyupku in #225
- [Examples] Add fp8 gemm 2xAcc and deepgemm example by @cherichy in #217
- [Doc] Add instructions for installing nightly version by @xwhzz in #226
- [Bugfix] Disable force inline for ldmatrix by @LeiWang1999 in #227
- [Bugfix] Support duplicate tma desc declaration by @LeiWang1999 in #228
- [Refactor] Rename clamp functions and enhance dtype handling in tests by @LeiWang1999 in #232
- [Enhancement] Simplify kernel source extraction in JIT adapters by @LeiWang1999 in #230
- [Feature] Add reduce_max corresponding tests by @LeiWang1999 in #236
- [BugFix] Fix bug of missing MBarrierExpectTX by @chengyupku in #241
- [Refactor] Refactor for Better Layout Conflict Handling by @LeiWang1999 in #240
- [Refactor] Align torch_assert_close tensor comparison with torch.testing.assert_close by @xwhzz in #239
- [Dev] Implement FlashAttention3 Backward by @chengyupku in #244
- [BugFix] Fix bug of mismatching dtype in testing by @xwhzz in #245
- [Enhancement] Add zero initialization option to GEMM operations by @chengyupku in #246
- [Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to `minBlocksPerMultiprocesor` by @cherichy in #248
- [Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters by @Alex4210987 in #213
- [Examples] Implement elementwise add kernel by @chenghuaWang in #219
- [Refactor] Phase out LLVM Dependency by Making it Optional by @LeiWang1999 in #247
- [Readme] Update Bib Citation Section by @LeiWang1999 in #249
- [Enhancement] Support float variable as arguments by @LeiWang1999 in #250
- add autotune to example_gemm.py by @yyttt6 in #252
- [Language] Introduce `T.alloc_var` to define a variable like `int var;` by @LeiWang1999 in #255
- [Example] Implement Kernel Example cumsum by @LeiWang1999 in #258
- [Refactor] Refactor CUDA post-processing callback registration in TileLang by @LeiWang1999 in #259
- [Refactor] Move compilation outside critical section by @YouJiacheng in #260
- [CI] Use auditwheel to generate manylinux wheels by @oraluben in #251
- [Bugfix] Fix Benchmark/Example Code for Autotuning by @SiriusNEO in #254
- [Language] Enhance alias to support blockwise memory load by @LeiWang1999 in #261
- [Bugfix] Fix auto tuning tma handling by @LeiWang1999 in #263
- [Release] Bump version to 0.1.3 by @LeiWang1999 in #264
New Contributors
- @xs-keju made their first contribution in #154
- @YouJiacheng made their first contribution in #165
- @penguin-wwy made their first contribution in #189
- @hyx1999 made their first contribution in #192
- @66RING made their first contribution in #196
- @zqh-wz made their first contribution in #202
- @chenghuaWang made their first contribution in #204
- @cherichy made their first contribution in #217
- @Alex4210987 made their first contribution in #213
- @yyttt6 made their first contribution in #252
- @oraluben made their first contribution in #251
Full Changelog: v0.1.2...v0.1.3