v0.1.3

@LeiWang1999 released this 23 Mar 15:21
· 1043 commits to main since this release
f308c8a

What's Changed

  • [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
  • Add cpu jit with backend ctypes by @xs-keju in #154
  • [Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
  • [Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
  • [Enhancement] Improve CUDA path detection by @xwhzz in #157
  • [Refactor] Replace T.thread_binding with T.get_thread_binding in examples and test cases by @LeiWang1999 in #163
  • [Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
  • [Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
  • [Release] Bump version to v0.1.2.post1 by @LeiWang1999 in #166
  • Use SS-GEMM for PV in mla by @YouJiacheng in #165
  • [Example] Implement tilelang native sparse attention varlen example by @LeiWang1999 in #170
  • [Bugfix] Implement boundary check for the buffer shape with dynamic symbolic by @LeiWang1999 in #173
  • [AutoTune] Enable config-performance trace by @LeiWang1999 in #174
  • [Feat] Append Pass Context and TMA lowering configuration option by @LeiWang1999 in #175
  • [Feat] Introduce new caching mechanism for compiled kernels by @LeiWang1999 in #176
  • [Refactor] Enhance GPU Kernel Launch with Environment Thread Creation by @LeiWang1999 in #178
  • [Bugfix] Improve Thread Variable Handling in Layout Inference by @LeiWang1999 in #179
  • [Examples] Implement NSA Backward kernels by @LeiWang1999 in #180
  • [Enhancement] Optimize CMake build process with dynamic job count calculation by @LeiWang1999 in #183
  • [Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation by @LeiWang1999 in #185
  • [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug by @chengyupku in #188
  • [Dev] Add the failed nvcc command to the exception message by @penguin-wwy in #189
  • [Bugfix] Fix T.copy for scalar datatypes by @LeiWang1999 in #190
  • [Enhancement] Simplify GEMM example with direct kernel compilation by @LeiWang1999 in #191
  • [Bugfix] Make quickstart work properly on cu118 by @penguin-wwy in #193
  • [Language] Support clamp in language by @hyx1999 in #192
  • [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter by @chengyupku in #194
  • [Feature] Add TMA Store Synchronization Support by @chengyupku in #195
  • Update expired example code. by @66RING in #196
  • [CMake] Add CUDA Major Version Detection for Conditional Compilation by @chengyupku in #197
  • [Feature] Support Async Pipeline inference within if scope by @LeiWang1999 in #198
  • [Dev] Add new example for FlashAttention with pipelined execution by @chengyupku in #200
  • [Enhancement] Enhancing the handling of conditional statements in the pipeline by @LeiWang1999 in #201
  • [Feature] Upgrade cutlass version and support fp8 T.gemm by @zqh-wz in #202
  • [Docker] Update Dockerfiles to specify exact version of libstdcxx-ng by @LeiWang1999 in #203
  • [Dev] Add GQA backward example by @chengyupku in #205
  • [LICENSE] Typo fix in LICENSE by @LeiWang1999 in #208
  • [Enhancement] Allow mma fallback when wgmma is not supported by @LeiWang1999 in #206
  • [Examples] Expand tuning configurations for FlashAttention example by @chenghuaWang in #204
  • [Enhancement] Avoid tvm ffi handling when out_idx is specified by @LeiWang1999 in #209
  • [Fix] Fix K // block_K to T.ceildiv(K,block_K) and add tests by @hyx1999 in #210
  • [Dev] Implement IfStmtBinding and MergeIfStmt transformations by @chengyupku in #211
  • [Language] Introduce T.reshape and T.view by @LeiWang1999 in #212
  • [Enhancement] Improve device handling in Cython kernel adapter by @LeiWang1999 in #220
  • [Enhancement] Update format script to support force compare with upstream by @LeiWang1999 in #221
  • [Refactor] Introduce KernelParam integration across modules by @LeiWang1999 in #223
  • [Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper by @zqh-wz in #224
  • [Refactor] Update kernel compilation and profiling in examples by @chengyupku in #225
  • [Examples] Add fp8 gemm 2xAcc and deepgemm example by @cherichy in #217
  • [Doc] Add instructions for installing nightly version by @xwhzz in #226
  • [Bugfix] Disable force inline for ldmatrix by @LeiWang1999 in #227
  • [Bugfix] Support duplicate tma desc declaration by @LeiWang1999 in #228
  • [Refactor] Rename clamp functions and enhance dtype handling in tests by @LeiWang1999 in #232
  • [Enhancement] Simplify kernel source extraction in JIT adapters by @LeiWang1999 in #230
  • [Feature] Add reduce_max corresponding tests by @LeiWang1999 in #236
  • [BugFix] Fix bug of missing MBarrierExpectTX by @chengyupku in #241
  • [Refactor] Refactor for Better Layout Conflict Handling by @LeiWang1999 in #240
  • [Refactor] Align torch_assert_close tensor comparison with torch.testing.assert_close by @xwhzz in #239
  • [Dev] Implement FlashAttention3 Backward by @chengyupku in #244
  • [BugFix] Fix bug of mismatching dtype in testing by @xwhzz in #245
  • [Enhancement] Add zero initialization option to GEMM operations by @chengyupku in #246
  • [Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to minBlocksPerMultiprocessor by @cherichy in #248
  • [Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters by @Alex4210987 in #213
  • [Examples] Implement elementwise add kernel by @chenghuaWang in #219
  • [Refactor] Phase out LLVM Dependency by Making it Optional by @LeiWang1999 in #247
  • [Readme] Update Bib Citation Section by @LeiWang1999 in #249
  • [Enhancement] Support float variable as arguments by @LeiWang1999 in #250
  • add autotune to example_gemm.py by @yyttt6 in #252
  • [Language] Introduce T.alloc_var to define a variable like int var; by @LeiWang1999 in #255
  • [Example] Implement Kernel Example cumsum by @LeiWang1999 in #258
  • [Refactor] Refactor CUDA post-processing callback registration in TileLang by @LeiWang1999 in #259
  • [Refactor] Move compilation outside critical section by @YouJiacheng in #260
  • [CI] Use auditwheel to generate manylinux wheels by @oraluben in #251
  • [Bugfix] Fix Benchmark/Example Code for Autotuning by @SiriusNEO in #254
  • [Language] Enhance alias to support blockwise memory load by @LeiWang1999 in #261
  • [Bugfix] Fix auto tuning tma handling by @LeiWang1999 in #263
  • [Release] Bump version to 0.1.3 by @LeiWang1999 in #264
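
The fix in #210 replaces `K // block_K` with `T.ceildiv(K, block_K)`. As a rough illustration of why this matters, the plain-Python sketch below compares floor and ceiling division when `K` is not a multiple of `block_K`. The `ceildiv` helper and the sizes are hypothetical stand-ins for illustration only; this is not tilelang's implementation.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for positive b, using integer arithmetic only.

    Hypothetical helper that mirrors the arithmetic of a ceildiv call.
    """
    return (a + b - 1) // b


# Hypothetical sizes: K is deliberately not a multiple of block_K.
K, block_K = 1000, 128

floor_tiles = K // block_K        # floor division drops the partial last tile
ceil_tiles = ceildiv(K, block_K)  # ceiling division keeps it

print("floor:", floor_tiles, "tiles cover", floor_tiles * block_K, "of", K)
print("ceil: ", ceil_tiles, "tiles cover", ceil_tiles * block_K, "of", K)
```

With floor division the loop over K-tiles stops before the trailing 104 elements; with ceiling division the final, partially filled tile is included and would need boundary handling.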

Full Changelog: v0.1.2...v0.1.3