v0.1.7

@LeiWang1999 LeiWang1999 released this 07 Dec 03:09
· 105 commits to main since this release
305c854

What's Changed

  • [PATCH] Static libg++ linking fix by @LeiWang1999 in #854
  • [Analyzer] Enhance ConstIntBoundAnalyzer and IntervalSet with modular set analysis by @LeiWang1999 in #856
  • [Doc] Optimize the quickstart guide for clarity and not just for CUDA by @LeiWang1999 in #858
  • [TMA] Bugfix when a shared buffer is both issued with tma store and tma load by @LeiWang1999 in #857
  • [AMD][MLA] Fix mla autotune for rocm by @LeiWang1999 in #861
  • [Bugfix] Ensure correct handling for cases where seq_q<seq_kv in flash attention examples by @Rachmanino in #864
  • [AMD] refactor MatrixCoreIntrinEmitter by @Paran0idy in #860
  • [Feat] Add fast sine and cosine definitions in CUDA templates by @Rachmanino in #865
  • [Layout] Support layout forward with multi dimension by @LeiWang1999 in #867
  • [Autotune][Conv] optimize convolution examples to use autotune by @LeiWang1999 in #866
  • [Example] Add examples to support efficient attention sink forward process by @Rachmanino in #853
  • [Parser] Adapt Parser to work with Python 3.8 in some cases by @LeiWang1999 in #869
  • [Fix] Fix bug 0905: tilelang doesn't vectorize B[i,j] = c[i] + A[i,j] by @kurisu6912 in #798
  • [Language] Support sequence comparisons by @LeiWang1999 in #872
  • [Language] Support loop_break primitive by @chengyupku in #873
  • [Bugfix] Use ExprDeepEqual instead of StructuralEqual when merge consecutive If stmt by @LeiWang1999 in #876
  • [Language] Support atomic add with ret by @LeiWang1999 in #870
  • [Cython] Remove an incorrect check by @LJC00118 in #880
  • Update amd_ci.yml by @Alex4210987 in #881
  • [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke by @LeiWang1999 in #875
  • [Example] Add efficient attention sink backward implementations and tests by @Rachmanino in #877
  • [Precision] Introduce T.ieee_rsqrt and related high precision op by @LeiWang1999 in #882
  • [Dist] Provide an option to include commit ID in version by @LeiWang1999 in #884
  • [Example] Optimize sink attention forward via swizzled layout and report benchmark results by @Rachmanino in #885
  • [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop by @LeiWang1999 in #844
  • [Bugfix][Enhancement] Fix a bug in previous commit and enhance cuda backend by @Hamerlate in #887
  • [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst by @Rachmanino in #888
  • [Layout] Fix plot layout by @Paran0idy in #890
  • [Example] Add example by @LeiWang1999 in #894
  • [News] Add announcement of support for Huawei Ascend chips by @xwhzz in #895
  • [Example] Add sparse mla examples by @LeiWang1999 in #896
  • [Typo] Fix backend name for Huawei Ascend by @xwhzz in #898
  • [CI] Legalize math related test by @LeiWang1999 in #899
  • [Bugfix] Fix flops comp and softmax scale in mla by @Edenzzzz in #900
  • [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples by @LeiWang1999 in #913
  • [CI] optimize CI time for sparse gemm by @botbw in #906
  • [Enhancement] Include compile flags into the hash key of cached kernels by @Rachmanino in #911
  • [Bugfix] Fix saving kernel source code where JITKernel.artifact is None by @zjudmd1015 in #921
  • [CI] Refactor import paths in dequantization examples to use dequantize_utils by @LeiWang1999 in #914
  • [Example] Add MLA decode ws example by @chengyupku in #928
  • [CI] Fix documentation runner by adding 'nvidia' tag by @xwhzz in #927
  • [Layout] Strict annotate completed replicated layout for fragment with constant index by @LeiWang1999 in #929
  • [Bugfix] Fix tensor memory copy layout by @Hamerlate in #933
  • [Example] Optimize online_softmax example by @lijinpei in #934
  • [Example] Add correctness assert into dsa example by @LeiWang1999 in #937
  • [Enhancement] Enhance and add new GQA backward examples for Hopper by @Rachmanino in #930
  • [Enhancement] Fix lint to improve grouped GEMM performance with TMA by @Cunxiao2002 in #938
  • [Example] Introduce split+sum template, and optimize atomic_add performance for bwd examples by @LeiWang1999 in #940
  • [Example] Disable TMA and enable FastMath for NSA Examples (#941) by @LeiWang1999 in #941
  • [Example] Revert the atomic/split&sum templates in MHA backward examples by @Rachmanino in #943
  • [Example] Add sparse mla bwd example for deepseek_v32 by @Zhichenzzz in #919
  • [Profiler] Adds CUPTI profiler support by @Cunxiao2002 in #936
  • [Enhancement] Support Copy for Buffer Load with scalar indices by @LeiWang1999 in #946
  • [Code Style] Refine nvrtc compile related check style by @BBuf in #945
  • [Backend] Add metal backend by @oraluben in #799
  • [CI] enable dependabot for GHA workflows by @XuehaiPan in #950
  • Modify the SM architecture number to support Thor’s sm110. by @iloveai8086 in #957
  • [CI] auto-cancel in-progress PR CI when new commits are pushed by @XuehaiPan in #956
  • [bug] fix type object is not subscriptable in py38 by @BBuf in #959
  • [Bugfix][Doc] Add astroid version constraint to requirements.txt by @xwhzz in #958
  • [CI]: Bump actions/setup-python from 2 to 6 by @dependabot[bot] in #951
  • [CI]: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #952
  • [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #954
  • [CI]: Bump actions/checkout from 2 to 5 by @dependabot[bot] in #953
  • [TileOp] Implement WGMMA for T.gemm_v2 by @LeiWang1999 in #813
  • [Docs] add CODE_OF_CONDUCT.md by @XuehaiPan in #965
  • [Example] Add support for bfloat16 and user-defined sm_scale in attention sink examples by @Rachmanino in #924
  • [Bugfix] Do not force inline let stmt by @LeiWang1999 in #947
  • [CI] add pre-commit integration by @XuehaiPan in #955
  • [Doc] Install docs add docker install method by @BBuf in #961
  • [Bugfix] Fix dummy kernel compilation by @SiriusNEO in #962
  • [CI][Refactor] Refactor non-test CI workflow files by @XuehaiPan in #971
  • [TileOp] Implement CumSum1D by @LeiWang1999 in #978
  • [Language] Enhance T.alloc_var for AugAssign and AnnAssign by @LeiWang1999 in #979
  • [Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend by @LeiWang1999 in #977
  • [Typo] Remove debug print by @LeiWang1999 in #980
  • [Bugfix] Use access_ptr("r") instead of access_ptr("w") for correct pipeline analysis by @LeiWang1999 in #983
  • [Feature][Example] Support TMA reduce operation and update GQA bwd example by @chengyupku in #969
  • [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) by @Degeneracy-Evil in #976
  • [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures by @tzj-fxz in #984
  • [Bugfix] Fallback torch.accelerator.synchronize() to torch.cuda.synchronize() by @yyttt6 in #987
  • [Bugfix]:Fix atomicadd auto vectorize identify var error by @yyttt6 in #883
  • [CI] Speed up sparse tensor core test via vectorized generating sparse data by @LeiWang1999 in #1009
  • [Build] Migrate to scikit-build-core by @oraluben in #939
  • [CI] Removes redundant environment variable by @Cunxiao2002 in #1020
  • [Transform] Migrate LowerIntrin from tvm into tilelang by @LeiWang1999 in #999
  • [Lint] Prefer American English spelling by @XuehaiPan in #1022
  • [Build] Prefer libs from local build dir by @oraluben in #1027
  • [Language] Support Consequential assignments like 'a = b = c = 1' by @LeiWang1999 in #992
  • [CI] Removes debug print statements from the example. by @Cunxiao2002 in #1030
  • [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation by @Rachmanino in #1023
  • [Bugfix] Recover code for flexible parallel by @LeiWang1999 in #1032
  • [CI] Disable buggy(maybe) warp specialized kernel ci test for H20 by @LeiWang1999 in #1033
  • [TIR] Revert some changes of Pass LowerIntrin by @LeiWang1999 in #1035
  • [Env] Optimize the mechanism for locating TL_LIBS by @LeiWang1999 in #1038
  • [CUDA] Add pack functions for FP8 types by @LJC00118 in #967
  • [Language] Expose T.get_warp_idx_sync and T.shuffle_elect for efficient thread election by @LeiWang1999 in #989
  • [AMD] fix bug&add amd fp8 examples by @Alex4210987 in #966
  • [CI][Refactor] Merge test CI workflow files into one by @XuehaiPan in #973
  • [BugFix] Phaseout dependency of Triton in sink examples to make CI happy by @Rachmanino in #1045
  • [Refactor] Use has_simt_copy to decide whether to insert set_max_nreg by @chengyupku in #982
  • [Feature]: Add test for atomicadd auto vectorize and remove useless code by @yyttt6 in #1019
  • Allow mma gemm for all cuda arch by @oraluben in #1047
  • [Bugfix] Improves compatibility when checking for MPS availability in different PyTorch builds. by @LeiWang1999 in #1051
  • [CI] Fix ROCm CI by @XuehaiPan in #1043
  • [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper by @Rachmanino in #1024
  • Automatically initialize submodule if missing by @LeiWang1999 in #1052
  • [Enhancement] Remove constraint requiring last dimension stride to be 1 by @LJC00118 in #1040
  • [CI] Disable autofix for pre-commit CI by @LeiWang1999 in #1053
  • [Enhancement] Improve CUDA compiler detection in CMake by @LJC00118 in #1054
  • [Enhancement] Introduce a workaround for layout inference for local buffer store by @LeiWang1999 in #1055
  • [Refactor] Refactor Pass LegalizeSafeMemoryAccess to support recursive load/store rewrite by @SiriusNEO in #1050
  • Making version parser more robust against missing or unavailable metadata by @LeiWang1999 in #1061
  • [DOC] Add document for develop with PYTHONPATH by @LeiWang1999 in #1062
  • [CI]:Reduce test shapes to avoid OOM errors during CI. by @yyttt6 in #1060
  • [Benchmark] Add H800 SXM Benchmark results by @LeiWang1999 in #1063
  • [Misc] Add GitHub issue templates by @XuehaiPan in #1057
  • [Refactor][Example] Update linear attention examples and add tests by @Rachmanino in #1010
  • [Enhancement] Deprecate split&sum in attn bwd examples on Hopper by @Rachmanino in #1065
  • [Benchmark] Add matmul FP16 benchmark results by @LeiWang1999 in #1067
  • [CI]: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #1070
  • [Example] Update GQA varlen fwd and MHA varlen fwd by @chengyupku in #1071
  • [Parallel] Support T.Parallel with dynamic extents by @LeiWang1999 in #990
  • [Layout] Utilizing IsEqual instead of StructuralEqual by @LeiWang1999 in #1073
  • [Cache] raise errors for tilelang.clear_cache() by @LeiWang1999 in #1077
  • [Feature] Support Reduce operators for bitwise and/or/xor by @tzj-fxz in #1074
  • [Autotune] Add autotune coverage for symbolic M and normalize cache key by @LeiWang1999 in #1075
  • [Language] Recommend using T.dynamic instead of T.symbolic by @LeiWang1999 in #1076
  • [Language] Efficient T.reduce_ with shared memory input/output by @LeiWang1999 in #1080
  • [Bugfix] Fix missing reg alloc in custom warp specialization by @chengyupku in #1084
  • [Enhancement] Update async intrinsic handling in inject_fence_proxy by @Rachmanino in #1068
  • [Feature] Add GQA backward kernel with varlen input by @tzj-fxz in #1082
  • [BugFix] Add memory order argument for non-vectorized atomic add by @tzj-fxz in #1081
  • [Refactor] Rename cython output to tilelang_cython and relocate its path by @LeiWang1999 in #1086
  • [Target] Enhance target selection helpers and documentation by @LeiWang1999 in #1085
  • [Cleanup] Remove tilelang.disable_cache() calls from examples and tests by @Rachmanino in #1088
  • [PassConfig] Introduce PassConfig TL_STORAGE_REWRITE_DETECT_INPLACE by @LeiWang1999 in #1089
  • [Language] Support tilelang alloc_var(dtype, init=x) by @LeiWang1999 in #1092
  • [Bugfix] Fix missing host cuTensorMapEncodeIm2col call by @chengyupku in #1094
  • [GQA] Add regional atomic add to slightly boost performance by @tzj-fxz in #1093
  • [Example] Add block level high performance gemv example by @LeiWang1999 in #1097
  • [Refactor] Optimize debug message for parallel inference by @LeiWang1999 in #1096
  • [CI][Lint] Retire format.sh and add clang-tidy to GHA workflow by @XuehaiPan in #1044
  • [Refactor] Use forceinline in ldmatrix and update mamba scan kernel by @chengyupku in #1104
  • [Maint] Update uncommitted change detection command in format.sh by @XuehaiPan in #1102
  • [Benchmark] Add Mamba2_chunk_scan benchmark by @chengyupku in #1109
  • [Benchmark] Update Mamba2_chunk_scan benchmark by @chengyupku in #1110
  • [Lint] Enable pyupgrade linter in ruff by @oraluben in #963
  • [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logi by @LeiWang1999 in #1111
  • [Feature] Enhance vectorized conversion support in CUDA codegen by @Rachmanino in #1095
  • [Feature] Support None type as input for T.ptr and T.Tensor by @xwhzz in #1114
  • [Bugfix] Resolve mixed stride dtype issue (inconsistent int32/int64 values) by @LeiWang1999 in #1119
  • [Feature] Add memory_order PTX for vectorized atomic add by @tzj-fxz in #1112
  • [CI]: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1128
  • [CI]: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1127
  • [Enhancement] Add missing fence_barrier_init primitive after mbarrier init by @chengyupku in #1121
  • [Feature]:Add device assert by @yyttt6 in #1116
  • [Build][CI] Build and test SDist in release CI by @XuehaiPan in #1098
  • [Benchmark] Update triton and helion baselines in mamba-chunk-scan by @chengyupku in #1131
  • Add int2 and longlong4 pack functions by @LJC00118 in #1129
  • [BugFix] Add memory order and testing script for split version GQA bwd kernel by @tzj-fxz in #1100
  • [Bugfix] Correctly construct the argument list for atomic add based on the vector size by @LeiWang1999 in #1137
  • [AMD] Support T.gemm_v2 for AMD Backend by @Paran0idy in #1136
  • [BugFix] alloc_var init failed to handle complex expression by @kurisu6912 in #1144
  • [Refactor] Remove amd gemm_v2 tests by @LeiWang1999 in #1149
  • [BugFix] Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values by @Rachmanino in #1143
  • [Bugfix] Implement classic arena algorithm for shmem merge and WAW conflict detection by @LeiWang1999 in #1146
  • [CI] allow dirty workspace for format.sh and introduce loop carry thread sync unit test by @LeiWang1999 in #1153
  • [CI] use Python urllib to download file instead of Wget by @XuehaiPan in #1154
  • [BugFix] Correct direct copy from bf16 to fp8 by @Cunxiao2002 in #1090
  • [Refactor]:Move device_assert from extern_call to intrin_call by @yyttt6 in #1134
  • [Enhancement] Enhance Cast operations Vectorization by @LJC00118 in #1156
  • [Bugfix] Enhance LetStmt handling in Vectorize Loop Pass by @LeiWang1999 in #1159
  • [Release] Bump version to v0.1.6.post2 by @LeiWang1999 in #1160
  • [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi by @LeiWang1999 in #1108
  • [Bugfix] Enable code lowering with producer‑copy‑only program by @LeiWang1999 in #1168
  • [Bugfix] Support 16bits shfl_sync by @LeiWang1999 in #1169
  • [Testing] Move TMA 1D and test for its functionality by @tzj-fxz in #1167
  • [Refactor]: Change the params in pytest to avoid oom error during ci by @yyttt6 in #1170
  • [Bugfix] Fix tvm import path for editable build by @LeiWang1999 in #1172
  • [Language] Expose T.warpgroup_fence_operand for nvcc code motion by @LeiWang1999 in #986
  • [Language] Add Correctness and performance check scripts for V2 by @LeiWang1999 in #1174
  • [Bugfix] Legalize Datatype for mma intrinsic codegen by @LeiWang1999 in #1179
  • [CI]: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1177
  • [CI]: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1178
  • [Language] Initial version of tilelang frontend v2 by @kurisu6912 in #1120
  • [Fix] fix type incompatible error in #1115 by @kurisu6912 in #1180
  • [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1183
  • [Fix] Remove unsupported type params by @kurisu6912 in #1186
  • [Feature] Enhance fill operation to support various buffer types by @LeiWang1999 in #1189
  • [Refactor] Improve Python3.9 compatibility for ParamSpec and Self by @LeiWang1999 in #1190
  • [Feat] Add swap like grammar in tuple assignment by @kurisu6912 in #1185
  • [Release] Unify local build scripts to use cibuildwheel and reduce size of sdist by @oraluben in #1171
  • [Language] Support n>256 for v2 by @LeiWang1999 in #1182
  • [GQA] Use TMA in GQA bwd kernel to boost performance by @tzj-fxz in #1176
  • [Example] Update GQA varlen fwd by @chengyupku in #1173
  • [Refactor] Dynamic registration of FP8 data type for compatibility with older PyTorch versions by @LeiWang1999 in #1197
  • [Feature] Add tl.infinity operator for infinity handling of bfloat16 by @Rachmanino in #1175
  • [SM70] Refactor and minor fix for SM70 by @LeiWang1999 in #1195
  • [CI] Enable ccache for CIBW on Linux by @oraluben in #1184
  • [Feat] Add support for T.serial with step and negative step by @kurisu6912 in #1188
  • [Feat] Add A Pass to Handle Negative Index by @kurisu6912 in #1192
  • Fix type errors in reduce.h by @LJC00118 in #1204
  • [Bugfix] Improves the accuracy of dependency analysis in the storage access by @LeiWang1999 in #1205
  • [Bugfix][Language V2] Capture closure variables from program by @LeiWang1999 in #1206
  • Fix Dockerfile.cu128 by @createthis in #1208
  • [Enhancement] Improve handling of negative indices for ramp and broadcast node by @LeiWang1999 in #1207
  • [Bugfix] Enhance LetStmt Handling in Pipeline Transform by @LeiWang1999 in #1212
  • [Fix] Fix buffer re-import typo in tilelang.language by @kurisu6912 in #1214
  • [Build] Explicitly add libtvm as a dep of libtilelang by @oraluben in #1215
  • [Utils] Add source export, NVCC-based PTX/SASS dump, logging by @LeiWang1999 in #1216
  • [Bugfix] Improve error handling in LayoutNode InverseWithLevel by @LeiWang1999 in #1220
  • [Enhancement] Improve iterator handling in layout utilities and parallel operations by @LeiWang1999 in #1221
  • [Language] Refactor reduce and support shared memory as its in/out by @LeiWang1999 in #1219
  • [GQA] Add varlen decoding kernel with logits saving by @tzj-fxz in #1223
  • [Enhancement] Add thread count validation for ReduceOp fragment layout inference by @LeiWang1999 in #1225
  • [Refactor] Simplify logic in the CompleteBufferFragment by @LeiWang1999 in #1226
  • [Refactor] Refactor version retrieval logic in tilelang package by @LeiWang1999 in #1227
  • [CPU] Minor fix for cpu backend by @LeiWang1999 in #1230
  • [Feature] Add Release Plan issue template by @LeiWang1999 in #1231
  • [Fix] Fix a typo that causes a wrong T.macro backtrace by @kurisu6912 in #1234
  • [Refactor] Add kernel selection option for GEMM v1 in environment settings by @LeiWang1999 in #1200
  • [Bugfix] Minor fix in builder.py by @LJC00118 in #1235
  • [Language] Add type stubs for tir op by @kurisu6912 in #1239
  • [Enhancement] Support Layout/Fragment Reshape by @LeiWang1999 in #1241
  • [Bugfix] Minor fix for tcgen05 by @LeiWang1999 in #1242
  • RMSNorm epsilon refine in the example by @pengxin99 in #1243
  • [AMD] enable amd ci test & fix bug & fix dockerfile by @Paran0idy in #1244
  • [Refactor] Phaseout legacy loop vectorize dynamic pass by @LeiWang1999 in #1245
  • [Bugfix] Fix fp8 dtype for some cases by @LeiWang1999 in #1246
  • [Minor] Remove git_commit.txt by @SiriusNEO in #1249
  • [Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape by @LeiWang1999 in #1248
  • [Refactor] Update buffer handling in copy and atomic operations by @LeiWang1999 in #1247
  • [Language] Add missing while statement by @kurisu6912 in #1254
  • [BugFix] Add autotune and exp2 for GDN kernel by @tzj-fxz in #1258
  • [BugFix] Refactor attention kernel to handle OOB positions by filling with -inf instead of clearing accumulators. by @Rachmanino in #1222
  • [fix] NVRTC execution backend by @lucifer1004 in #1256
  • [AMD] Update CK for ROCm7 by @Paran0idy in #1262
  • [BugFix] Remove memory_order in atomic constexpr and fix NSA bwd by @KevinZeng08 in #1260
  • [Example] Add GQA decoding kernel with varlen page table by @tzj-fxz in #1265
  • [Refactor] add support for numpy dtype conversion by @kurisu6912 in #1255
  • [EXAMPLE] In the flash attention example keep the max of all blocks seen in scores_max for numerical stability by @vpj in #1148
  • [Docs] Improve Installation Guide by @SiriusNEO in #1270
  • [Enhancement] Keep max score attention across blocks in FlashAttention for better numerical stability by @Rachmanino in #1269
  • [Bugfix] Fix multiple cg definition when using T.sync_grid by @chengyupku in #1272
  • [Minor] Remove from __future__ import annotations for python 3.8 by @oraluben in #1273
  • [BugFix] Adding extra parameters into autotune hashkey by @SiriusNEO in #1274
  • Fix various issues under int64_t static and dynamic shape. by @Elevator14B in #1218
  • Bug fix for Gated Delta Net benchmark script by @learning-chip in #1267
  • [Bugfix] Minor fix for some cases by @LeiWang1999 in #1278
  • [Language] Add shape check in T.view/reshape by @SiriusNEO in #1277
  • [FFI] Use tvm ffi as the default execution backend by @LeiWang1999 in #1259
  • [Bugfix] Supply missing T.print for bool type by @LeiWang1999 in #1279
  • [Fix] Fix memory leak bug by @kurisu6912 in #1281
  • [Enhancement] Enhance CUDA compilation by integrating pass context configuration by @LeiWang1999 in #1283
  • Fix the bug in issue #1266 by @sea-with-sakura in #1284
  • [Language][UX] Nested loop checker in pre-lowering stage by @SiriusNEO in #1288
  • [Compatibility] Support CUDA 11.3 by @LeiWang1999 in #1290
  • [Feat] Add support for using T.Tensor(n * 2 + 1) in function annotation by @kurisu6912 in #1285
  • [Feat] Add missing support to pass reference by T.Var annotation by @kurisu6912 in #1291
  • [Enhancement] Shared Memory Size Can be Dynamic by @LeiWang1999 in #1294
  • [Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 by @kurisu6912 in #1305
  • [Bugfix] Fallback to the old AtomicAdd implementation for legacy architectures by @LeiWang1999 in #1306
  • [Fix] Fix frame scope error in T.macro by @kurisu6912 in #1308
  • [WIP] support more dtypes for tcgen05 by @PannenetsF in #1229
  • Improve memory access safety and T.assume handling by @LJC00118 in #1292
  • [Bugfix] Fix autotune cache by @LeiWang1999 in #1315
  • [Refactor] Backup Analyzer to get the appropriate arith information by @LeiWang1999 in #1311
  • Revert "[WIP] support more dtypes for tcgen05 (#1229)" by @LeiWang1999 in #1323
  • [CI]: Bump actions/checkout from 5 to 6 by @dependabot[bot] in #1319
  • [CI]: Bump pypa/cibuildwheel from 3.2 to 3.3 by @dependabot[bot] in #1318
  • [Installation] Fix building using customized TVM path by @SiriusNEO in #1326
  • [Release] Allow developer with write permission to trigger wheel release by @oraluben in #1322
  • [Feat] Support warp reduce by @Rachmanino in #1316
  • [Enhancement] Support more dtype in T.print by @xwhzz in #1329
  • [BugFix] Use BufferRegion in tl.cumsum to infer buffer shape by @SiriusNEO in #1321
  • [Fix] Fix uint narrowing bug in #1310 by @kurisu6912 in #1320
  • [Refactor] Disable strided buffer load inside tvm (#1301) by @kurisu6912 in #1332
  • [Refactor] Moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils by @LeiWang1999 in #1333
  • [Fix] Fix bug copying from or to local buffer (#1304) by @kurisu6912 in #1324
  • [Language][UX] Semantic check for parallel fragment access by @SiriusNEO in #1338
  • Add unit tests for T.assume by @LJC00118 in #1341
  • [Feat] Extend LegalizeNegativeIndex to support buffer store stmts by @ConvolutedDog in #1339
  • [Refactor] Phaseout vmap for Tile Operators by @LeiWang1999 in #1334
  • [Enhancement] add more dtype and fix mma.ws for fp16 for tcgen05 by @PannenetsF in #1327
  • [Refactor] Enhance CopyNode's IterVar Creation and Range Handling by @LeiWang1999 in #1346
  • [Fix] Fix missing not operator in frontend (#1347) by @kurisu6912 in #1348
  • [Enhancement] Add support for k_pack in gemm_mfma by @Gongen-Ali in #1344
  • Add sparse fine-tuning kernel for deepseek sparse attention to example by @hyx1999 in #1296
  • [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder by @LeiWang1999 in #1352
  • [Refactor] Simplify index sign state handling in LegalizeNegativeIndex by @LeiWang1999 in #1354
  • [Enhancement] Improve error handling and assertion messages across runtime and argument binding by @LeiWang1999 in #1356
  • [Bugfix] Disable floordiv optimization due to integer overflow risk by @LJC00118 in #1355
  • [Bugfix] Fix the jit_kernel issue by @gfvvz in #1357
  • [Bugfix] Bind thread range for fragment inference in Parallel strict layout inference stage. by @LeiWang1999 in #1359
  • [Analysis] Enhance NestedLoopChecker with tile op cases by @SiriusNEO in #1358
  • [Language] support T.gemm_sp_v2 on sm80 and sm89 by @botbw in #1056
  • [Bugfix] Update TIR registration for GemmSPPy to use tile operation by @LeiWang1999 in #1361
  • [Enhancement] Implement dynamic unroll factor in CUDA code generation by @LeiWang1999 in #1360
  • [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1362
  • [Bugfix] Remove debug print in PyStmtFunctionVisitor by @LeiWang1999 in #1363
  • [Debug] Always include line info in NVCC command for improved profiling by @LeiWang1999 in #1364
  • [Enhancement] Minor fix to speed up testing by @LeiWang1999 in #1365
  • [Enhancement] Add DISABLE_CACHE environment variables by @SiriusNEO in #1368
  • [Refactor]: Remove useless include in atomicadd_vectorize.h by @yyttt6 in #1371
  • [Refactor] Generalize fp8 process by @LeiWang1999 in #1372
  • [Layout] Enhance Free Layout Inference by @LeiWang1999 in #1375
  • [Enhancement] Introduce buffer var lca analysis for pass plan buffer allocations by @LeiWang1999 in #1376
  • [Tool] Provide layout visualization tool by @Cunxiao2002 in #1353
  • [Release] Relax constraint of tvm-ffi to compatible version by @oraluben in #1373
  • [Language] Tilelang LazyJIT Experimental Version by @kurisu6912 in #1337
  • [Builder] Enhance variable name binding and scope management by @LeiWang1999 in #1378
  • [Bugfix] make cuda driver api compat with cuda12/13, along with tests by @PannenetsF in #1379
  • [Fix] typo in cuda attr by @PannenetsF in #1380
  • [Language V2] Minor fix for complex annotations by @LeiWang1999 in #1381
  • [Release] Bump Version into 0.1.7 by @LeiWang1999 in #1377
  • [Typing] Enhance compatibility for advanced typing features for Py39 by @LeiWang1999 in #1382

New Contributors

Full Changelog: 0.1.6...v0.1.7