Release v0.1.7 · tile-ai/tilelang

What's Changed

[PATCH] Static libg++ linking fix by @LeiWang1999 in #854
[Analyzer] Enhance ConstIntBoundAnalyzer and IntervalSet with modular set analysis by @LeiWang1999 in #856
[Doc] Optimize the quickstart guide for clarity and not just for CUDA by @LeiWang1999 in #858
[TMA] Bugfix when a shared buffer is both issued with tma store and tma load by @LeiWang1999 in #857
[AMD][MLA] Fix mla autotune for rocm by @LeiWang1999 in #861
[Bugfix] Ensure correct handling for cases where seq_q<seq_kv in flash attention examples by @Rachmanino in #864
[AMD] refactor MatrixCoreIntrinEmitter by @Paran0idy in #860
[Feat] Add fast sine and cosine definitions in CUDA templates by @Rachmanino in #865
[Layout] Support layout forward with multi dimension by @LeiWang1999 in #867
[Autotune][Conv] optimize convolution examples to use autotune by @LeiWang1999 in #866
[Example] Add examples to support efficient attention sink forward process by @Rachmanino in #853
[Parser] Adapt Parser to work with Python 3.8 in some cases by @LeiWang1999 in #869
[Fix] Fix bug 0905: tilelang doesn't vectorize B[i,j] = c[i] + A[i,j] by @kurisu6912 in #798
[Language] Support sequence comparisons by @LeiWang1999 in #872
[Language] Support loop_break primitive by @chengyupku in #873
[Bugfix] Use ExprDeepEqual instead of StructuralEqual when merge consecutive If stmt by @LeiWang1999 in #876
[Language] Support atomic add with ret by @LeiWang1999 in #870
[Cython] Remove an incorrect check by @LJC00118 in #880
Update amd_ci.yml by @Alex4210987 in #881
[FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke by @LeiWang1999 in #875
[Example] Add efficient attention sink backward implementations and tests by @Rachmanino in #877
[Precision] Introduce T.ieee_rsqrt and related high precision op by @LeiWang1999 in #882
[Dist] Provide an option to include commit ID in version by @LeiWang1999 in #884
[Example] Optimize sink attention forward via swizzled layout and report benchmark results by @Rachmanino in #885
[Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop by @LeiWang1999 in #844
[Bugfix][Enhancement] Fix a bug in previous commit and enhance cuda backend by @Hamerlate in #887
[Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst by @Rachmanino in #888
[Layout] Fix plot layout by @Paran0idy in #890
[Example] Add example by @LeiWang1999 in #894
[News] Add announcement of support for Huawei Ascend chips by @xwhzz in #895
[Example] Add sparse mla examples by @LeiWang1999 in #896
[Typo] Fix backend name for Huawei Ascend by @xwhzz in #898
[CI] Legalize math related test by @LeiWang1999 in #899
[Bugfix] Fix flops comp and softmax scale in mla by @Edenzzzz in #900
[Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples by @LeiWang1999 in #913
[CI] optimize CI time for sparse gemm by @botbw in #906
[Enhancement] Include compile flags into the hash key of cached kernels by @Rachmanino in #911
[Bugfix] Fix saving kernel source code where JITKernel.artifact is None by @zjudmd1015 in #921
[CI] Refactor import paths in dequantization examples to use dequantize_utils by @LeiWang1999 in #914
[Example] Add MLA decode ws example by @chengyupku in #928
[CI] Fix documentation runner by adding 'nvidia' tag by @xwhzz in #927
[Layout] Strict annotate completed replicated layout for fragment with constant index by @LeiWang1999 in #929
[Bugfix] Fix tensor memory copy layout by @Hamerlate in #933
[Example] Optimize online_softmax example by @lijinpei in #934
[Example] Add correctness assert into dsa example by @LeiWang1999 in #937
[Enhancement] Enhance and add new GQA backward examples for Hopper by @Rachmanino in #930
[Enhancement] Fix lint to improve grouped GEMM performance with TMA by @Cunxiao2002 in #938
[Example] Introduce split+sum template, and optimize atomic_add performance for bwd examples by @LeiWang1999 in #940
[Example] Disable TMA and enable FastMath for NSA Examples (#941) by @LeiWang1999 in #941
[Example] Revert the atomic/split&sum templates in MHA backward examples by @Rachmanino in #943
[Example] Add sparse mla bwd example for deepseek_v32 by @Zhichenzzz in #919
[Profiler] Adds CUPTI profiler support by @Cunxiao2002 in #936
[Enhancement] Support Copy for Buffer Load with scalar indices by @LeiWang1999 in #946
[Code Style] Refine nvrtc compile related check style by @BBuf in #945
[Backend] Add metal backend by @oraluben in #799
[CI] enable dependabot for GHA workflows by @XuehaiPan in #950
Modify the SM architecture number to support Thor’s sm110. by @iloveai8086 in #957
[CI] auto-cancel in-progress PR CI when new commits are pushed by @XuehaiPan in #956
[bug] fix type object is not subscriptable in py38 by @BBuf in #959
[Bugfix][Doc] Add astroid version constraint to requirements.txt by @xwhzz in #958
[CI]: Bump actions/setup-python from 2 to 6 by @dependabot[bot] in #951
[CI]: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #952
[CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #954
[CI]: Bump actions/checkout from 2 to 5 by @dependabot[bot] in #953
[TileOp] Implement WGMMA for T.gemm_v2 by @LeiWang1999 in #813
[Docs] add CODE_OF_CONDUCT.md by @XuehaiPan in #965
[Example] Add support for bfloat16 and user-defined sm_scale in attention sink examples by @Rachmanino in #924
[Bugfix] Do not force inline let stmt by @LeiWang1999 in #947
[CI] add pre-commit integration by @XuehaiPan in #955
[Doc] Install docs add docker install method by @BBuf in #961
[Bugfix] Fix dummy kernel compilation by @SiriusNEO in #962
[CI][Refactor] Refactor non-test CI workflow files by @XuehaiPan in #971
[TileOp] Implement CumSum1D by @LeiWang1999 in #978
[Language] Enhance T.alloc_var for AugAssign and AnnAssign by @LeiWang1999 in #979
[Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend by @LeiWang1999 in #977
[Typo] Remove debug print by @LeiWang1999 in #980
[Bugfix] Use access_ptr("r") instead of access_ptr("w") for correct pipeline analysis by @LeiWang1999 in #983
[Feature][Example] Support TMA reduce operation and update GQA bwd example by @chengyupku in #969
[Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) by @Degeneracy-Evil in #976
[BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures by @tzj-fxz in #984
[Bugfix] Fallback torch.accelerator.synchronize() to torch.cuda.synchronize() by @yyttt6 in #987
[Bugfix]:Fix atomicadd auto vectorize identify var error by @yyttt6 in #883
[CI] Speed up sparse tensor core test via vectorized generating sparse data by @LeiWang1999 in #1009
[Build] Migrate to scikit-build-core by @oraluben in #939
[CI] Removes redundant environment variable by @Cunxiao2002 in #1020
[Transform] Migrate LowerIntrin from tvm into tilelang by @LeiWang1999 in #999
[Lint] Prefer American English spelling by @XuehaiPan in #1022
[Build] Prefer libs from local build dir by @oraluben in #1027
[Language] Support Consequential assignments like 'a = b = c = 1' by @LeiWang1999 in #992
[CI] Removes debug print statements from the example. by @Cunxiao2002 in #1030
[Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation by @Rachmanino in #1023
[Bugfix] Recover code for flexible parallel by @LeiWang1999 in #1032
[CI] Disable buggy(maybe) warp specialized kernel ci test for H20 by @LeiWang1999 in #1033
[TIR] Revert some changes of Pass LowerIntrin by @LeiWang1999 in #1035
[Env] Optimize the mechanism for locating TL_LIBS by @LeiWang1999 in #1038
[CUDA] Add pack functions for FP8 types by @LJC00118 in #967
[Language] Expose T.get_warp_idx_sync and T.shuffle_elect for efficient thread election by @LeiWang1999 in #989
[AMD] fix bug&add amd fp8 examples by @Alex4210987 in #966
[CI][Refactor] Merge test CI workflow files into one by @XuehaiPan in #973
[BugFix] Phaseout dependency of Triton in sink examples to make CI happy by @Rachmanino in #1045
[Refactor] Use has_simt_copy to decide whether to insert set_max_nreg by @chengyupku in #982
[Feature]: Add test for atomicadd auto vectorize and remove useless code by @yyttt6 in #1019
Allow mma gemm for all cuda arch by @oraluben in #1047
[Bugfix] Improves compatibility when checking for MPS availability in different PyTorch builds. by @LeiWang1999 in #1051
[CI] Fix ROCm CI by @XuehaiPan in #1043
[Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper by @Rachmanino in #1024
Automatically initialize submodule if missing by @LeiWang1999 in #1052
[Enhancement] Remove constraint requiring last dimension stride to be 1 by @LJC00118 in #1040
[CI] Disable autofix for pre-commit CI by @LeiWang1999 in #1053
[Enhancement] Improve CUDA compiler detection in CMake by @LJC00118 in #1054
[Enhancement] Introduce a workaround for layout inference for local buffer store by @LeiWang1999 in #1055
[Refactor] Refactor Pass LegalizeSafeMemoryAccess to support recursive load/store rewrite by @SiriusNEO in #1050
Making version parser more robust against missing or unavailable metadata by @LeiWang1999 in #1061
[DOC] Add document for develop with PYTHONPATH by @LeiWang1999 in #1062
[CI]:Reduce test shapes to avoid OOM errors during CI. by @yyttt6 in #1060
[Benchmark] Add H800 SXM Benchmark results by @LeiWang1999 in #1063
[Misc] Add GitHub issue templates by @XuehaiPan in #1057
[Refactor][Example] Update linear attention examples and add tests by @Rachmanino in #1010
[Enhancement] Deprecate split&sum in attn bwd examples on Hopper by @Rachmanino in #1065
[Benchmark] Add matmul FP16 benchmark results by @LeiWang1999 in #1067
[CI]: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #1070
[Example] Update GQA varlen fwd and MHA varlen fwd by @chengyupku in #1071
[Parallel] Support T.Parallel with dynamic extents by @LeiWang1999 in #990
[Layout] Utilizing IsEqual instead of StructuralEqual by @LeiWang1999 in #1073
[Cache] raise errors for tilelang.clear_cache() by @LeiWang1999 in #1077
[Feature] Support Reduce operators for bitwise and/or/xor by @tzj-fxz in #1074
[Autotune] Add autotune coverage for symbolic M and normalize cache key by @LeiWang1999 in #1075
[Language] Recommend using T.dynamic instead of T.symbolic by @LeiWang1999 in #1076
[Language] Efficient T.reduce_ with shared memory input/output by @LeiWang1999 in #1080
[Bugfix] Fix missing reg alloc in custom warp specialization by @chengyupku in #1084
[Enhancement] Update async intrinsic handling in inject_fence_proxy by @Rachmanino in #1068
[Feature] Add GQA backward kernel with varlen input by @tzj-fxz in #1082
[BugFix] Add memory order argument for non-vectorized atomic add by @tzj-fxz in #1081
[Refactor] Rename cython output to tilelang_cython and relocate its path by @LeiWang1999 in #1086
[Target] Enhance target selection helpers and documentation by @LeiWang1999 in #1085
[Cleanup] Remove tilelang.disable_cache() calls from examples and tests by @Rachmanino in #1088
[PassConfig] Introduce PassConfig TL_STORAGE_REWRITE_DETECT_INPLACE by @LeiWang1999 in #1089
[Language] Support tilelang alloc_var(dtype, init=x) by @LeiWang1999 in #1092
[Bugfix] Fix missing host cuTensorMapEncodeIm2col call by @chengyupku in #1094
[GQA] Add regional atomic add to slightly boost performance by @tzj-fxz in #1093
[Example] Add block level high performance gemv example by @LeiWang1999 in #1097
[Refactor] Optimize debug message for parallel inference by @LeiWang1999 in #1096
[CI][Lint] Retire format.sh and add clang-tidy to GHA workflow by @XuehaiPan in #1044
[Refactor] Use forceinline in ldmatrix and update mamba scan kernel by @chengyupku in #1104
[Maint] Update uncommitted change detection command in format.sh by @XuehaiPan in #1102
[Benchmark] Add Mamba2_chunk_scan benchmark by @chengyupku in #1109
[Benchmark] Update Mamba2_chunk_scan benchmark by @chengyupku in #1110
[Lint] Enable pyupgrade linter in ruff by @oraluben in #963
[Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic by @LeiWang1999 in #1111
[Feature] Enhance vectorized conversion support in CUDA codegen by @Rachmanino in #1095
[Feature] Support None type as input for T.ptr and T.Tensor by @xwhzz in #1114
[Bugfix] Resolve mixed stride dtype issue (inconsistent int32/int64 values) by @LeiWang1999 in #1119
[Feature] Add memory_order PTX for vectorized atomic add by @tzj-fxz in #1112
[CI]: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1128
[CI]: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1127
[Enhancement] Add missing fence_barrier_init primitive after mbarrier init by @chengyupku in #1121
[Feature]:Add device assert by @yyttt6 in #1116
[Build][CI] Build and test SDist in release CI by @XuehaiPan in #1098
[Benchmark] Update triton and helion baselines in mamba-chunk-scan by @chengyupku in #1131
Add int2 and longlong4 pack functions by @LJC00118 in #1129
[BugFix] Add memory order and testing script for split version GQA bwd kernel by @tzj-fxz in #1100
[Bugfix] Correctly construct the argument list for atomic add based on the vector size by @LeiWang1999 in #1137
[AMD] Support T.gemm_v2 for AMD Backend by @Paran0idy in #1136
[BugFix] alloc_var init failed to handle complex expression by @kurisu6912 in #1144
[Refactor] Remove amd gemm_v2 tests by @LeiWang1999 in #1149
[BugFix] Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values by @Rachmanino in #1143
[Bugfix] Implement classic arena algorithm for shmem merge and WAW conflict detection by @LeiWang1999 in #1146
[CI] allow dirty workspace for format.sh and introduce loop carry thread sync unit test by @LeiWang1999 in #1153
[CI] use Python urllib to download file instead of Wget by @XuehaiPan in #1154
[BugFix] Correct direct copy from bf16 to fp8 by @Cunxiao2002 in #1090
[Refactor]:Move device_assert from extern_call to intrin_call by @yyttt6 in #1134
[Enhancement] Enhance Cast operations Vectorization by @LJC00118 in #1156
[Bugfix] Enhance LetStmt handling in Vectorize Loop Pass by @LeiWang1999 in #1159
[Release] Bump version to v0.1.6.post2 by @LeiWang1999 in #1160
[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi by @LeiWang1999 in #1108
[Bugfix] Enable code lowering with producer‑copy‑only program by @LeiWang1999 in #1168
[Bugfix] Support 16bits shfl_sync by @LeiWang1999 in #1169
[Testing] Move TMA 1D and test for its functionality by @tzj-fxz in #1167
[Refactor]: Change the params in pytest to avoid oom error during ci by @yyttt6 in #1170
[Bugfix] Fix tvm import path for editable build by @LeiWang1999 in #1172
[Language] Expose T.warpgroup_fence_operand for nvcc code motion by @LeiWang1999 in #986
[Language] Add Correctness and performance check scripts for V2 by @LeiWang1999 in #1174
[Bugfix] Legalize Datatype for mma intrinsic codegen by @LeiWang1999 in #1179
[CI]: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1177
[CI]: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1178
[Language] Initial version of tilelang frontend v2 by @kurisu6912 in #1120
[Fix] fix type incompatible error in #1115 by @kurisu6912 in #1180
[CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1183
[Fix] Remove unsupported type params by @kurisu6912 in #1186
[Feature] Enhance fill operation to support various buffer types by @LeiWang1999 in #1189
[Refactor] Improve Python3.9 compatibility for ParamSpec and Self by @LeiWang1999 in #1190
[Feat] Add swap like grammar in tuple assignment by @kurisu6912 in #1185
[Release] Unify local build scripts to use cibuildwheel and reduce size of sdist by @oraluben in #1171
[Language] Support n>256 for v2 by @LeiWang1999 in #1182
[GQA] Use TMA in GQA bwd kernel to boost performance by @tzj-fxz in #1176
[Example] Update GQA varlen fwd by @chengyupku in #1173
[Refactor] Dynamic registration of FP8 data type for compatibility with older PyTorch versions by @LeiWang1999 in #1197
[Feature] Add tl.infinity operator for infinity handling of bfloat16 by @Rachmanino in #1175
[SM70] Refactor and minor fix for SM70 by @LeiWang1999 in #1195
[CI] Enable ccache for CIBW on Linux by @oraluben in #1184
[Feat] Add support for T.serial with step and negative step by @kurisu6912 in #1188
[Feat] Add A Pass to Handle Negative Index by @kurisu6912 in #1192
Fix type errors in reduce.h by @LJC00118 in #1204
[Bugfix] Improves the accuracy of dependency analysis in the storage access by @LeiWang1999 in #1205
[Bugfix][Language V2] Capture closure variables from program by @LeiWang1999 in #1206
Fix Dockerfile.cu128 by @createthis in #1208
[Enhancement] Improve handling of negative indices for ramp and broadcast node by @LeiWang1999 in #1207
[Bugfix] Enhance LetStmt Handling in Pipeline Transform by @LeiWang1999 in #1212
[Fix] Fix buffer re-import typo in tilelang.language by @kurisu6912 in #1214
[Build] Explicitly add libtvm as a dep of libtilelang by @oraluben in #1215
[Utils] Add source export, NVCC-based PTX/SASS dump, logging by @LeiWang1999 in #1216
[Bugfix] Improve error handling in LayoutNode InverseWithLevel by @LeiWang1999 in #1220
[Enhancement] Improve iterator handling in layout utilities and parallel operations by @LeiWang1999 in #1221
[Language] Refactor reduce and support shared memory as its in/out by @LeiWang1999 in #1219
[GQA] Add varlen decoding kernel with logits saving by @tzj-fxz in #1223
[Enhancement] Add thread count validation for ReduceOp fragment layout inference by @LeiWang1999 in #1225
[Refactor] Simplify logic in the CompleteBufferFragment by @LeiWang1999 in #1226
[Refactor] Refactor version retrieval logic in tilelang package by @LeiWang1999 in #1227
[CPU] Minor fix for cpu backend by @LeiWang1999 in #1230
[Feature] Add Release Plan issue template by @LeiWang1999 in #1231
[Fix] Fix a type that make wrong T.macro backtrace by @kurisu6912 in #1234
[Refactor] Add kernel selection option for GEMM v1 in environment settings by @LeiWang1999 in #1200
[Bugfix] Minor fix in builder.py by @LJC00118 in #1235
[Language] Add type stubs for tir op by @kurisu6912 in #1239
[Enhancement] Support Layout/Fragment Reshape by @LeiWang1999 in #1241
[Bugfix] Minor fix for tcgen05 by @LeiWang1999 in #1242
RMSNorm epsilon refine in the example by @pengxin99 in #1243
[AMD] enable amd ci test & fix bug & fix dockerfile by @Paran0idy in #1244
[Refactor] Phaseout legacy loop vectorize dynamic pass by @LeiWang1999 in #1245
[Bugfix] Fix fp8 dtype for some cases by @LeiWang1999 in #1246
[Minor] Remove git_commit.txt by @SiriusNEO in #1249
[Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape by @LeiWang1999 in #1248
[Refactor] Update buffer handling in copy and atomic operations by @LeiWang1999 in #1247
[Language] Add missing while statement by @kurisu6912 in #1254
[BugFix] Add autotune and exp2 for GDN kernel by @tzj-fxz in #1258
[BugFix] Refactor attention kernel to handle OOB positions by filling with -inf instead of clearing accumulators. by @Rachmanino in #1222
[fix] NVRTC execution backend by @lucifer1004 in #1256
[AMD] Update CK for ROCm7 by @Paran0idy in #1262
[BugFix] Remove memory_order in atomic constexpr and fix NSA bwd by @KevinZeng08 in #1260
[Example] Add GQA decoding kernel with varlen page table by @tzj-fxz in #1265
[Refactor] add support for numpy dtype conversion by @kurisu6912 in #1255
[EXAMPLE] In the flash attention example keep the max of all blocks seen in scores_max for numerical stability by @vpj in #1148
[Docs] Improve Installation Guide by @SiriusNEO in #1270
[Enhancement] Keep max score attention across blocks in FlashAttention for better numerical stability by @Rachmanino in #1269
[Bugfix] Fix multiple cg definition when using T.sync_grid by @chengyupku in #1272
[Minor] Remove from __future__ import annotations for python 3.8 by @oraluben in #1273
[BugFix] Adding extra parameters into autotune hashkey by @SiriusNEO in #1274
Fix various issues under int64_t static and dynamic shape. by @Elevator14B in #1218
Bug fix for Gated Delta Net benchmark script by @learning-chip in #1267
[Bugfix] Minor fix for some cases by @LeiWang1999 in #1278
[Language] Add shape check in T.view/reshape by @SiriusNEO in #1277
[FFI] Use tvm ffi as the default execution backend by @LeiWang1999 in #1259
[Bugfix] Supply missing T.print for bool type by @LeiWang1999 in #1279
[Fix] Fix memory leak bug by @kurisu6912 in #1281
[Enhancement] Enhance CUDA compilation by integrating pass context configuration by @LeiWang1999 in #1283
Fix the bug in issue #1266 by @sea-with-sakura in #1284
[Language][UX] Nested loop checker in pre-lowering stage by @SiriusNEO in #1288
[Compatibility] Support CUDA 11.3 by @LeiWang1999 in #1290
[Feat] Add support for using T.Tensor(n * 2 + 1) in function annotation by @kurisu6912 in #1285
[Feat] Add missing support to pass reference by T.Var annotation by @kurisu6912 in #1291
[Enhancement] Shared Memory Size Can be Dynamic by @LeiWang1999 in #1294
[Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 by @kurisu6912 in #1305
[Bugfix] Fallback to the old AtomicAdd implementation for legacy architectures by @LeiWang1999 in #1306
[Fix] Fix frame scope error in T.macro by @kurisu6912 in #1308
[WIP] support more dtypes for tcgen05 by @PannenetsF in #1229
Improve memory access safety and T.assume handling by @LJC00118 in #1292
[Bugfix] Fix autotune cache by @LeiWang1999 in #1315
[Refactor] Backup Analyzer to get the appropriate arith information by @LeiWang1999 in #1311
Revert "[WIP] support more dtypes for tcgen05 (#1229)" by @LeiWang1999 in #1323
[CI]: Bump actions/checkout from 5 to 6 by @dependabot[bot] in #1319
[CI]: Bump pypa/cibuildwheel from 3.2 to 3.3 by @dependabot[bot] in #1318
[Installation] Fix building using customized TVM path by @SiriusNEO in #1326
[Release] Allow developer with write permission to trigger wheel release by @oraluben in #1322
[Feat] Support warp reduce by @Rachmanino in #1316
[Enhancement] Support more dtype in T.print by @xwhzz in #1329
[BugFix] Use BufferRegion in tl.cumsum to infer buffer shape by @SiriusNEO in #1321
[Fix] Fix uint narrowing bug in #1310 by @kurisu6912 in #1320
[Refactor] Disable strided buffer load inside tvm (#1301) by @kurisu6912 in #1332
[Refactor] Moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils by @LeiWang1999 in #1333
[Fix] Fix bug copying from or to local buffer (#1304) by @kurisu6912 in #1324
[Language][UX] Semantic check for parallel fragment access by @SiriusNEO in #1338
Add unit tests for T.assume by @LJC00118 in #1341
[Feat] Extend LegalizeNegativeIndex to support buffer store stmts by @ConvolutedDog in #1339
[Refactor] Phaseout vmap for Tile Operators by @LeiWang1999 in #1334
[Enhancement] add more dtype and fix mma.ws for fp16 for tcgen05 by @PannenetsF in #1327
[Refactor] Enhance CopyNode's IterVar Creation and Range Handling by @LeiWang1999 in #1346
[Fix] Fix missing not operator in frontend (#1347) by @kurisu6912 in #1348
[Enhancement] Add support for k_pack in gemm_mfma by @Gongen-Ali in #1344
Add sparse fine-tuning kernel for deepseek sparse attention to example by @hyx1999 in #1296
[Refactor] Improve assertion handling in CodeGenCHost and ArgBinder by @LeiWang1999 in #1352
[Refactor] Simplify index sign state handling in LegalizeNegativeIndex by @LeiWang1999 in #1354
[Enhancement] Improve error handling and assertion messages across runtime and argument binding by @LeiWang1999 in #1356
[Bugfix] Disable floordiv optimization due to integer overflow risk by @LJC00118 in #1355
[Bugfix] Fix the jit_kernel issue by @gfvvz in #1357
[Bugfix] Bind thread range for fragment inference in Parallel strict layout inference stage. by @LeiWang1999 in #1359
[Analysis] Enhance NestedLoopChecker with tile op cases by @SiriusNEO in #1358
[Language] support T.gemm_sp_v2 on sm80 and sm89 by @botbw in #1056
[Bugfix] Update TIR registration for GemmSPPy to use tile operation by @LeiWang1999 in #1361
[Enhancement] Implement dynamic unroll factor in CUDA code generation by @LeiWang1999 in #1360
[CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1362
[Bugfix] Remove debug print in PyStmtFunctionVisitor by @LeiWang1999 in #1363
[Debug] Always include line info in NVCC command for improved profiling by @LeiWang1999 in #1364
[Enhancement] Minor fix to speed up testing by @LeiWang1999 in #1365
[Enhancement] Add DISABLE_CACHE environment variables by @SiriusNEO in #1368
[Refactor]: Remove useless include in atomicadd_vectorize.h by @yyttt6 in #1371
[Refactor] Generalize fp8 process by @LeiWang1999 in #1372
[Layout] Enhance Free Layout Inference by @LeiWang1999 in #1375
[Enhancement] Introduce buffer var lca analysis for pass plan buffer allocations by @LeiWang1999 in #1376
[Tool] Provide layout visualization tool by @Cunxiao2002 in #1353
[Release] Relax constraint of tvm-ffi to compatible version by @oraluben in #1373
[Language] Tilelang LazyJIT Experimental Version by @kurisu6912 in #1337
[Builder] Enhance variable name binding and scope management by @LeiWang1999 in #1378
[Bugfix] make cuda driver api compat with cuda12/13, along with tests by @PannenetsF in #1379
[Fix] typo in cuda attr by @PannenetsF in #1380
[Language V2] Minor fix for complex annotations by @LeiWang1999 in #1381
[Release] Bump Version into 0.1.7 by @LeiWang1999 in #1377
[Typing] Enhance compatibility for advanced typing features for Py39 by @LeiWang1999 in #1382

New Contributors

@LJC00118 made their first contribution in #880
@Edenzzzz made their first contribution in #900
@zjudmd1015 made their first contribution in #921
@lijinpei made their first contribution in #934
@Zhichenzzz made their first contribution in #919
@BBuf made their first contribution in #945
@XuehaiPan made their first contribution in #950
@iloveai8086 made their first contribution in #957
@Degeneracy-Evil made their first contribution in #976
@pre-commit-ci[bot] made their first contribution in #1183
@createthis made their first contribution in #1208
@pengxin99 made their first contribution in #1243
@KevinZeng08 made their first contribution in #1260
@vpj made their first contribution in #1148
@Elevator14B made their first contribution in #1218
@learning-chip made their first contribution in #1267
@sea-with-sakura made their first contribution in #1284
@PannenetsF made their first contribution in #1229
@ConvolutedDog made their first contribution in #1339
@Gongen-Ali made their first contribution in #1344

Full Changelog: 0.1.6...v0.1.7
