v0.1.7
What's Changed
- [PATCH] Static libg++ linking fix by @LeiWang1999 in #854
- [Analyzer] Enhance ConstIntBoundAnalyzer and IntervalSet with modular set analysis by @LeiWang1999 in #856
- [Doc] Optimize the quickstart guide for clarity and not just for CUDA by @LeiWang1999 in #858
- [TMA] Bugfix for the case where a shared buffer is issued with both a TMA store and a TMA load by @LeiWang1999 in #857
- [AMD][MLA] Fix mla autotune for rocm by @LeiWang1999 in #861
- [Bugfix] Ensure correct handling for cases where `seq_q < seq_kv` in flash attention examples by @Rachmanino in #864
- [AMD] refactor MatrixCoreIntrinEmitter by @Paran0idy in #860
- [Feat] Add fast sine and cosine definitions in CUDA templates by @Rachmanino in #865
- [Layout] Support layout forward with multi dimension by @LeiWang1999 in #867
- [Autotune][Conv] optimize convolution examples to use autotune by @LeiWang1999 in #866
- [Example] Add examples to support efficient attention sink forward process by @Rachmanino in #853
- [Parser] Adapt Parser to work with Python 3.8 in some cases by @LeiWang1999 in #869
- [Fix] Fix bug 0905: tilelang doesn't vectorize `B[i,j] = c[i] + A[i,j]` by @kurisu6912 in #798
- [Language] Support sequence comparisons by @LeiWang1999 in #872
- [Language] Support loop_break primitive by @chengyupku in #873
- [Bugfix] Use `ExprDeepEqual` instead of `StructuralEqual` when merging consecutive If stmts by @LeiWang1999 in #876
- [Language] Support atomic add with ret by @LeiWang1999 in #870
- [Cython] Remove an incorrect check by @LJC00118 in #880
- Update amd_ci.yml by @Alex4210987 in #881
- [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke by @LeiWang1999 in #875
- [Example] Add efficient attention sink backward implementations and tests by @Rachmanino in #877
- [Precision] Introduce `T.ieee_rsqrt` and related high-precision ops by @LeiWang1999 in #882
- [Dist] Provide an option to include commit ID in version by @LeiWang1999 in #884
- [Example] Optimize sink attention forward via swizzled layout and report benchmark results by @Rachmanino in #885
- [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop by @LeiWang1999 in #844
- [Bugfix][Enhancement] Fix a bug in previous commit and enhance cuda backend by @Hamerlate in #887
- [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst by @Rachmanino in #888
- [Layout] Fix plot layout by @Paran0idy in #890
- [Example] Add example by @LeiWang1999 in #894
- [News] Add announcement of support for Huawei Ascend chips by @xwhzz in #895
- [Example] Add sparse mla examples by @LeiWang1999 in #896
- [Typo] Fix backend name for Huawei Ascend by @xwhzz in #898
- [CI] Legalize math related test by @LeiWang1999 in #899
- [Bugfix] Fix FLOPs computation and softmax scale in MLA by @Edenzzzz in #900
- [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples by @LeiWang1999 in #913
- [CI] optimize CI time for sparse gemm by @botbw in #906
- [Enhancement] Include compile flags into the hash key of cached kernels by @Rachmanino in #911
- [Bugfix] Fix saving kernel source code where JITKernel.artifact is None by @zjudmd1015 in #921
- [CI] Refactor import paths in dequantization examples to use dequantize_utils by @LeiWang1999 in #914
- [Example] Add MLA decode ws example by @chengyupku in #928
- [CI] Fix documentation runner by adding 'nvidia' tag by @xwhzz in #927
- [Layout] Strict annotate completed replicated layout for fragment with constant index by @LeiWang1999 in #929
- [Bugfix] Fix tensor memory copy layout by @Hamerlate in #933
- [Example] Optimize online_softmax example by @lijinpei in #934
- [Example] Add correctness assert into dsa example by @LeiWang1999 in #937
- [Enhancement] Enhance and add new GQA backward examples for Hopper by @Rachmanino in #930
- [Enhancement] Fix lint to improve grouped GEMM performance with TMA by @Cunxiao2002 in #938
- [Example] Introduce split+sum template, and optimize `atomic_add` performance for bwd examples by @LeiWang1999 in #940
- [Example] Disable TMA and enable FastMath for NSA Examples by @LeiWang1999 in #941
- [Example] Revert the atomic/split&sum templates in MHA backward examples by @Rachmanino in #943
- [Example] Add sparse mla bwd example for deepseek_v32 by @Zhichenzzz in #919
- [Profiler] Adds CUPTI profiler support by @Cunxiao2002 in #936
- [Enhancement] Support Copy for Buffer Load with scalar indices by @LeiWang1999 in #946
- [Code Style] Refine nvrtc compile related check style by @BBuf in #945
- [Backend] Add metal backend by @oraluben in #799
- [CI] enable dependabot for GHA workflows by @XuehaiPan in #950
- Modify the SM architecture number to support Thor’s sm110. by @iloveai8086 in #957
- [CI] auto-cancel in-progress PR CI when new commits are pushed by @XuehaiPan in #956
- [bug] fix type object is not subscriptable in py38 by @BBuf in #959
- [Bugfix][Doc] Add astroid version constraint to requirements.txt by @xwhzz in #958
- [CI]: Bump actions/setup-python from 2 to 6 by @dependabot[bot] in #951
- [CI]: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #952
- [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #954
- [CI]: Bump actions/checkout from 2 to 5 by @dependabot[bot] in #953
- [TileOp] Implement WGMMA for T.gemm_v2 by @LeiWang1999 in #813
- [Docs] add CODE_OF_CONDUCT.md by @XuehaiPan in #965
- [Example] Add support for `bfloat16` and user-defined `sm_scale` in attention sink examples by @Rachmanino in #924
- [Bugfix] Do not force inline let stmt by @LeiWang1999 in #947
- [CI] add `pre-commit` integration by @XuehaiPan in #955
- [Doc] Add a Docker install method to the installation docs by @BBuf in #961
- [Bugfix] Fix dummy kernel compilation by @SiriusNEO in #962
- [CI][Refactor] Refactor non-test CI workflow files by @XuehaiPan in #971
- [TileOp] Implement `CumSum1D` by @LeiWang1999 in #978
- [Language] Enhance `T.alloc_var` for AugAssign and AnnAssign by @LeiWang1999 in #979
- [Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend by @LeiWang1999 in #977
- [Typo] Remove debug print by @LeiWang1999 in #980
- [Bugfix] Use `access_ptr("r")` instead of `access_ptr("w")` for correct pipeline analysis by @LeiWang1999 in #983
- [Feature][Example] Support TMA reduce operation and update GQA bwd example by @chengyupku in #969
- [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) by @Degeneracy-Evil in #976
- [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures by @tzj-fxz in #984
- [Bugfix] Fall back `torch.accelerator.synchronize()` to `torch.cuda.synchronize()` by @yyttt6 in #987 (see the sketch after this list)
- [Bugfix]: Fix atomicadd auto-vectorize variable identification error by @yyttt6 in #883
- [CI] Speed up sparse tensor core test via vectorized generating sparse data by @LeiWang1999 in #1009
- [Build] Migrate to scikit-build-core by @oraluben in #939
- [CI] Removes redundant environment variable by @Cunxiao2002 in #1020
- [Transform] Migrate `LowerIntrin` from tvm into tilelang by @LeiWang1999 in #999
- [Lint] Prefer American English spelling by @XuehaiPan in #1022
- [Build] Prefer libs from local build dir by @oraluben in #1027
- [Language] Support chained assignments like `a = b = c = 1` by @LeiWang1999 in #992
- [CI] Removes debug print statements from the example. by @Cunxiao2002 in #1030
- [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation by @Rachmanino in #1023
- [Bugfix] Recover code for flexible parallel by @LeiWang1999 in #1032
- [CI] Disable a possibly buggy warp-specialized kernel CI test for H20 by @LeiWang1999 in #1033
- [TIR] Revert some changes of Pass `LowerIntrin` by @LeiWang1999 in #1035
- [Env] Optimize the mechanism for locating `TL_LIBS` by @LeiWang1999 in #1038
- [CUDA] Add pack functions for FP8 types by @LJC00118 in #967
- [Language] Expose `T.get_warp_idx_sync` and `T.shuffle_elect` for efficient thread election by @LeiWang1999 in #989
- [AMD] Fix bug and add AMD fp8 examples by @Alex4210987 in #966
- [CI][Refactor] Merge test CI workflow files into one by @XuehaiPan in #973
- [BugFix] Phase out the Triton dependency in sink examples to make CI happy by @Rachmanino in #1045
- [Refactor] Use `has_simt_copy` to decide whether to insert `set_max_nreg` by @chengyupku in #982
- [Feature]: Add test for atomicadd auto vectorize and remove useless code by @yyttt6 in #1019
- Allow mma gemm for all cuda arch by @oraluben in #1047
- [Bugfix] Improves compatibility when checking for MPS availability in different PyTorch builds. by @LeiWang1999 in #1051
- [CI] Fix ROCm CI by @XuehaiPan in #1043
- [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper by @Rachmanino in #1024
- Automatically initialize submodule if missing by @LeiWang1999 in #1052
- [Enhancement] Remove constraint requiring last dimension stride to be 1 by @LJC00118 in #1040
- [CI] Disable autofix for pre-commit CI by @LeiWang1999 in #1053
- [Enhancement] Improve CUDA compiler detection in CMake by @LJC00118 in #1054
- [Enhancement] Introduce a workaround for layout inference for local buffer store by @LeiWang1999 in #1055
- [Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite by @SiriusNEO in #1050
- Make the version parser more robust against missing or unavailable metadata by @LeiWang1999 in #1061 (see the sketch after this list)
- [DOC] Add document for develop with PYTHONPATH by @LeiWang1999 in #1062
- [CI]: Reduce test shapes to avoid OOM errors during CI by @yyttt6 in #1060
- [Benchmark] Add H800 SXM Benchmark results by @LeiWang1999 in #1063
- [Misc] Add GitHub issue templates by @XuehaiPan in #1057
- [Refactor][Example] Update linear attention examples and add tests by @Rachmanino in #1010
- [Enhancement] Deprecate split&sum in attn bwd examples on Hopper by @Rachmanino in #1065
- [Benchmark] Add matmul FP16 benchmark results by @LeiWang1999 in #1067
- [CI]: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #1070
- [Example] Update GQA varlen fwd and MHA varlen fwd by @chengyupku in #1071
- [Parallel] Support `T.Parallel` with dynamic extents by @LeiWang1999 in #990
- [Layout] Utilize IsEqual instead of StructuralEqual by @LeiWang1999 in #1073
- [Cache] Raise errors for `tilelang.clear_cache()` by @LeiWang1999 in #1077
- [Feature] Support Reduce operators for bitwise and/or/xor by @tzj-fxz in #1074
- [Autotune] Add autotune coverage for symbolic M and normalize cache key by @LeiWang1999 in #1075
- [Language] Recommend using `T.dynamic` instead of `T.symbolic` by @LeiWang1999 in #1076
- [Language] Efficient `T.reduce_` with shared memory input/output by @LeiWang1999 in #1080
- [Bugfix] Fix missing reg alloc in custom warp specialization by @chengyupku in #1084
- [Enhancement] Update async intrinsic handling in inject_fence_proxy by @Rachmanino in #1068
- [Feature] Add GQA backward kernel with varlen input by @tzj-fxz in #1082
- [BugFix] Add memory order argument for non-vectorized atomic add by @tzj-fxz in #1081
- [Refactor] Rename cython output to `tilelang_cython` and relocate its path by @LeiWang1999 in #1086
- [Target] Enhance target selection helpers and documentation by @LeiWang1999 in #1085
- [Cleanup] Remove `tilelang.disable_cache()` calls from examples and tests by @Rachmanino in #1088
- [PassConfig] Introduce PassConfig `TL_STORAGE_REWRITE_DETECT_INPLACE` by @LeiWang1999 in #1089
- [Language] Support tilelang `alloc_var(dtype, init=x)` by @LeiWang1999 in #1092
- [Bugfix] Fix missing host `cuTensorMapEncodeIm2col` call by @chengyupku in #1094
- [GQA] Add regional atomic add to slightly boost performance by @tzj-fxz in #1093
- [Example] Add block level high performance gemv example by @LeiWang1999 in #1097
- [Refactor] Optimize debug message for parallel inference by @LeiWang1999 in #1096
- [CI][Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow by @XuehaiPan in #1044
- [Refactor] Use forceinline in `ldmatrix` and update mamba scan kernel by @chengyupku in #1104
- [Maint] Update uncommitted change detection command in `format.sh` by @XuehaiPan in #1102
- [Benchmark] Add Mamba2_chunk_scan benchmark by @chengyupku in #1109
- [Benchmark] Update Mamba2_chunk_scan benchmark by @chengyupku in #1110
- [Lint] Enable pyupgrade linter in ruff by @oraluben in #963
- [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic by @LeiWang1999 in #1111
- [Feature] Enhance vectorized conversion support in CUDA codegen by @Rachmanino in #1095
- [Feature] Support None type as input for `T.ptr` and `T.Tensor` by @xwhzz in #1114
- [Bugfix] Resolve mixed stride dtype issue (inconsistent int32/int64 values) by @LeiWang1999 in #1119
- [Feature] Add memory_order PTX for vectorized atomic add by @tzj-fxz in #1112
- [CI]: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1128
- [CI]: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1127
- [Enhancement] Add missing `fence_barrier_init` primitive after mbarrier init by @chengyupku in #1121
- [Feature]: Add device assert by @yyttt6 in #1116
- [Build][CI] Build and test SDist in release CI by @XuehaiPan in #1098
- [Benchmark] Update triton and helion baselines in mamba-chunk-scan by @chengyupku in #1131
- Add int2 and longlong4 pack functions by @LJC00118 in #1129
- [BugFix] Add memory order and testing script for split version GQA bwd kernel by @tzj-fxz in #1100
- [Bugfix] Correctly construct the argument list for atomic add based on the vector size by @LeiWang1999 in #1137
- [AMD] Support T.gemm_v2 for AMD Backend by @Paran0idy in #1136
- [BugFix] alloc_var init failed to handle complex expression by @kurisu6912 in #1144
- [Refactor] Remove amd gemm_v2 tests by @LeiWang1999 in #1149
- [BugFix] Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values by @Rachmanino in #1143
- [Bugfix] Implement classic arena algorithm for shmem merge and WAW conflict detection by @LeiWang1999 in #1146
- [CI] Allow dirty workspace for `format.sh` and introduce loop carry thread sync unit test by @LeiWang1999 in #1153
- [CI] Use Python urllib to download files instead of Wget by @XuehaiPan in #1154 (see the sketch after this list)
- [BugFix] Correct direct copy from bf16 to fp8 by @Cunxiao2002 in #1090
- [Refactor]: Move device_assert from extern_call to intrin_call by @yyttt6 in #1134
- [Enhancement] Enhance Cast operations Vectorization by @LJC00118 in #1156
- [Bugfix] Enhance LetStmt handling in Vectorize Loop Pass by @LeiWang1999 in #1159
- [Release] Bump version to v0.1.6.post2 by @LeiWang1999 in #1160
- [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi by @LeiWang1999 in #1108
- [Bugfix] Enable code lowering with producer‑copy‑only program by @LeiWang1999 in #1168
- [Bugfix] Support 16bits shfl_sync by @LeiWang1999 in #1169
- [Testing] Move TMA 1D and test for its functionality by @tzj-fxz in #1167
- [Refactor]: Change the params in pytest to avoid oom error during ci by @yyttt6 in #1170
- [Bugfix] Fix tvm import path for editable build by @LeiWang1999 in #1172
- [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion by @LeiWang1999 in #986
- [Language] Add correctness and performance check scripts for V2 by @LeiWang1999 in #1174
- [Bugfix] Legalize Datatype for mma intrinsic codegen by @LeiWang1999 in #1179
- [CI]: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1177
- [CI]: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1178
- [Language] Initial version of tilelang frontend v2 by @kurisu6912 in #1120
- [Fix] Fix type-incompatible error in #1115 by @kurisu6912 in #1180
- [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1183
- [Fix] Remove unsupported type params by @kurisu6912 in #1186
- [Feature] Enhance fill operation to support various buffer types by @LeiWang1999 in #1189
- [Refactor] Improve Python3.9 compatibility for ParamSpec and Self by @LeiWang1999 in #1190
- [Feat] Add swap like grammar in tuple assignment by @kurisu6912 in #1185
- [Release] Unify local build scripts to use `cibuildwheel` and reduce size of sdist by @oraluben in #1171
- [Language] Support n>256 for v2 by @LeiWang1999 in #1182
- [GQA] Use TMA in GQA bwd kernel to boost performance by @tzj-fxz in #1176
- [Example] Update GQA varlen fwd by @chengyupku in #1173
- [Refactor] Dynamic registration of FP8 data type for compatibility with older PyTorch versions by @LeiWang1999 in #1197
- [Feature] Add `tl.infinity` operator for infinity handling of bfloat16 by @Rachmanino in #1175
- [SM70] Refactor and minor fix for SM70 by @LeiWang1999 in #1195
- [CI] Enable `ccache` for CIBW on Linux by @oraluben in #1184
- [Feat] Add support for `T.serial` with step and negative step by @kurisu6912 in #1188
- [Feat] Add A Pass to Handle Negative Index by @kurisu6912 in #1192
- Fix type errors in `reduce.h` by @LJC00118 in #1204
- [Bugfix] Improves the accuracy of dependency analysis in the storage access by @LeiWang1999 in #1205
- [Bugfix][Language V2] Capture closure variables from program by @LeiWang1999 in #1206
- Fix Dockerfile.cu128 by @createthis in #1208
- [Enhancement] Improve handling of negative indices for ramp and broadcast node by @LeiWang1999 in #1207
- [Bugfix] Enhance LetStmt Handling in Pipeline Transform by @LeiWang1999 in #1212
- [Fix] Fix buffer re-import typo in tilelang.language by @kurisu6912 in #1214
- [Build] Explicitly add `libtvm` as a dep of `libtilelang` by @oraluben in #1215
- [Utils] Add source export, NVCC-based PTX/SASS dump, logging by @LeiWang1999 in #1216
- [Bugfix] Improve error handling in LayoutNode InverseWithLevel by @LeiWang1999 in #1220
- [Enhancement] Improve iterator handling in layout utilities and parallel operations by @LeiWang1999 in #1221
- [Language] Refactor reduce and support shared memory as its in/out by @LeiWang1999 in #1219
- [GQA] Add varlen decoding kernel with logits saving by @tzj-fxz in #1223
- [Enhancement] Add thread count validation for ReduceOp fragment layout inference by @LeiWang1999 in #1225
- [Refactor] Simplify logic in the `CompleteBufferFragment` by @LeiWang1999 in #1226
- [Refactor] Refactor version retrieval logic in tilelang package by @LeiWang1999 in #1227
- [CPU] Minor fix for cpu backend by @LeiWang1999 in #1230
- [Feature] Add Release Plan issue template by @LeiWang1999 in #1231
- [Fix] Fix a typo that caused wrong T.macro backtraces by @kurisu6912 in #1234
- [Refactor] Add kernel selection option for GEMM v1 in environment settings by @LeiWang1999 in #1200
- [Bugfix] Minor fix in `builder.py` by @LJC00118 in #1235
- [Language] Add type stubs for tir op by @kurisu6912 in #1239
- [Enhancement] Support Layout/Fragment Reshape by @LeiWang1999 in #1241
- [Bugfix] Minor fix for tcgen05 by @LeiWang1999 in #1242
- Refine RMSNorm epsilon in the example by @pengxin99 in #1243
- [AMD] enable amd ci test & fix bug & fix dockerfile by @Paran0idy in #1244
- [Refactor] Phaseout legacy loop vectorize dynamic pass by @LeiWang1999 in #1245
- [Bugfix] Fix fp8 dtype for some cases by @LeiWang1999 in #1246
- [Minor] Remove git_commit.txt by @SiriusNEO in #1249
- [Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape by @LeiWang1999 in #1248
- [Refactor] Update buffer handling in copy and atomic operations by @LeiWang1999 in #1247
- [Language] Add missing while statement by @kurisu6912 in #1254
- [BugFix] Add autotune and exp2 for GDN kernel by @tzj-fxz in #1258
- [BugFix] Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators by @Rachmanino in #1222
- [fix] NVRTC execution backend by @lucifer1004 in #1256
- [AMD] Update CK for ROCm7 by @Paran0idy in #1262
- [BugFix] Remove memory_order in atomic constexpr and fix NSA bwd by @KevinZeng08 in #1260
- [Example] Add GQA decoding kernel with varlen page table by @tzj-fxz in #1265
- [Refactor] add support for numpy dtype conversion by @kurisu6912 in #1255
- [EXAMPLE] In the flash attention example, keep the max of all blocks seen in scores_max for numerical stability by @vpj in #1148
- [Docs] Improve Installation Guide by @SiriusNEO in #1270
- [Enhancement] Keep max score attention across blocks in FlashAttention for better numerical stability by @Rachmanino in #1269
- [Bugfix] Fix multiple cg definitions when using T.sync_grid by @chengyupku in #1272
- [Minor] Remove `from __future__ import annotations` for python 3.8 by @oraluben in #1273
- [BugFix] Adding extra parameters into autotune hashkey by @SiriusNEO in #1274
- Fix various issues under `int64_t` static and dynamic shape by @Elevator14B in #1218
- Bug fix for Gated Delta Net benchmark script by @learning-chip in #1267
- [Bugfix] Minor fix for some cases by @LeiWang1999 in #1278
- [Language] Add shape check in `T.view/reshape` by @SiriusNEO in #1277
- [FFI] Use tvm ffi as the default execution backend by @LeiWang1999 in #1259
- [Bugfix] Supply missing `T.print` for bool type by @LeiWang1999 in #1279
- [Fix] Fix memory leak bug by @kurisu6912 in #1281
- [Enhancement] Enhance CUDA compilation by integrating pass context configuration by @LeiWang1999 in #1283
- Fix the bug in issue #1266 by @sea-with-sakura in #1284
- [Language][UX] Nested loop checker in pre-lowering stage by @SiriusNEO in #1288
- [Compatibility] Support CUDA 11.3 by @LeiWang1999 in #1290
- [Feat] Add support for using `T.Tensor(n * 2 + 1)` in function annotation by @kurisu6912 in #1285
- [Feat] Add missing support to pass reference by `T.Var` annotation by @kurisu6912 in #1291
- [Enhancement] Shared Memory Size Can be Dynamic by @LeiWang1999 in #1294
- [Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 by @kurisu6912 in #1305
- [Bugfix] Fallback to the old AtomicAdd implementation for legacy architectures by @LeiWang1999 in #1306
- [Fix] Fix frame scope error in T.macro by @kurisu6912 in #1308
- [WIP] support more dtypes for tcgen05 by @PannenetsF in #1229
- Improve memory access safety and `T.assume` handling by @LJC00118 in #1292
- [Bugfix] Fix autotune cache by @LeiWang1999 in #1315
- [Refactor] Backup Analyzer to get the appropriate arith information by @LeiWang1999 in #1311
- Revert "[WIP] support more dtypes for tcgen05 (#1229)" by @LeiWang1999 in #1323
- [CI]: Bump actions/checkout from 5 to 6 by @dependabot[bot] in #1319
- [CI]: Bump pypa/cibuildwheel from 3.2 to 3.3 by @dependabot[bot] in #1318
- [Installation] Fix building using customized TVM path by @SiriusNEO in #1326
- [Release] Allow developer with write permission to trigger wheel release by @oraluben in #1322
- [Feat] Support warp reduce by @Rachmanino in #1316
- [Enhancement] Support more dtypes in `T.print` by @xwhzz in #1329
- [BugFix] Use BufferRegion in tl.cumsum to infer buffer shape by @SiriusNEO in #1321
- [Fix] Fix uint narrowing bug in #1310 by @kurisu6912 in #1320
- [Refactor] Disable strided buffer load inside tvm (#1301) by @kurisu6912 in #1332
- [Refactor] Move `NormalizeToBufferRegion` and `MakeAccessPtrFromRegion` to utils by @LeiWang1999 in #1333
- [Fix] Fix bug copying from or to local buffer (#1304) by @kurisu6912 in #1324
- [Language][UX] Semantic check for parallel fragment access by @SiriusNEO in #1338
- Add unit tests for T.assume by @LJC00118 in #1341
- [Feat] Extend LegalizeNegativeIndex to support buffer store stmts by @ConvolutedDog in #1339
- [Refactor] Phaseout vmap for Tile Operators by @LeiWang1999 in #1334
- [Enhancement] add more dtype and fix mma.ws for fp16 for tcgen05 by @PannenetsF in #1327
- [Refactor] Enhance CopyNode's IterVar Creation and Range Handling by @LeiWang1999 in #1346
- [Fix] Fix missing `not` operator in frontend (#1347) by @kurisu6912 in #1348
- [Enhancement] Add support for k_pack in gemm_mfma by @Gongen-Ali in #1344
- Add sparse fine-tuning kernel for deepseek sparse attention to example by @hyx1999 in #1296
- [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder by @LeiWang1999 in #1352
- [Refactor] Simplify index sign state handling in LegalizeNegativeIndex by @LeiWang1999 in #1354
- [Enhancement] Improve error handling and assertion messages across runtime and argument binding by @LeiWang1999 in #1356
- [Bugfix] Disable floordiv optimization due to integer overflow risk by @LJC00118 in #1355
- [Bugfix] Fix the jit_kernel issue by @gfvvz in #1357
- [Bugfix] Bind thread range for fragment inference in Parallel strict layout inference stage. by @LeiWang1999 in #1359
- [Analysis] Enhance NestedLoopChecker with tile op cases by @SiriusNEO in #1358
- [Language] support `T.gemm_sp_v2` on sm80 and sm89 by @botbw in #1056
- [Bugfix] Update TIR registration for GemmSPPy to use tile operation by @LeiWang1999 in #1361
- [Enhancement] Implement dynamic unroll factor in CUDA code generation by @LeiWang1999 in #1360
- [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1362
- [Bugfix] Remove debug print in PyStmtFunctionVisitor by @LeiWang1999 in #1363
- [Debug] Always include line info in NVCC command for improved profiling by @LeiWang1999 in #1364
- [Enhancement] Minor fix to speed up testing by @LeiWang1999 in #1365
- [Enhancement] Add DISABLE_CACHE environment variables by @SiriusNEO in #1368
- [Refactor]: Remove useless include in atomicadd_vectorize.h by @yyttt6 in #1371
- [Refactor] Generalize fp8 process by @LeiWang1999 in #1372
- [Layout] Enhance Free Layout Inference by @LeiWang1999 in #1375
- [Enhancement] Introduce buffer var lca analysis for pass plan buffer allocations by @LeiWang1999 in #1376
- [Tool] Provide layout visualization tool by @Cunxiao2002 in #1353
- [Release] Relax constraint of tvm-ffi to compatible version by @oraluben in #1373
- [Language] Tilelang LazyJIT Experimental Version by @kurisu6912 in #1337
- [Builder] Enhance variable name binding and scope management by @LeiWang1999 in #1378
- [Bugfix] make cuda driver api compat with cuda12/13, along with tests by @PannenetsF in #1379
- [Fix] typo in cuda attr by @PannenetsF in #1380
- [Language V2] Minor fix for complex annotations by @LeiWang1999 in #1381
- [Release] Bump Version into 0.1.7 by @LeiWang1999 in #1377
- [Typing] Enhance compatibility for advanced typing features for Py39 by @LeiWang1999 in #1382
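A minimal sketch of the device-synchronization fallback described in #987, assuming a generic helper name and structure rather than the exact tilelang code: prefer `torch.accelerator` when the running PyTorch build exposes it, otherwise fall back to the CUDA-specific call.

```python
import torch

def synchronize_device() -> None:
    # torch.accelerator only exists in recent PyTorch releases, so guard the
    # lookup and fall back to torch.cuda.synchronize() on older builds.
    accelerator = getattr(torch, "accelerator", None)
    if accelerator is not None and accelerator.is_available():
        accelerator.synchronize()
    elif torch.cuda.is_available():
        torch.cuda.synchronize()
```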
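The version-parser hardening in #1061 amounts to tolerating missing or unavailable package metadata. A rough illustration of that idea, assuming the distribution name "tilelang" and a placeholder default (the real logic also covers the commit-ID option from #884):

```python
from importlib import metadata

def resolve_version(default: str = "0.0.0") -> str:
    # Source checkouts without an installed distribution have no metadata;
    # fall back to a default instead of raising at import time.
    try:
        return metadata.version("tilelang")
    except metadata.PackageNotFoundError:
        return default
```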
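#1154 replaces Wget with Python's standard library for CI downloads. A small sketch of that pattern with illustrative names (not the actual CI helper):

```python
import shutil
import urllib.request

def download(url: str, dst: str) -> None:
    # Stream the response to disk via urllib so CI runners do not need
    # wget installed.
    with urllib.request.urlopen(url) as resp, open(dst, "wb") as out:
        shutil.copyfileobj(resp, out)
```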
New Contributors
- @LJC00118 made their first contribution in #880
- @Edenzzzz made their first contribution in #900
- @zjudmd1015 made their first contribution in #921
- @lijinpei made their first contribution in #934
- @Zhichenzzz made their first contribution in #919
- @BBuf made their first contribution in #945
- @XuehaiPan made their first contribution in #950
- @iloveai8086 made their first contribution in #957
- @Degeneracy-Evil made their first contribution in #976
- @pre-commit-ci[bot] made their first contribution in #1183
- @createthis made their first contribution in #1208
- @pengxin99 made their first contribution in #1243
- @KevinZeng08 made their first contribution in #1260
- @vpj made their first contribution in #1148
- @Elevator14B made their first contribution in #1218
- @learning-chip made their first contribution in #1267
- @sea-with-sakura made their first contribution in #1284
- @PannenetsF made their first contribution in #1229
- @ConvolutedDog made their first contribution in #1339
- @Gongen-Ali made their first contribution in #1344
Full Changelog: 0.1.6...v0.1.7