
Releases: triton-lang/triton

Triton 3.4.0 Release

30 Jul 20:47
c817b9b

Highlights

Comprehensive Gluon Framework Enhancements

The Gluon framework received major enhancements across the board: new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert functionality, TensorDescriptor kernel arguments, async TMA operations, a tensor memory implementation, thread synchronization barriers, and tensor operations such as split/join/reshape and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)

Hardware Support Expansion

  • AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architecture-specific features (#7175, #7171, #7127, #6744, #6594)
  • Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
  • Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)

Performance Optimizations

  • Automatic Warp Specialization - Introduced automatic warp specialization for improved kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
  • MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
  • TMA Operations Enhancement - Improved tensor memory access with better layout support and reduced register pressure (#6725, #6238, #6580)
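The pipelining idea behind the MMAv5 work above can be sketched in plain Python. This is a conceptual illustration only, not Triton's implementation; the names `load`, `compute`, and `pipelined_sum` are hypothetical. The point is that the load of tile i+1 is issued before the compute on tile i, so on hardware the two would overlap.

```python
def load(tiles, i):
    # Stand-in for an async global->shared-memory copy.
    return tiles[i]

def compute(tile):
    # Stand-in for a tensor-core MMA on the staged tile.
    return sum(tile)

def pipelined_sum(tiles):
    """Two-stage software pipeline: prologue load, steady-state
    overlap of load(i+1) with compute(i), then an epilogue drain."""
    staged = load(tiles, 0)          # prologue: prefetch first tile
    total = 0
    for i in range(1, len(tiles)):
        nxt = load(tiles, i)         # issue next load (async on hardware)
        total += compute(staged)     # compute on the previously staged tile
        staged = nxt
    total += compute(staged)         # epilogue: drain the last tile
    return total
```

The result is identical to a naive load-then-compute loop; pipelining only changes when the loads are issued, which is what hides memory latency behind compute.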

New Features

Language and Frontend

  • Aggregate Type Support - Added @tl.aggregate decorator for autogenerating Triton types from Python classes (#6970)
  • JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
  • Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
  • Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
  • Masked Histograms - Added support for masked histogram operations (#6695)
  • Syntactic Sugar Additions - Added .item() as syntactic sugar for .reshape([]) (#6873)
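The bitonic top-k item above refers to a classic sorting-network approach; a plain-Python sketch of the underlying idea follows. The helper names are hypothetical and this is not Triton's implementation (which runs the compare-exchange steps data-parallel on tensors); it assumes power-of-two input lengths, as bitonic networks do.

```python
def bitonic_merge(a, ascending):
    """Merge a bitonic sequence into sorted order via compare-exchange
    at stride n/2, then recurse on each half."""
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    for i in range(half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

def bitonic_sort(a, ascending=True):
    """Sort by building a bitonic sequence (ascending half + descending
    half), then merging it. Length must be a power of two."""
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    return bitonic_merge(first + second, ascending)

def topk(a, k):
    # Top-k as a descending bitonic sort followed by a prefix take.
    return bitonic_sort(list(a), ascending=False)[:k]
```

Because every compare-exchange happens at a fixed, data-independent position, the network maps well to SIMT hardware, which is why GPU top-k implementations favor it.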

Backend and Compilation

  • Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
  • Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
  • TMA Reduce Operations - Added TMA reduce operations for descriptor-based reduction stores (#6580)
  • Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
  • BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519)
  • Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
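The generic swizzling work above concerns how tiles are laid out in shared memory during layout conversions. The classic trick it builds on can be sketched as an XOR swizzle; this is an illustrative sketch with a hypothetical function name, not the algorithm from #6982.

```python
def xor_swizzle(row, col, width=8):
    """Swizzled column index: XOR the column with the row (mod width).
    Spreads a column of the tile across distinct memory banks."""
    return col ^ (row % width)

# Within each row, swizzling is a permutation of the columns
# (XOR with a constant is a bijection), so no data is lost.
for r in range(8):
    assert sorted(xor_swizzle(r, c) for c in range(8)) == list(range(8))

# Down a fixed column, consecutive rows map to 8 distinct swizzled
# columns, so a column access touches every bank once instead of
# hitting one bank 8 times (a worst-case bank conflict).
assert sorted(xor_swizzle(r, 0) for r in range(8)) == list(range(8))
```

A conversion between two register layouts stages data through shared memory; without a swizzle, the store and load patterns tend to serialize on bank conflicts, which is what a swizzled layout avoids.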

Hardware-Specific Features

  • AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
  • AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
  • AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
  • NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
  • AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)

Improvements

Performance

  • Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
  • Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
  • SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
  • Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)

Developer Experience

  • Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
  • Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
  • Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
  • Build System Improvements - Better CMake support and dependency management (#6330, …)