v0.1.1: Faster Token Generation ⚡

@lcy-seso lcy-seso released this 23 Dec 13:06
20a862c

🚀 TileRT v0.1.1 – Ultra-Low-Latency Token Generation

TileRT v0.1.1 speeds up token generation, cutting latency by about 35% compared to the previous release.
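
To put the latency figure in throughput terms, here is a rough back-of-the-envelope sketch. It assumes a single generation stream, where tokens per second is simply the inverse of per-token latency. The 10 ms baseline is a hypothetical number chosen only to make the arithmetic concrete; it is not a TileRT measurement.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Return the throughput multiplier implied by a fractional latency reduction.

    Assumes throughput = 1 / latency (one stream, no batching effects).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# Hypothetical baseline latency, for illustration only (not a measured value).
baseline_ms_per_token = 10.0
reduction = 0.35  # the approximate figure quoted in these release notes

new_ms_per_token = baseline_ms_per_token * (1.0 - reduction)
speedup = speedup_from_latency_reduction(reduction)

print(f"{new_ms_per_token:.2f} ms/token, {speedup:.2f}x tokens per second")
# prints "6.50 ms/token, 1.54x tokens per second"
```

In other words, a 35% latency reduction corresponds to roughly 1.54x as many tokens per second under this assumption, not 1.35x.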

This improvement is achieved through optimizations to core operators and enhancements to the tile-level runtime engine. Key updates include faster GEMV kernels, expanded FP8/BF16 support across multiple kernels, and improved runtime scheduling and memory behavior.

✨ Highlights

  • Performance Boost: Token generation latency is down by around 35% from the previous release. See our latest speed tests for exact figures.
  • Operator & Precision Optimizations: Faster GEMV, RMSNorm, and MMA-based operators with expanded FP8/BF16 support.
  • Runtime Enhancements: Improved tile-level scheduling, prefetching, memory alignment, and multi-device task handling.
  • Stability Fixes: Resolved issues affecting runtime stability and memory behavior.

🔧 What’s Changed

🚀 Performance & Operators

  • Optimized GEMV and RMSNorm operators for improved performance.
  • Expanded FP8/BF16 support across multiple kernels.
  • Improved expert selection performance.
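
For readers less familiar with the operators named above, the following is a plain-Python reference sketch of what GEMV and RMSNorm compute. It is not TileRT code and says nothing about how the optimized kernels are written; the function names, the `eps` default, and the list-based types are illustrative assumptions. GEMV matters for token generation because, when decoding one token at a time, each linear layer reduces to a matrix-vector product.

```python
import math
from typing import List, Sequence


def gemv(matrix: Sequence[Sequence[float]], vector: Sequence[float]) -> List[float]:
    """Reference matrix-vector product: y[i] = sum_j matrix[i][j] * vector[j]."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("every matrix row must match the vector length")
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def rms_norm(
    x: Sequence[float], weight: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Reference RMSNorm: scale x by 1 / sqrt(mean(x^2) + eps), then by weight."""
    if len(x) == 0 or len(x) != len(weight):
        raise ValueError("x and weight must be non-empty and the same length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# Toy usage: normalize a hidden vector, then project it with a weight matrix.
hidden = [3.0, 4.0]
normed = rms_norm(hidden, weight=[1.0, 1.0])
projected = gemv([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], normed)
print(projected)
```

A reference like this is mainly useful as a correctness baseline: an optimized FP8 or BF16 kernel can be compared against it within a tolerance suited to the lower precision.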

⚙️ Runtime & Kernel Execution

  • Enhanced tile-level runtime engine for better scheduling, prefetching, and memory management.
  • Fixed issues with shared-memory alignment and inter-operator dependencies.

🔮 Looking Ahead

TileRT is under active development. The next release and upcoming work will focus on:

  • Further latency reductions in token generation.
  • Introduction of new features, including MTP support.
  • Opening the weight converter, enabling decoupled layouts and more flexible kernel optimizations.

Refactoring and improvements to the operators and runtime engine are ongoing. We invite the community to follow our progress, test new features, and share feedback to help shape the future development of TileRT.