Releases: vllm-project/tpu-inference

v0.13.2

30 Dec 09:05

This release brings several new features and improvements for vLLM TPU Inference.

Highlights

Ironwood Support: All relevant dependencies have been upgraded to support Ironwood (v7x), and CI/CD has been updated to reflect this change.

For details on how build requirements for v7x differ from previous TPU generations (v6e and prior), see the QuickStart and TPU Setup documentation.

P/D Disaggregated Serving over DCN: Ray-based prefill/decode disaggregation with KV cache transfer over DCN.
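
As a minimal sketch, this kind of disaggregation is typically wired up through vLLM's KV-transfer settings: one engine acts as the prefill (KV producer) instance and another as the decode (KV consumer) instance. The connector name below is a placeholder, not the actual tpu-inference connector; consult the project documentation for the real connector and its DCN/Ray options.

```python
# Hypothetical sketch only: "RayDCNConnector" is a placeholder name, and in
# practice the two engines would run as separate serving processes on
# different hosts connected over DCN.
from vllm import LLM
from vllm.config import KVTransferConfig

prefill_llm = LLM(  # prefill instance: produces the KV cache
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="RayDCNConnector",  # placeholder connector name
        kv_role="kv_producer",
    ),
)

decode_llm = LLM(  # decode instance: consumes the transferred KV cache
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="RayDCNConnector",  # placeholder connector name
        kv_role="kv_consumer",
    ),
)
```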

Multi-LoRA for PyTorch Models: Multi-LoRA support has landed for PyTorch model definitions from vLLM. A JAX-native solution will be supported shortly.
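
A minimal sketch using vLLM's standard multi-LoRA API; the base model and the adapter path are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    enable_lora=True,
    max_loras=4,        # number of adapters resident at once
    max_lora_rank=16,
)

outputs = llm.generate(
    ["Summarize the following support ticket: ..."],
    SamplingParams(max_tokens=128),
    # (adapter name, adapter id, local path or HF repo of the LoRA weights)
    lora_request=LoRARequest("support-adapter", 1, "/path/to/lora/adapter"),
)
print(outputs[0].outputs[0].text)
```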

Run:AI Model Streaming: The Run:AI Model Streamer is a direct Google Cloud Storage model-download accelerator. It has been demonstrated to be the easiest and fastest way to pull models from GCS into GPU memory, and we now provide the same experience on TPUs.
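
A minimal sketch of streaming weights straight from a GCS bucket; the bucket path is a placeholder, and the Run:AI streamer extra (e.g. pip install vllm[runai]) may need to be installed.

```python
from vllm import LLM

# Stream weights directly from GCS instead of downloading to local disk first.
llm = LLM(
    model="gs://my-bucket/models/Llama-3.1-8B-Instruct",  # placeholder path
    load_format="runai_streamer",
)
```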

What's Changed

v0.12.0

12 Dec 20:28

This release brings several new features and improvements for vLLM TPU Inference.

Highlights

Async Scheduler: Enabled the async scheduler in tpu-inference for improved performance on smaller models.
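
A minimal sketch, assuming the feature is exposed through the standard vLLM async_scheduling engine argument; the exact flag, and whether tpu-inference enables it by default, may differ.

```python
from vllm import LLM

# Placeholder small model; async scheduling overlaps CPU-side scheduling with
# device execution, which helps most when per-step compute is short.
llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    async_scheduling=True,  # assumed engine argument
)
```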

Spec Decoder EAGLE-3: Added support for the EAGLE-3 speculative-decoding variant, with verified performance for Llama 3.1-8B.
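
A minimal sketch of enabling EAGLE-3 through vLLM's speculative_config; the draft-model path and the number of speculative tokens are placeholders.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "/path/to/eagle3-draft-model",  # placeholder draft model
        "num_speculative_tokens": 2,             # placeholder value
    },
)
```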

Out-of-Tree Model Support: Load custom JAX models as plugins, enabling users to serve custom model architectures without forking or modifying vLLM internals.
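
A minimal sketch using vLLM's general plugin mechanism for out-of-tree models; tpu-inference may provide its own registration hook for JAX model definitions, so the architecture name and import path below are placeholders.

```python
from vllm import ModelRegistry

def register():
    # Map an architecture name (as it appears in the model's config.json)
    # to the plugin's implementation class. Both names are placeholders.
    ModelRegistry.register_model(
        "MyJaxForCausalLM", "my_plugin.models:MyJaxForCausalLM"
    )
```

The register() function is then exposed as a vllm.general_plugins entry point in the plugin package's metadata so it runs automatically when the engine starts.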

Automated CI/CD and Pre-merge Checks: Improved the testing and validation pipeline with automated CI/CD and pre-merge checks to enhance stability and accelerate iteration. More improvements to come.

What's Changed
