Releases: vllm-project/tpu-inference
v0.13.2
This release brings several new features and improvements for vLLM TPU Inference.
Highlights
**Ironwood Support** All relevant dependencies have been updated to support Ironwood (v7x), and CI/CD has been updated to reflect this change.
For further information on the build requirements for v7x, which differ from those of previous TPU generations (v6e and prior), please see the following documentation:
- QuickStart
- TPU Setup
**P/D Disaggregated Serving over DCN** Ray-based prefill/decode disaggregation with KV cache transfer over DCN.
**Multi-LoRA for PyTorch Models** Multi-LoRA support has landed for PyTorch model definitions from vLLM. A JAX-native implementation will follow shortly.
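As an illustrative sketch, serving with LoRA adapters might look like the following, using vLLM's standard LoRA flags; the model name and adapter path are placeholders, not values from this release.

```shell
# Hypothetical invocation; model name and adapter path are placeholders.
# --enable-lora and --lora-modules are vLLM's standard multi-LoRA flags.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-adapter=/path/to/sql-lora-adapter \
  --max-loras 4
```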
**Run:AI Model Streaming** The Run:AI Model Streamer accelerates direct model downloads from Google Cloud Storage. It has proven to be the easiest and fastest way to pull models from GCS into GPU memory, and we now provide the same experience on TPUs.
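A minimal sketch of streaming a model directly from GCS, assuming vLLM's standard `--load-format runai_streamer` option; the bucket path is a placeholder.

```shell
# Placeholder bucket path; --load-format runai_streamer selects vLLM's
# Run:AI Model Streamer load path for direct object-storage streaming.
vllm serve gs://my-bucket/models/llama-3.1-8b \
  --load-format runai_streamer
```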
What's Changed
- [Misc] Update tpu-info by @kyuyeunk in #1214
- fix: update nightly date format to YYYYMMDD by @ylangtsou in #1213
- [Kernel] Remove KV masking by performing full bkv fetches in the first 2 steps by @yaochengji in #1240
- Refactor moe codebase by @kyuyeunk in #1199
- [multihost] Add NEW_MODEL_DESIGN to additional_env_vars by @Lumosis in #1236
- [DP] Add model DP support for JAX GPT-OSS by @wenxindongwork in #1247
- Fix circular reference that caused tpu_platform import failure by @mrjunwan-lang in #1251
- [Bug fix] Fix DP + Hybrid KV cache numerics by @wenxindongwork in #1249
- add get_kv_connector_handshake_metadata in tpu_worker by @mrjunwan-lang in #1254
- Integrate MLA v1 into DeepSeek-v3 by @gpolovets1 in #1190
- Fix bug that PP assign wrong rank in distributed TP by @mrjunwan-lang in #1256
- [Disagg] local disagg e2e test by @sixiang-google in #1237
- Fix image tests. by @QiliangCui in #1253
- Fixing a few failures in tests/test_quantization.py. by @gpolovets1 in #1258
- [RPA] Revert previous changes due to numeric issue by @kyuyeunk in #1242
- [Misc] Update torchax with fp4 support by @kyuyeunk in #1257
- Update support matrices by @boe20211 in #1232
- Update request_distribution in DP input preparation by @wenxindongwork in #1211
- Fix FP8 dtype type mismatch issue by @helloworld1 in #1235
- Add disagg test to v6e-8 queue by @sixiang-google in #1259
- Add an argument to TpuPlatform.get_attn_backend_cls to adopt interfac… by @QiliangCui in #1263
- Update README.md by @bvrockwell in #1197
- Backward compatibility for NEW_MODEL_DESIGN=True by @wenxindongwork in #1267
- Delete b/ from PR template. by @QiliangCui in #1268
- [Disagg] Refined e2e test cleanup by @sixiang-google in #1265
- Remove a branch with pl.when in fetching bkv by @rupengliu-meta in #1239
- Add a lora perf test by @vanbasten23 in #1272
- Fix moe layer from upstream change by @kyuyeunk in #1274
- [RPA] Pipeline flash attention in default kernel by @jrplatin in #1203
- First check-in to add ci/cd test on tpuv7x by @QiliangCui in #1270
- clear xla compilation cache before each disagg server launch by @sixiang-google in #1271
- Reduce image size and enhance caching by @wdhongtw in #1245
- [Kernel][FusedMoE] Fix MoE crash and hang issues by @bythew3i in #1252
- [Quantization] Add option to bypass quantized matmul kernel for W8A8-FP8 Compressed Tensors by @jrplatin in #1273
- Replacing bit_width() with itemized_bits(). by @aman2930 in #1264
- Enable All Tests on TPUv7 by @QiliangCui in #1279
- add github action for check ready label by @boe20211 in #1269
- [Bugfix][Deprecate] Update for vllm v0.13 by @kyuyeunk in #1284
- Add default 'auto' MODEL_IMPL_TYPE that resolves based on architecture by @xingliu14 in #1255
- [Misc] Fix how model dtype is being configured by @kyuyeunk in #1286
- [Bugfix][Refactor] Fix compressed tensor moe init by @kyuyeunk in #1283
- update run_in_docker script for running on local env by @ernie-chang in #1243
- Remove pip install from setup_docker_env.sh. by @QiliangCui in #1292
- Update multihost disagg sh to prepare integrate with buildkite by @mrjunwan-lang in #1295
- Revert "Update multihost disagg sh to prepare integrate with buildkit… by @QiliangCui in #1297
- [Misc] Disable torchax.tensor logger warning by @kyuyeunk in #1301
- Support overriding logic for hybrid kv cache padding by @kyuyeunk in #1285
- Update multihost disagg sh to prepare integrate with buildkite by @mrjunwan-lang in #1304
- Update disagg multi host script health check logic by @mrjunwan-lang in #1306
- [Misc][RPA] Update to use logger in kernel_hd64.py by @kyuyeunk in #1302
- Update libtpu version for tpuv7. by @QiliangCui in #1305
- Fix a test pipeline bug and add TODO. by @QiliangCui in #1309
- Avoid installing CUDA related stuff by @wdhongtw in #1246
- [Kernel][Misc] Remove jax.named_scope by @kyuyeunk in #1278
- Use 50 bit uuid for KV transfer key to avoid GO trunk the int in GKE by @mrjunwan-lang in #1310
- Use AttentionSelectorConfig in get_attn_backend_cls by @karan in #1313
- Add Quantized Weights Support for MoE Layers by @kyuyeunk in #1300
- Fix for vLLM's benchmarking case change. by @patemotter in #1316
- Enable Pipeline Parallelism on Jax models by @Chenyaaang in #1077
- Restrict PP size to either 1 or host size in ray by @Chenyaaang in #1318
- Fix the lora column_parallel_packed test on v7x by @vanbasten23 in #1314
- Add dummy placeholder for unsupported models in the support matrix by @boe20211 in #1291
- Fix lora layer unit tests for v7x2. by @vanbasten23 in #1319
- Use vllm models when PP is enabled by @Chenyaaang in #1321
- Fix model loader unit test by @Chenyaaang in #1324
- Integrate the E2E multi-host disagg serving into buildkite by @mrjunwan-lang in #1323
- support fp8 compressed-tensors moe by @coolkp in #1320
- [RPA] Optimize masking and sliding window by @kyuyeunk in #1325
- Fix scale sharding in ep case by @coolkp in #1326
- [Torchax] fp8 quantization skeleton by @xingliu14 in #1307
- Update tuned block size by considering the sliding window. by @vanbasten23 in #1328
- Add Apache license. by @QiliangCui in #1339
- Add pre-commit for adding license. by @QiliangCui in https://gith...
v0.12.0
This release brings several new features and improvements for vLLM TPU Inference.
Highlights
**Async Scheduler** Enabled the async scheduler in tpu-inference for improved performance on smaller models.
**Spec Decoding with EAGLE-3** Added support for the EAGLE-3 variant, with verified performance on Llama 3.1-8B.
**Out-of-Tree Model Support** Load custom JAX models as plugins, enabling users to serve custom model architectures without forking or modifying vLLM internals.
**Automated CI/CD and Pre-merge Checks** Improved the testing and validation pipeline with automated CI/CD and pre-merge checks to enhance stability and accelerate iteration. More improvements to come.
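As an illustration of the EAGLE-3 highlight above, a serving invocation might look like the following, assuming vLLM's standard `--speculative-config` flag; the draft model path and token count are placeholders.

```shell
# Hypothetical sketch; the EAGLE-3 draft model path and speculative token
# count are placeholders. --speculative-config takes a JSON string in vLLM.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "/path/to/eagle3-draft", "num_speculative_tokens": 3}'
```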
What's Changed
- [Bug Fix] Fix small bug in server-based profiling init by @jrplatin in #872
- [Disagg][Bugfix] add check for global devices in profiler start by @sixiang-google in #874
- [CI] Fix imports to catchup vLLM's recent update. by @hfan in #876
- [Kernel] Added a RPA V3 kernel variant optimized for head_dim=64 by @yaochengji in #875
- Update README.md by @bvrockwell in #880
- Remove convert_list_to_device_array to reduce latency between model forward pass by @Lumosis in #879
- fix mock device error in profile enabling by @sixiang-google in #878
- [RPA] Reduce VREG spill by optimize masking by @kyuyeunk in #818
- [Doc] Fixed the docker path for the quick start guide by @hosseinsarshar in #885
- Fix/docs links by @RobMulla in #873
- Light rewording of jax model development readme by @gpolovets1 in #871
- Revert "[CI] Fix imports to catchup vLLM's recent update." by @hfan in #887
- docs: Clarify support matrix messaging by @RobMulla in #886
- Docs rename reco page by @RobMulla in #888
- Update README.md by @bvrockwell in #897
- [Profiling] Pull Over the TPU Profiler from vLLM + add profiling docs by @jrplatin in #882
- [Misc] Fix various vLLM import issues by @jrplatin in #900
- Revert "[Misc] Fix various vLLM import issues" by @hfan in #902
- [Misc] Fix failing phased-based profiling test by @jrplatin in #905
- Added the docker login instructions by @hosseinsarshar in #891
- Unpin upstream vllm version by @jcyang43 in #904
- [Bug fix] Fix v7 HBM limit by @wenxindongwork in #903
- Enable spmd on lora by @vanbasten23 in #829
- Support --enforce-eager by @kyuyeunk in #907
- [CI] Fixes to catchup with vllm changes by @hfan in #912
- [Docker] Add V7X requirements and update Docker to accept option to build using it by @jrplatin in #916
- Fix the jax device ordering. by @wang2yn84 in #915
- Update the disagg multi host sh file to setup the disagg inference in… by @mrjunwan-lang in #922
- [Llama4/JAX] Refactor RoPE Scaling, QK Norm, and No-RoPE Layer Config Handling for Maverick by @sierraisland in #923
- [Bug fix + Qwix] Add JAX quantization YAMLs to WHL build + add fp8 quantization configs by @jrplatin in #929
- Enable multi-host P/D and adopt the vllm distributed executor changes by @mrjunwan-lang in #932
- fix the unit test to adopt vllm API changes by @mrjunwan-lang in #933
- [CI] Fix Qwen2.5 VL get_mrope_input_positions after vLLM change. by @kwang3939 in #934
- [Disagg] Use pathways resharding api to handle transfer by @sixiang-google in #935
- [Misc] Report TPU usage by @hfan in #925
- [CI] Use real vLLM ModelConfig object in init_device test by @hfan in #937
- update the ports to make the ports consistent in single host and multihost by @mrjunwan-lang in #938
- [Spec Decoding] Merge jitted helpers for eagle3 by @Lumosis in #920
- [GPT-OSS] JAX implementation of GPT-OSS by @bzgoogle in #861
- [Bug fixes] Update vLLM imports by @jrplatin in #947
- [Misc] Move numba installation to requirements.txt by @py4 in #948
- [Multi-host] Fix bugs in the deployment script by @Lumosis in #940
- Fix issues when running multiple LoRA tests on the v6e-8 machine. by @vanbasten23 in #926
- [Bug fixes] Fix a few more vLLM imports + Dockerfile typo by @jrplatin in #953
- Add the bgmv tests by @vanbasten23 in #942
- [MMLU] Add chat-template support for MMLU by @bzgoogle in #952
- [RPA] Add attention sink support to 64 dim variant of RPA kernel by @kyuyeunk in #958
- Revert "Add the bgmv tests" by @vanbasten23 in #963
- fix the vllm import issue for round_down by @mrjunwan-lang in #965
- Update docs to include installation guide with building from source. by @RobMulla in #949
- Reduce the host overhead for LoRA by @vanbasten23 in #930
- [GPT-OSS] uncomment sink related changes as the kernel_hd64.py was merged by @bzgoogle in #966
- Add bgmv test by @vanbasten23 in #964
- [CI] Skip build if only docs/icons changed by @boe20211 in #908
- [Spec Decoding] Fix precompilation by @Lumosis in #960
- fix the bug in kv transfer params is None by @mrjunwan-lang in #969
- [GPT-OSS] fix unstable sparse sum among different by @bzgoogle in #968
- fused Moe by @bythew3i in #973
- fix readme links to the docs by @RobMulla in #974
- [Feature] Code implementation of Async Scheduler by @cychiuak in #924
- [Misc] Fix observability config to prevent error from upstream by @py4 in #979
- add unit test for tpu_connector.py by @mrjunwan-lang in #980
- [Model] Add vision encoder and input embeddings merger warmup for Qwen2.5 VL model by @kwang3939 in #972
- Fix the test of multimodal manager by @kwang3939 in #986
- Fix the test of tpu_jax_runner by @kwang3939 in #989
- [Misc] Attempt to fix hash mismatch in CI if it's because of incomplete download by @py4 in #994
- [RPA] Update attention_sink to use prepare_inputs by @kyuyeunk in #993
- [Misc] Only run JAX unit tests and few e2e tests for each PR in CI. by @py4 in #995
- [Misc] Remove unused interfaces by @py4 in #990
- [Misc] Fix buildkite yaml format. by @py4 in #997
- Update README.md by @bvrockwell in #998
- Fix kv cache shape for head_dim=64 by @yaochengji in #976
- Add precommit hook for detecting missing init.py files by @jcyang43 in #1001
- Fix grid size calculation in qwen2.5-vl vision encoder warmup by @kwang3939 in #1004
- [Runner] Separate execute_model and sample_tokens to adapt upstream change. by @py4 in #1003
- [Misc] Change buildkite pipeline to run all steps but skip some through command by @py4 in https://github.com/vllm-project/tpu-i...