Releases: vllm-project/tpu-inference
v0.13.2
This release brings several new features and improvements for vLLM TPU Inference.
Highlights
**Ironwood Support** All relevant dependencies have been updated to support Ironwood (v7x), and CI/CD has been updated to reflect this change.
For further information on the build requirements for v7x, which differ from those of previous TPU generations (v6e and prior), please see the following documentation:
- QuickStart
- TPU Setup
**P/D Disaggregated Serving over DCN** Ray-based prefill/decode disaggregation with KV cache transfer over DCN.
**Multi-LoRA for PyTorch Models** Multi-LoRA support has landed for PyTorch model definitions from vLLM. A JAX-native implementation will follow shortly.
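As an illustrative sketch, serving with LoRA adapters might look like the following, using vLLM's standard LoRA flags; the model name and adapter path are placeholders, not values from this release.

```shell
# Hypothetical invocation; model name and adapter path are placeholders.
# --enable-lora and --lora-modules are vLLM's standard multi-LoRA flags.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-adapter=/path/to/sql-lora-adapter \
  --max-loras 4
```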
**Run:AI Model Streaming** The Run:AI Model Streamer accelerates direct model downloads from Google Cloud Storage. It has proven to be the easiest and fastest way to pull models from GCS into GPU memory, and we now provide the same experience on TPUs.
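A minimal sketch of streaming a model directly from GCS, assuming vLLM's standard `--load-format runai_streamer` option; the bucket path is a placeholder.

```shell
# Placeholder bucket path; --load-format runai_streamer selects vLLM's
# Run:AI Model Streamer load path for direct object-storage streaming.
vllm serve gs://my-bucket/models/llama-3.1-8b \
  --load-format runai_streamer
```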
What's Changed
- [Misc] Update tpu-info by @kyuyeunk in #1214
- fix: update nightly date format to YYYYMMDD by @ylangtsou in #1213
- [Kernel] Remove KV masking by performing full bkv fetches in the first 2 steps by @yaochengji in #1240
- Refactor moe codebase by @kyuyeunk in #1199
- [multihost] Add NEW_MODEL_DESIGN to additional_env_vars by @Lumosis in #1236
- [DP] Add model DP support for JAX GPT-OSS by @wenxindongwork in #1247
- Fix circular reference that caused tpu_platform import failure by @mrjunwan-lang in #1251
- [Bug fix] Fix DP + Hybrid KV cache numerics by @wenxindongwork in #1249
- add get_kv_connector_handshake_metadata in tpu_worker by @mrjunwan-lang in #1254
- Integrate MLA v1 into DeepSeek-v3 by @gpolovets1 in #1190
- Fix bug that PP assign wrong rank in distributed TP by @mrjunwan-lang in #1256
- [Disagg] local disagg e2e test by @sixiang-google in #1237
- Fix image tests. by @QiliangCui in #1253
- Fixing a few failures in tests/test_quantization.py. by @gpolovets1 in #1258
- [RPA] Revert previous changes due to numeric issue by @kyuyeunk in #1242
- [Misc] Update torchax with fp4 support by @kyuyeunk in #1257
- Update support matrices by @boe20211 in #1232
- Update request_distribution in DP input preparation by @wenxindongwork in #1211
- Fix FP8 dtype type mismatch issue by @helloworld1 in #1235
- Add disagg test to v6e-8 queue by @sixiang-google in #1259
- Add an argument to TpuPlatform.get_attn_backend_cls to adopt interfac… by @QiliangCui in #1263
- Update README.md by @bvrockwell in #1197
- Backward compatibility for NEW_MODEL_DESIGN=True by @wenxindongwork in #1267
- Delete b/ from PR template. by @QiliangCui in #1268
- [Disagg] Refined e2e test cleanup by @sixiang-google in #1265
- Remove a branch with pl.when in fetching bkv by @rupengliu-meta in #1239
- Add a lora perf test by @vanbasten23 in #1272
- Fix moe layer from upstream change by @kyuyeunk in #1274
- [RPA] Pipeline flash attention in default kernel by @jrplatin in #1203
- First check-in to add ci/cd test on tpuv7x by @QiliangCui in #1270
- clear xla compilation cache before each disagg server launch by @sixiang-google in #1271
- Reduce image size and enhance caching by @wdhongtw in #1245
- [Kernel][FusedMoE] Fix MoE crash and hang issues by @bythew3i in #1252
- [Quantization] Add option to bypass quantized matmul kernel for W8A8-FP8 Compressed Tensors by @jrplatin in #1273
- Replacing bit_width() with itemized_bits(). by @aman2930 in #1264
- Enable All Tests on TPUv7 by @QiliangCui in #1279
- add github action for check ready label by @boe20211 in #1269
- [Bugfix][Deprecate] Update for vllm v0.13 by @kyuyeunk in #1284
- Add default 'auto' MODEL_IMPL_TYPE that resolves based on architecture by @xingliu14 in #1255
- [Misc] Fix how model dtype is being configured by @kyuyeunk in #1286
- [Bugfix][Refactor] Fix compressed tensor moe init by @kyuyeunk in #1283
- update run_in_docker script for running on local env by @ernie-chang in #1243
- Remove pip install from setup_docker_env.sh. by @QiliangCui in #1292
- Update multihost disagg sh to prepare integrate with buildkite by @mrjunwan-lang in #1295
- Revert "Update multihost disagg sh to prepare integrate with buildkit… by @QiliangCui in #1297
- [Misc] Disable torchax.tensor logger warning by @kyuyeunk in #1301
- Support overriding logic for hybrid kv cache padding by @kyuyeunk in #1285
- Update multihost disagg sh to prepare integrate with buildkite by @mrjunwan-lang in #1304
- Update disagg multi host script health check logic by @mrjunwan-lang in #1306
- [Misc][RPA] Update to use logger in kernel_hd64.py by @kyuyeunk in #1302
- Update libtpu version for tpuv7. by @QiliangCui in #1305
- Fix a test pipeline bug and add TODO. by @QiliangCui in #1309
- Avoid installing CUDA related stuff by @wdhongtw in #1246
- [Kernel][Misc] Remove jax.named_scope by @kyuyeunk in #1278
- Use 50 bit uuid for KV transfer key to avoid GO trunk the int in GKE by @mrjunwan-lang in #1310
- Use AttentionSelectorConfig in get_attn_backend_cls by @karan in #1313
- Add Quantized Weights Support for MoE Layers by @kyuyeunk in #1300
- Fix for vLLM's benchmarking case change. by @patemotter in #1316
- Enable Pipeline Parallelism on Jax models by @Chenyaaang in #1077
- Restrict PP size to either 1 or host size in ray by @Chenyaaang in #1318
- Fix the lora column_parallel_packed test on v7x by @vanbasten23 in #1314
- Add dummy placeholder for unsupported models in the support matrix by @boe20211 in #1291
- Fix lora layer unit tests for v7x2. by @vanbasten23 in #1319
- Use vllm models when PP is enabled by @Chenyaaang in #1321
- Fix model loader unit test by @Chenyaaang in #1324
- Integrate the E2E multi-host disagg serving into buildkite by @mrjunwan-lang in #1323
- support fp8 compressed-tensors moe by @coolkp in #1320
- [RPA] Optimize masking and sliding window by @kyuyeunk in #1325
- Fix scale sharding in ep case by @coolkp in #1326
- [Torchax] fp8 quantization skeleton by @xingliu14 in #1307
- Update tuned block size by considering the sliding window. by @vanbasten23 in #1328
- Add Apache license. by @QiliangCui in #1339
- Add pre-commit for adding license. by @QiliangCui in https://gith...
v0.12.0
This release brings several new features and improvements for vLLM TPU Inference.
Highlights
**Async Scheduler** Enabled the async scheduler in tpu-inference for improved performance on smaller models.
**Spec Decoding with EAGLE-3** Added support for the EAGLE-3 variant, with verified performance on Llama 3.1-8B.
**Out-of-Tree Model Support** Load custom JAX models as plugins, enabling users to serve custom model architectures without forking or modifying vLLM internals.
**Automated CI/CD and Pre-merge Checks** Improved the testing and validation pipeline with automated CI/CD and pre-merge checks to enhance stability and accelerate iteration. More improvements to come.
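As an illustration of the EAGLE-3 highlight above, a serving invocation might look like the following, assuming vLLM's standard `--speculative-config` flag; the draft model path and token count are placeholders.

```shell
# Hypothetical sketch; the EAGLE-3 draft model path and speculative token
# count are placeholders. --speculative-config takes a JSON string in vLLM.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "/path/to/eagle3-draft", "num_speculative_tokens": 3}'
```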
What's Changed
- [Bug Fix] Fix small bug in server-based profiling init by @jrplatin in #872
- [Disagg][Bugfix] add check for global devices in profiler start by @sixiang-google in #874
- [CI] Fix imports to catchup vLLM's recent update. by @hfan in #876
- [Kernel] Added a RPA V3 kernel variant optimized for head_dim=64 by @yaochengji in #875
- Update README.md by @bvrockwell in #880
- Remove convert_list_to_device_array to reduce latency between model forward pass by @Lumosis in #879
- fix mock device error in profile enabling by @sixiang-google in #878
- [RPA] Reduce VREG spill by optimize masking by @kyuyeunk in #818
- [Doc] Fixed the docker path for the quick start guide by @hosseinsarshar in #885
- Fix/docs links by @RobMulla in #873
- Light rewording of jax model development readme by @gpolovets1 in #871
- Revert "[CI] Fix imports to catchup vLLM's recent update." by @hfan in #887
- docs: Clarify support matrix messaging by @RobMulla in #886
- Docs rename reco page by @RobMulla in #888
- Update README.md by @bvrockwell in #897
- [Profiling] Pull Over the TPU Profiler from vLLM + add profiling docs by @jrplatin in #882
- [Misc] Fix various vLLM import issues by @jrplatin in #900
- Revert "[Misc] Fix various vLLM import issues" by @hfan in #902
- [Misc] Fix failing phased-based profiling test by @jrplatin in #905
- Added the docker login instructions by @hosseinsarshar in #891
- Unpin upstream vllm version by @jcyang43 in #904
- [Bug fix] Fix v7 HBM limit by @wenxindongwork in #903
- Enable spmd on lora by @vanbasten23 in #829
- Support --enforce-eager by @kyuyeunk in #907
- [CI] Fixes to catchup with vllm changes by @hfan in #912
- [Docker] Add V7X requirements and update Docker to accept option to build using it by @jrplatin in #916
- Fix the jax device ordering. by @wang2yn84 in #915
- Update the disagg multi host sh file to setup the disagg inference in… by @mrjunwan-lang in #922
- [Llama4/JAX] Refactor RoPE Scaling, QK Norm, and No-RoPE Layer Config Handling for Maverick by @sierraisland in #923
- [Bug fix + Qwix] Add JAX quantization YAMLs to WHL build + add fp8 quantization configs by @jrplatin in #929
- Enable multi-host P/D and adopt the vllm distributed executor changes by @mrjunwan-lang in #932
- fix the unit test to adopt vllm API changes by @mrjunwan-lang in #933
- [CI] Fix Qwen2.5 VL get_mrope_input_positions after vLLM change. by @kwang3939 in #934
- [Disagg] Use pathways resharding api to handle transfer by @sixiang-google in #935
- [Misc] Report TPU usage by @hfan in #925
- [CI] Use real vLLM ModelConfig object in init_device test by @hfan in #937
- update the ports to make the ports consistent in single host and multihost by @mrjunwan-lang in #938
- [Spec Decoding] Merge jitted helpers for eagle3 by @Lumosis in #920
- [GPT-OSS] JAX implementation of GPT-OSS by @bzgoogle in #861
- [Bug fixes] Update vLLM imports by @jrplatin in #947
- [Misc] Move numba installation to requirements.txt by @py4 in #948
- [Multi-host] Fix bugs in the deployment script by @Lumosis in #940
- Fix issues when running multiple LoRA tests on the v6e-8 machine. by @vanbasten23 in #926
- [Bug fixes] Fix a few more vLLM imports + Dockerfile typo by @jrplatin in #953
- Add the bgmv tests by @vanbasten23 in #942
- [MMLU] Add chat-template support for MMLU by @bzgoogle in #952
- [RPA] Add attention sink support to 64 dim variant of RPA kernel by @kyuyeunk in #958
- Revert "Add the bgmv tests" by @vanbasten23 in #963
- fix the vllm import issue for round_down by @mrjunwan-lang in #965
- Update docs to include installation guide with building from source. by @RobMulla in #949
- Reduce the host overhead for LoRA by @vanbasten23 in #930
- [GPT-OSS] uncomment sink related changes as the kernel_hd64.py was merged by @bzgoogle in #966
- Add bgmv test by @vanbasten23 in #964
- [CI] Skip build if only docs/icons changed by @boe20211 in #908
- [Spec Decoding] Fix precompilation by @Lumosis in #960
- fix the bug in kv transfer params is None by @mrjunwan-lang in #969
- [GPT-OSS] fix unstable sparse sum among different by @bzgoogle in #968
- fused Moe by @bythew3i in #973
- fix readme links to the docs by @RobMulla in #974
- [Feature] Code implementation of Async Scheduler by @cychiuak in #924
- [Misc] Fix observability config to prevent error from upstream by @py4 in #979
- add unit test for tpu_connector.py by @mrjunwan-lang in #980
- [Model] Add vision encoder and input embeddings merger warmup for Qwen2.5 VL model by @kwang3939 in #972
- Fix the test of multimodal manager by @kwang3939 in #986
- Fix the test of tpu_jax_runner by @kwang3939 in #989
- [Misc] Attempt to fix hash mismatch in CI if it's because of incomplete download by @py4 in #994
- [RPA] Update attention_sink to use prepare_inputs by @kyuyeunk in #993
- [Misc] Only run JAX unit tests and few e2e tests for each PR in CI. by @py4 in #995
- [Misc] Remove unused interfaces by @py4 in #990
- [Misc] Fix buildkite yaml format. by @py4 in #997
- Update README.md by @bvrockwell in #998
- Fix kv cache shape for head_dim=64 by @yaochengji in #976
- Add precommit hook for detecting missing init.py files by @jcyang43 in #1001
- Fix grid size calculation in qwen2.5-vl vision encoder warmup by @kwang3939 in #1004
- [Runner] Separate execute_model and sample_tokens to adapt upstream change. by @py4 in #1003
- [Misc] Change buildkite pipeline to run all steps but skip some through command by @py4 in https://github.com/vllm-project/tpu-i...