Release v0.9.2 · vllm-project/vllm

Highlights

This release contains 452 commits from 167 contributors (31 new!)

NOTE: This is the last version where V0 engine code and features stay intact. We highly recommend migrating to V1 engine.

Engine Core

Priority Scheduling is now implemented in V1 engine (#19057), embedding models in V1 (#16188), Mamba2 in V1 (#19327).
Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA graph now has a live capture progress bar makes debugging easier (#20301, #18581, #19617, #19501).
FlexAttention update – any head size, FP32 fallback (#20467, #19754).
Shared CachedRequestData objects and cached sampler‑ID stores deliver perf enhancements (#20232, #20291).

Model Support

New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, #20297), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1 V (#19331), Gemma‑3 (text‑only, #20134), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
Granite hybrid MoE configurations with shared experts are fully supported (#19652).

Large‑Scale Serving & Engine Improvements

Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, #19790, #19885).
Disaggregated serving enhancements: Avoid stranding blocks in P when aborted in D's waiting queue (#19223), let toy proxy handle /chat/completions (#19730)
Native xPyD P2P NCCL transport as a base case for native PD without external dependency (#18242, #20246).

Hardware & Performance

NVIDIA Blackwell
- SM120: CUTLASS W8A8/FP8 kernels and related tuning, added to Dockerfile (#17280, #19566, #20071, #19794)
- SM100: block‑scaled‑group GEMM, INT8/FP8 vectorization, deep‑GEMM kernels, activation‑chunking for MoE, and group‑size 64 for Machete (#19757, #19572, #19168, #19085, #20290, #20331).
Intel GPU (V1) backend with Flash‑Attention support (#19560).
AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, #19744, #18596).
- Split‑KV support landed in the unified Triton Attention kernel, boosting long‑context throughput (#19152).
- Full‑graph mode enabled in ROCm AITER MLA V1 decode path (#20254).
TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, #20235, #19620, #19813, #20048, #20339).
- Add models and features supporting matrix. (#20230)

Quantization

Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768).
Compressed‑Tensor NVFP4 (including MoE) + emulation; FP4 emulation removed on < SM100 devices (#19879, #19990, #19563).
Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives (#19395, #20331, #19233).
Bits‑and‑Bytes 0.45 + with improved double‑quant logic and AWQ quality (#20424, #20033, #19431, #20076).

API · CLI · Frontend

API Server: Eliminate api_key and x_request_id headers middleware overhead (#19946)
New OpenAI‑compatible endpoints: /v1/audio/translations & revamped /v1/audio/transcriptions (#19615, #20179, #19597).
Token‑level progress bar for LLM.beam_search and cached template‑resolution speed‑ups (#19301, #20065).
Image‑object support in llm.chat, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, #17177, #16862).
CLI QoL: better parsing for -O/--compilation-config, batch‑size‑sweep benchmarking, richer --help, faster startup (#20156, #20516, #20430, #19941).
Metrics: Deprecate metrics with gpu_ prefix for non GPU specific metrics (#18354), Export NaNs in logits to scheduler_stats if output is corrupted (#18777)

Platform & Deployment

No‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
Security hardening – runtime (cloud)pickle imports forbidden (#18018).
Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink supply‑chain surface (#18064, #19336).

What's Changed

[Docs] Note that alternative structured output backends are supported by @russellb in #19426
[ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
[Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
[BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
[New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
[BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
[CI] Disable failing GGUF model test by @mgoin in #19454
[Misc] Remove unused MultiModalHasher.hash_prompt_mm_data by @lgeiger in #19422
Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
Fix Typo in Documentation and Function Name by @leopardracer in #19442
[ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
[Kernel] Support deep_gemm for linear methods by @artetaout in #19085
[Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
[Doc] Fix quantization link titles by @DarkLight1337 in #19478
[Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
[Misc] Reduce warning message introduced in env_override by @houseroad in #19476
Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
Add cache to cuda get_device_capability by @mgoin in #19436
Fix some typo by @Ximingwang-09 in #19475
Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
[Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
[CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
[doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
[Misc] Fix misleading ROCm warning by @jeejeelee in #19486
[Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
[Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
[UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
[CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
[CI] change spell checker from codespell to typos by @andyxning in #18711
[BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
[Frontend] Improve error message in tool_choice validation by @22quinn in #19239
[BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
[BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
[AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
Fix typo by @2niuhe in #19525
[Security] Prevent new imports of (cloud)pickle by @russellb in #18018
[Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
[Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
[Quantization] Improve AWQ logic by @jeejeelee in #19431
[Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
[NixlConnector] Drop num_blocks check by @NickLucche in #19532
[Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
Fix TorchAOConfig skip layers by @mobicham in #19265
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
[doc] Make top navigation sticky by @reidliu41 in #19540
[Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets by @ekagra-ranjan in #18847
[Misc] Turn MOE_DP_CHUNK_SIZE into an env var by @varun-sundar-rabindranath in #19506
[Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant by @mgoin in #19452
[Doc] Unify structured outputs examples by @aarnphm in #18196
[V1] Resolve failed concurrent structred output requests by @russellb in #19565
Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" by @kouroshHakha in #19378
[BugFix] : Fix Batched DeepGemm Experts by @varun-sundar-rabindranath in #19515
[Bugfix] Fix EAGLE vocab embedding for multimodal target model by @zixi-qi in #19570
[Doc] uses absolute links for structured outputs by @aarnphm in #19582
[doc] fix incorrect link by @reidliu41 in #19586
[Misc] Correct broken docs link by @Zerohertz in #19553
[CPU] Refine default config for the CPU backend by @bigPYJ1151 in #19539
[Fix] bump mistral common to support magistral by @princepride in #19533
[Fix] The zip function in Python 3.9 does not have the strict argument by @princepride in #19549
use base version for version comparison by @BoyuanFeng in #19587
[torch.compile] reorganize the cache directory to support compiling multiple models by @youkaichao in #19064
[BugFix] Honor enable_caching in connector-delayed kvcache load case by @njhill in #19435
[Model] Fix minimax model cache & lm_head precision by @qscqesze in #19592
[Refactor] Remove unused variables in moe_permute_unpermute_kernel.inl by @yewentao256 in #19573
[doc][mkdocs] fix the duplicate Supported features sections in GPU docs by @reidliu41 in #19606
[CUDA] Enable full cudagraph for FlashMLA by @ProExpertProg in #18581
[Doc] Add troubleshooting section to k8s deployment by @annapendleton in #19377
[torch.compile] Use custom ops when use_inductor=False by @WoosukKwon in #19618
Adding "AMD: Multi-step Tests" to amdproduction. by @Concurrensee in #19508
[BugFix] Fix DP Coordinator incorrect debug log message by @njhill in #19624
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. by @sahelib25 in #18354
[Bugfix][1/n] Fix the speculative decoding test by setting the target dtype by @houseroad in #19633
[Misc] Modularize CLI Argument Parsing in Benchmark Scripts by @reidliu41 in #19593
[Bugfix] Fix auto dtype casting for BatchFeature by @Isotr0py in #19316
[Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization by @jiahanc in #19500
Only build CUTLASS MoE kernels on Hopper by @huydhn in #19648
[Bugfix] Don't attempt to use triton if no driver is active by @kzawora-intel in #19561
[Fix] Convert kv_transfer_config from dict to KVTransferConfig by @maobaolong in #19262
[Perf] Further tunings for SM100 FP8 CUTLASS kernel by @ilmarkov in #19566
[Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness by @houseroad in #19644
[Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check by @22quinn in #19339
[Benchmark] Refactor benchmark script for fp8 & int8 by @yewentao256 in #19627
Enable prefix caching with full cuda graphs by @WoosukKwon in #19617
[CI/Build] Fix torch nightly CI dependencies part 2 by @zou3519 in #19589
[Misc] Remove duplicate multiproc method setting for CPU platform by @Isotr0py in #19649
[MISC] Remove unused variableds in C++ by @houseroad in #19609
[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker by @quanliu1991 in #18957
[Misc][Frontend] passthrough bad_words by @f14-bertolotti in #19564
[Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config by @yeqcharlotte in #19660
[TPU] support attention head dim smaller than 128 by @yaochengji in #19620
[MISC] typo fix by @andyxning in #19672
[CI] Add mteb testing for rerank models by @noooop in #19344
[Docs] Move multiproc doc to v1 dir by @russellb in #19651
[Kernel] GGUF MMVQ kernel for multiple input vectors by @SzymonOzog in #18754
[BugFix] Don't catch BaseException when dumping execute_model errors by @njhill in #19626
[DOC] Add reasoning capability to vLLM streamlit code by @Navanit-git in #19557
[Feature]:Allow for Granite MoE Hybrid models with only shared experts. by @shawntan in #19652
[Bugfix] Fix TP inference for Flex attention backend by @Isotr0py in #19657
[MISC] bump huggingface_hub pkg to 0.33.0 by @andyxning in #19547
[Bugfix] fix missing 'finish_reason': null in streaming chat by @chaunceyjiang in #19662
[Kernels] Use empty for modular MoE workspaces by @bnellnm in #19667
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) by @qscqesze in #19677
[V1] Change return type on get_multimodal_embeddings() by @russellb in #19446
[Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 by @dsikka in #19563
[Fix] Fall back to Gloo when NCCL backend is unavailable by @conroy-cheers in #19641
[doc] add project flag to gcloud TPU command by @davidxia in #19664
[Wheel Size] Only build FA2 8.0+PTX by @LucasWilkinson in #19336
[Frontend] add chunking audio for > 30s audio by @nguyenhoangthuan99 in #19597
[DOC] fix doc typos by @diliu0349 in #19600
Fixes IMA for TP w/ flex-attention by @drisspg in #19712
[Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager by @quanliu1991 in #19686
[Doc] Add missing llava family multi-image examples by @Isotr0py in #19698
Add a doc on how to update PyTorch version by @huydhn in #19705
[Kernel] Add Split-KV Support to Unified Triton Attention Kernel by @jvlunteren in #19152
[doc][mkdocs] Add edit button to documentation by @reidliu41 in #19637
[doc] split "Other AI Accelerators" tabs by @davidxia in #19708
[V1][Kernel] Flashinfer HND KV cache layout by @NickLucche in #19280
[Mis] remove duplicate engine status checks by @googs1025 in #19647
[Bugfix] Update multimodel models mapping to fit new checkpoint after Transformers v4.52 by @Isotr0py in #19151
[Perf] Optimize moe_align_block_size CUDA kernel by @yewentao256 in #19572
Remove sm120 arch from sm100 cutlass kernel arch list by @mgoin in #19716
[Misc] Update lmcache connector with the latest connector apis by @YaoJiayi in #19441
[Bugfix] Fix faulty triton importing logic when using Ray for DP by @mgoin in #19734
[Feature][ROCm] Add full graph capture support for TritonAttentionBackend by @charlifu in #19158
[TPU] Update torch version to include paged attention kernel change by @Chenyaaang in #19706
[MISC] correct copy_blocks src_to_dists param type by @andyxning in #19696
[MISC] correct DeviceConfig device field static type analysis by @andyxning in #19699
[Misc] Add str for RequestStatus by @lk-chen in #19780
[V1] Add API docs for EncoderCacheManager by @russellb in #19294
[V1][P/D] An native implementation of xPyD based on P2P NCCL by @Abatom in #18242
[V1] Decouple GPU and TPU InputBatch by @afeldman-nm in #19778
[Minor] Zero-initialize attn output buffer by @WoosukKwon in #19784
[doc] fix the incorrect label by @reidliu41 in #19787
[Platform] Allow platform use V1 Engine by default by @wangxiyuan in #19792
[Qwen] Add tagging rule for Qwen related PRs by @houseroad in #19799
[Hardware][AMD] integrate aiter chunked prefill into vllm by @Zzz9990 in #18596
[Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully by @chaunceyjiang in #19725
[Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc by @russellb in #19808
[v1] Support mamba2 by @heheda12345 in #19327
docs: fix Slack bulletpoint in README by @nathan-weinberg in #19811
Disable "Forbid direct 'import triton'" check for vllm/triton_utils/importing.py in an extensible way by @afeldman-nm in #19783
[Core] Do not copy array during hashing by @lgeiger in #19484
[TPU] Update torch-xla version to include paged attention tuned block change by @QiliangCui in #19813
[Core] More fixes to MultiModalEmbeddings type handling by @russellb in #19715
[Multimodal] Use fast processor for Qwen2/2.5-VL by @WoosukKwon in #19789
[BugFix] Fix use_cudagraph=False by @zou3519 in #19612
[Frontend] Expose custom args in OpenAI APIs by @afeldman-nm in #16862
Fix FA2 fallback for Blackwell V1 by @mgoin in #19781
[Misc][ROCm] Enforce no unused variable in ROCm C++ files by @houseroad in #19796
[Quantization] Modify the logic of BNB double quantization by @jeejeelee in #19742
Support embedding models in V1 by @maxdebayser in #16188
[Bugfix] Fix the linter by @houseroad in #19826
[Bugfix] Add check_health to v1 async client. by @kouroshHakha in #19821
Mark invariant normalizer in Gemma as non-persistent by @yhtang in #19788
[ROCm] [AITER] [Bugfix] Patch for AITER commit 648764942e552a8bb5fe16026703716a81f05374 by @tjtanaa in #18990
[Misc] [ROCm] Prevent surplus tensor reshape by @zsolt-borbely-htec in #19803
raise exception for pin_lora by @andyxning in #19809
[Minor] Allow redirecting model path for HfRunner in test by @Isotr0py in #19795
Add xLAM tool parser support by @zuxin666 in #17148
[Frontend] Add optional token-level progress bar to LLM.beam_search by @NekoMimiUnagi in #19301
Fixing Chunked Prefill Test. by @Alexei-V-Ivanov-AMD in #19762
[Doc] Update V1 user guide for embedding models by @22quinn in #19842
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI by @bigPYJ1151 in #19838
[Core][Bugfix] Fix Online MM Beam Search by @alex-jw-brooks in #19688
[Frontend] early return chat format resolution when specified by @xzbdmw in #19735
[Benchmark][Bugfix] Fix Dataset Length Calculation by @robertgshaw2-redhat in #19868
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI by @Isotr0py in #19872
[CI][Neuron] Fail and exit on first error by @elaineyz in #19622
[Benchmark] Fix Value of type "SampleRequest" is not indexable by @b8zhong in #18032
[Chore]: qwen3-moe-type-hints-mistake by @Xerxes-cn in #19860
[Bugfix] Enable PP with AITER+V1 by @qli88 in #19822
[Bugfix][Ray] Set the cuda context eagerly in the ray worker by @kouroshHakha in #19583
[Misc] update cuda version by @reidliu41 in #19526
[Misc] refactor example - openai_transcription_client by @reidliu41 in #19851
[Kernel] correct cpu worker function parameter type by @andyxning in #19745
[Fix] import regex instead of re by @tdoublep in #19875
[Model] GPT2ForSequenceClassification model by @nie3e in #19663
[custom_op][vllm-plugin] update custom_op class to use op_registry by @xuechendi in #19164
Export NaNs in logits to scheduler_stats if output is corrupted by @vladmihailescu in #18777
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests by @bigPYJ1151 in #19901
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError by @andyxning in #19749
[Misc] Clean up useless code by @wangxiyuan in #19889
Fix: Check the type of params to be a Sequence not list. by @rabinadk1 in #19910
[Bugfix] Fix bnb 8bit model weights loading by @Isotr0py in #19917
[New model support]Support Tarsier2 by @princepride in #19887
[doc] add contact us in community by @reidliu41 in #19922
[Multimodal] Optimize Qwen2/2.5-VL startup time by @WoosukKwon in #19756
[Docs] Add GPT2ForSequenceClassification to supported models in docs by @nie3e in #19932
[Misc] add vllm_config in init by @andyxning in #19866
[MISC] add cpu_kvcache_space_bytes to CacheConfig by @andyxning in #19812
[Benchmark] fix request loss if "ping" is returned by @sywangyi in #19535
[CI/Build] Auto tag perf benchmarks related PRs by @22quinn in #19943
[doc] use snippets for contact us by @reidliu41 in #19944
[Misc] Update model-specific PR tagging by @ywang96 in #19949
[Misc] Simplify vllm bench cli subcommand implementation by @yeqcharlotte in #19948
[Chore] dedup logs by @aarnphm in #19955
[BugFix] Add an env to disable moe chunking to work around compile incompatibility by @yeqcharlotte in #19642
[Perf][CLI] Improve overall startup time by @aarnphm in #19941
[Core] feat: Implement Priority Scheduling in V1 Engine by @amitm02 in #19057
[Misc] Configurable timeout for execute_model RPC calls via env var by @jinqinn in #19544
Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor by @Flink-ddd in #19643
[doc] Fold long code blocks to improve readability by @reidliu41 in #19926
[P/D][NixlConnector] Support tp_size > num_kv_heads deployments by @NickLucche in #19691
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when all transfer done by @lk-chen in #19874
[Doc] Update V1 status for decoder-only embedding models by @Isotr0py in #19952
[doc] use MkDocs collapsible blocks - supplement by @reidliu41 in #19973
[Bugfix] Fix CI bitsandbytes failure by @jeejeelee in #19969
[doc] improve readability for long commands by @reidliu41 in #19920
[Docs] Fix syntax highlighting of shell commands by @lgeiger in #19870
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case by @tlrmchlsmth in #19885
[Bugfix][v1] Fix step pooler implementation and step pooling usage in v1 by @Isotr0py in #19956
[Misc] Add type alias ReqId and EngineId for better readability by @lk-chen in #19880
[Feature] Support sequence parallelism for static fp8 quantization by @cascade812 in #19181
[CI/Build] Push latest tag for cpu and neuron docker image by @22quinn in #19897
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend by @Jun-Howie in #19395
[Bugfix][Benchmark] Fix Marlin benchmark by @22quinn in #19929
[TPU] Fix tpu model runner test by @Chenyaaang in #19995
Update test case parameter to have the throughput above 8.0 by @QiliangCui in #19994
[Misc][Tools][Benchmark] Add profile to autotune script by @Chenyaaang in #19711
[doc] Fix broken link in the installation for CPU by @yankay in #19980
add some examples for other benchmark scripts by @reidliu41 in #19893
[PERF] Speedup of MRoPE prepare inputs by @vadiklyutiy in #19939
[Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 by @bigPYJ1151 in #20014
refactor example - qwen3_reranker by @reidliu41 in #19847
[Fix][V1] Remove --scheduling-policy oracle by @amitm02 in #20010
[Perf] Improve/Fix-regression for FA3 in High QPS regimes by @LucasWilkinson in #19463
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client. by @dtransposed in #19423
[BugFix] Fix multi-node offline data parallel by @njhill in #19937
[P/D] Asynchronously do _nixl_handshake by @lk-chen in #19836
[Feature] Integrate new deepgemm by @yewentao256 in #19820
[Easy] Remove submodule added in #19463 by @b8zhong in #20039
use .dev for version comparison with pytorch nightly release by @BoyuanFeng in #20031
cmake: Update vllm_flash_attn for vllm_kernels by @seemethere in #20032
[Llama4] Update attn_temperature_tuning by @b8zhong in #19997
Revert "[Feature] Integrate new deepgemm (#19820)" by @yewentao256 in #20049
Revert "Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor" by @Isotr0py in #20030
Move to a faster base64 implementation by @h-avsha in #19984
[Frontend] speed up import time of vllm.config by @davidxia in #18036
[Refactor] Remove duplicate ceil_div by @yewentao256 in #20023
[Feat][CLI] enforce-include-usage by @max-wittig in #19695
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. by @bnellnm in #19717
[Chore] debloat some initial logs by @aarnphm in #19438
[BugFix] Fix full-cuda-graph illegal memory access in FA3 by @LucasWilkinson in #20057
[doc] add reference link for Intel XPU by @reidliu41 in #20064
[Doc] Guide for Incremental Compilation Workflow by @mgoin in #19109
[V1][Speculative Decoding] Fix DeepSeek MTP by @cjackal in #20022
[Frontend] Add /v1/audio/translations OpenAI API endpoint by @NickLucche in #19615
[Quantization] Add compressed-tensors emulations support for NVFP4 by @dsikka in #19879
[Fix] Support cls pooling in ModernBertPooler by @lsz05 in #20067
static_scaled_fp8_quant should not run when scale.numel is not 1 by @eldarkurtic in #20076
[PD] let toy proxy handle /chat/completions by @lk-chen in #19730
[Misc] Add parallel state node_count function by @njhill in #20045
Fix the path to the testing script. by @QiliangCui in #20082
[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine by @izhuhaoran in #20062
[TPU][Bugfix] fix kv cache padding by @yaochengji in #20048
[P/D] Avoid stranding blocks in P when aborted in D's waiting queue by @njhill in #19223
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN by @Chenyaaang in #19919
[CI] Add SM120 to the Dockerfile by @mgoin in #19794
[Bugfix] Fix Mistral tool-parser regex for nested JSON by @mgoin in #20093
[PD] Skip tp_size exchange with rank0 by @NickLucche in #19413
[Benchmark][Bug] Fix multiple bugs in bench and add args to spec_decode offline by @ekagra-ranjan in #20083
[Bugfix] Allow CUDA_VISIBLE_DEVICES='' in Platform.device_id_to_physical_device_id by @eicherseiji in #18979
[Doc] Update docs for New Model Implementation by @DarkLight1337 in #20115
[Refactor] Remove unused library by @yewentao256 in #20099
[CPU] Fix torch version in x86 CPU backend by @bigPYJ1151 in #19258
[Misc] Use collapsible blocks for benchmark examples. by @reidliu41 in #20017
[Docs] Improve frameworks/helm.md by @windsonsea in #20113
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) by @tjtanaa in #19904
Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine" by @mgoin in #20128
[Bug Fix] Fix address/port already in use error for pplx test by @yewentao256 in #20094
[Doc] Automatically signed-off by PyCharm by @noooop in #20120
[Doc] Auto sign-off for VSCode by @DarkLight1337 in #20132
[Doc] Rename page titles by @DarkLight1337 in #20130
Spam folks if config.py changes by @tlrmchlsmth in #20131
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. by @jikunshang in #19560
[TPU] add kv cache update kernel by @yaochengji in #19928
[Refactor] Rename commnication utils by @yewentao256 in #20091
[Doc] correct LoRA capitalization by @kyolebu in #20135
[Feature] Expert Parallelism Load Balancer (EPLB) by @abmfy in #18343
[CI Failure] Fix OOM with test_oot_registration_embedding by @mgoin in #20144
[Quantization] Bump to use latest compressed-tensors by @dsikka in #20033
[Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler by @ilmarkov in #20071
[Bugfix] Build moe_data for both sm100 and sm90 by @mgoin in #20086
[Feature][Rocm] add quick all reduce for rocm by @lihaoyang-amd in #19744
[CI] Sync test dependency with test.in for torch nightly by @yangw-dev in #19632
[Fix] Fix gemma CI test failing on main by @tdoublep in #20124
[Model][1/N] Automatic conversion of CrossEncoding model by @noooop in #20012
[Perf][Frontend]: eliminate api_key and x_request_id headers middleware overhead by @Yazan-Sharaya in #19946
Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn by @xuechendi in #20143
Gemma3n (Text-only) by @robertgshaw2-redhat in #20134
[Bugfix] Fix flaky failure when getting DP ports by @mgoin in #20151
[Perf][Frontend] Cached resolution for resolving chat templates by @ilyal-cerebras in #20065
[Fix][ROCm] Remove unused variables to fix build error on GFX11/12 by @hyoon1 in #19891
[Fix][torch.compile] Enable custom ops by default when Inductor off by @ProExpertProg in #20102
[Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. by @bnellnm in #20152
[Bugfix] Fix some narrowing conversion warnings by @tlrmchlsmth in #20141
[CI/Build] Allow hermetic builds by @fabiendupont in #18064
[CI Fix] Pin tests/models/registry.py MiniMaxText01ForCausalLM to revision due to model changes by @mgoin in #20199
[Misc] Add type assertion of request_id for LLMEngine.add_request by @SHA-4096 in #19700
Fix num_token_padding support for static per-tensor scaled_fp8_quant by @mgoin in #20188
fix ci issue distributed 4 gpu test by @yewentao256 in #20204
[Bugfix] Properly reject requests with empty list guided_choice by @mgoin in #20195
[BugFix] Fix the incorrect func name in the comments. (config.py) by @1195343015 in #20185
[CI/Build] Add new CI job to validate Hybrid Models for every PR by @tdoublep in #20147
[Frontend] Generalize v1/audio/transcriptions endpoint by @NickLucche in #20179
[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution by @s3woz in #20137
[Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx by @yewentao256 in #20187
[CI Fix] Try fixing eagle e2e test OOM by reducing block allocation by @mgoin in #20213
[Quantization] Add compressed-tensors NVFP4 MoE Support by @dsikka in #19990
Fix cuda_archs_loose_intersection when handling sm_*a by @huydhn in #20207
[Model] support dots1 by @redmoe-moutain in #18254
[BUGFIX][DEEPSEEK][MODEL_LOAD] fix w13, w2 weight not initialized assert by @xuechendi in #20202
[Misc] Fix import by @WoosukKwon in #20233
[doc] Add Slack and Forum to the top navigation by @reidliu41 in #20208
[Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model by @noiji in #19598
[Bugfix] Fix processor initialization in transformers 4.53.0 by @Isotr0py in #20244
[Quantization] Improve BitsAndBytesModelLoader by @jeejeelee in #20242
[Docs] Fix 1-2-3 list in v1/prefix_caching.md by @windsonsea in #20243
[Bugfix] fix quark ptpc by @lihaoyang-amd in #20251
[Spec Decode] Refactor spec decoding into a separate function by @WoosukKwon in #20238
[Spec Decode] Clean up spec decode example by @WoosukKwon in #20240
[Optimization] Use Shared CachedRequestData Instance Across All Requests by @WoosukKwon in #20232
[Unit Test] Add unit test for deep gemm by @yewentao256 in #20090
[Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models by @kylesayrs in #20058
[Refactor] Remove useless pdb comment by @yewentao256 in #20266
[Bugfix][V1][P/D]Fix the issue of occasional garbled output for P2pNcclConnector by @Abatom in #20263
[CLI] Improve CLI arg parsing for -O/--compilation-config by @ProExpertProg in #20156
[Bugfix] Fix include prompt in stream response when echo=true by @fyuan1316 in #15233
[Misc] Fix spec decode example by @WoosukKwon in #20296
[Example] add one-click runnable example for P2P NCCL XpYd by @KuntaiDu in #20246
[CI][Intel Gaudi][vllm-Plugin]Add CI for hpu-plugin-v1-test by @xuechendi in #20196
[Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA by @chewong in #15897
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference by @sakogan in #18768
[V1] Only print cudagraph tqdm on rank 0 with is_global_first_rank by @mgoin in #19516
Fix numel() downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2 by @r-barnes in #17082
[Misc] add xgrammar for arm64 by @prashantgupta24 in #18359
Enable ZP Support for Machete by @czhu-cohere in #20268
[CPU] Update custom ops for the CPU backend by @bigPYJ1151 in #20255
[Bugfix] Fix deepep tests by @varun-sundar-rabindranath in #20288
[Misc] remove redundant char by @kebe7jun in #20287
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine by @tywuAMD in #19067
[doc] fix the incorrect logo in dark mode by @reidliu41 in #20289
[Perf] Validate @config in pre-commit instead of dynamically by @lionelvillard in #20200
[Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper by @kylesayrs in #20046
[Misc] Minor refactor of NIXL background handshake by @NickLucche in #20068
Add GLM4.1V model (Draft) by @zRzRzRzRzRzRzR in #19331
[Model]Add Tencent HunYuanMoEV1 Model Support by @aiyiwang2025 in #20114
[Misc] Minor refactoring for scheduler by @WoosukKwon in #20299
[Docs] Update transcriptions API to use openai client with stream=True by @NickLucche in #20271
[CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling by @WoosukKwon in #20301
[Frontend] Expand tools even if tool_choice="none" by @okdshin in #17177
[V1] [ROCm] Enable EP with AITER Fused MoE by @tjtanaa in #20270
[Optimization] Cache sampled token ids in model runner by @WoosukKwon in #20291
remove unused variables in marlin_template.h by @zhoutianzi666 in #20236
[Refactor] Refactor import utils by @yewentao256 in #20269
Enable group size 64 for Machete by @czhu-cohere in #20290
[Kernel][Bugfix] Fixup some warnings in nvfp4_blockwise_moe when CUDA < 12.8 by @tlrmchlsmth in #20324
[UT][intel GPU] use current_platform instead of device hardcode in v1 tests by @Liangliang-Ma in #20169
[Refactor] Remove duplicate find_free_port by @yewentao256 in #20333
[Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON by @yewentao256 in #20334
[Misc][Doc] Add missing comment for LLM by @draftbk in #20285
[FIX][Intel GPU]fix ipex flash_attn_varlen_func api missing parameter by @jikunshang in #20348
[Bugfix] Fix dynamic rotary embedding by @DarkLight1337 in #20343
fix[Docs]: link anchor is incorrect #20309 by @yyzxw in #20315
[Doc][TPU] Add models and features supporting matrix. by @QiliangCui in #20230
[TPU] kv cache update kernel supports dynamic grid by @yaochengji in #20235
[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. by @huachenheli in #20105
[Model][VLM] Support Keye-VL-8B-Preview by @Kwai-Keye in #20126
[Bugfix] Keye-VL compatibility with tok_kwargs (#20058) by @DarkLight1337 in #20353
[Docs] Fix indentations for 2-level items in deprecation_policy.md by @windsonsea in #20352
[Docs] Make TPU ref prettier in google_tpu.md by @windsonsea in #20356
[Model] Add Ernie4.5 and Ernie4.5MoE Model Support by @CSWYF3634076 in #20220
[Build/CI] Automatically tag DeepSeek related PRs by @houseroad in #20370
[NVIDIA] Support Cutlass w8a8 FP8 for Blackwell Geforce GPUs (sm120) by @kaln27 in #17280
[Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models by @huaqiangwang in #20322
[Model] Adds support for SlimMoE models Phi-tiny-MoE-instruct by @zichongli5 in #20286
Documentation update tool_calling: mapping back to function from response by @cronoik-inceptionai in #20373
[Kernels] MoE refactor by @bnellnm in #19636
[V1] LogitsProcessor programming model by @afeldman-nm in #16728
[Minor] Clean up incorrect comment in test by @njhill in #20382
[Misc] add handler HF_TOKEN is emptry string by @lengrongfu in #20369
[ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) by @vllmellm in #20254
[DP] Support external DP Load Balancer mode by @njhill in #19790
[Docs] Update EAGLE example by @NickLucche in #20375
[Bugfix] Fixes for FlashInfer's TORCH_CUDA_ARCH_LIST by @tlrmchlsmth in #20136
[BugFix] Fix DP headless mode arg validation by @njhill in #20398
Enable CPU nightly performance benchmark and its Markdown report by @louie-tsai in #18444
[Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py by @bnellnm in #20381
[Misc] Small: Fix video loader return type annotations. by @huachenheli in #20389
[Bugfix][CI/CD][CPU] Fix CPU CI tests by @bigPYJ1151 in #20383
[TPU] Add a case to cover RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 by @QiliangCui in #20385
[Feature] Support MiniMax-M1 function calls features by @qscqesze in #20297
[Tests] Update online DP tests to verify that requests are balanced by @njhill in #20157
[Misc] Add rules to label Speculative Decoding Related PRs by @draftbk in #20406
[doc] fix link by @reidliu41 in #20417
[Docs] Replace two list with tables in intel_gaudi.md by @windsonsea in #20414
[Core] Move multimodal placeholder from chat utils to model definition by @DarkLight1337 in #20355
[Kernel] refactor cpu worker v0 cache dtype by @andyxning in #20080
[CI/Build][CPU] Enable cross compilation in CPU release pipeline by @bigPYJ1151 in #20423
[Quantization] Bump to use latest bitsandbytes by @jeejeelee in #20424
[Model][2/N] Automatic conversion of CrossEncoding model by @noooop in #19978
[Misc] Automatically tag PRs to add new models by @Isotr0py in #20222
[Frontend] improve vllm bench <bench_type> --help display by @reidliu41 in #20430
[Bugfix] Fix flaky test_streaming_response test by @NickLucche in #20363
[Frontend] fix duplicate output for bench subcmd by @reidliu41 in #20446
[CI] Trimming some failing test groups from AMDPRODUCTION. by @Alexei-V-Ivanov-AMD in #20390
[Misc] Clean up InternVL family config registration by @Isotr0py in https://github.com/vllm-project/vllm/pull/19992
[Misc] adjust for ipv6 for mookcacke url parse by @andyxning in https://github.com/vllm-project/vllm/pull/20107
[Misc] Remove _maybe_ignore_quant_config from GLM4.1v by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/20432
[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864
[Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400
[Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510
Change warn_for_unimplemented_methods to debug by @mgoin in https://github.com/vllm-project/vllm/pull/20455
[Platform] Add custom default max tokens by @gmarinho2 in https://github.com/vllm-project/vllm/pull/18557
Add ignore consolidated file in mistral example code by @princepride in https://github.com/vllm-project/vllm/pull/20420
[Misc] small update by @reidliu41 in https://github.com/vllm-project/vllm/pull/20462
[Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in https://github.com/vllm-project/vllm/pull/20365
[Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in https://github.com/vllm-project/vllm/pull/20331
[Misc] Add SPDX-FileCopyrightText by @jeejeelee in https://github.com/vllm-project/vllm/pull/20428
Support Llama 4 for fused_marlin_moe by @mgoin in https://github.com/vllm-project/vllm/pull/20457
[Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in https://github.com/vllm-project/vllm/pull/18809
[Model][3/N] Automatic conversion of CrossEncoding model by @noooop in https://github.com/vllm-project/vllm/pull/20168
[Doc] Fix classification table in list of supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20489
[CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in https://github.com/vllm-project/vllm/pull/18193
[Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in https://github.com/vllm-project/vllm/pull/20395
Enable V1 for Hybrid SSM/Attention Models by @tdoublep in https://github.com/vllm-project/vllm/pull/20016
[feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in https://github.com/vllm-project/vllm/pull/19757
[CI Bugfix] Fix pre-commit failures on main by @mgoin in https://github.com/vllm-project/vllm/pull/20502
[Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in https://github.com/vllm-project/vllm/pull/20497
[Misc] Add security warning for development mode endpoints by @reidliu41 in https://github.com/vllm-project/vllm/pull/20508
[doc] small fix by @reidliu41 in https://github.com/vllm-project/vllm/pull/20506
[Misc] Remove the unused LoRA test code by @jeejeelee in https://github.com/vllm-project/vllm/pull/20494
Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in https://github.com/vllm-project/vllm/pull/20507
[v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in https://github.com/vllm-project/vllm/pull/19754
[Misc] Add logger.exception for TPU information collection failures by @reidliu41 in https://github.com/vllm-project/vllm/pull/20510
[Misc] remove unused import by @reidliu41 in https://github.com/vllm-project/vllm/pull/20517
test_attention compat with coming xformers change by @bottler in https://github.com/vllm-project/vllm/pull/20487
[BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/20491
[Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in https://github.com/vllm-project/vllm/pull/20509
[BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/20513
[TPU][Bugfix] fix the MoE OOM issue by @yaochengji in https://github.com/vllm-project/vllm/pull/20339
[Frontend] Support image object in llm.chat by @sfeng33 in https://github.com/vllm-project/vllm/pull/19635
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in https://github.com/vllm-project/vllm/pull/20516
[Misc] call the pre-defined func by @reidliu41 in https://github.com/vllm-project/vllm/pull/20518
[V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20412
[V1] Support any head size for FlexAttention backend by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20467
[BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20530
[Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20527

New Contributors

@py-andy-c made their first contribution in #19399
@2niuhe made their first contribution in #19394
@leopardracer made their first contribution in #19442
@artetaout made their first contribution in #19085
@runzhen made their first contribution in #19453
@strutive07 made their first contribution in #19522
@yewentao256 made their first contribution in #19233
@mobicham made their first contribution in #19265
@kouroshHakha made their first contribution in #19378
@BoyuanFeng made their first contribution in #19587
@sahelib25 made their first contribution in #18354
@jiahanc made their first contribution in #19500
@quanliu1991 made their first contribution in #18957
@f14-bertolotti made their first contribution in #19564
@Navanit-git made their first contribution in #19557
@nguyenhoangthuan99 made their first contribution in #19597
@diliu0349 made their first contribution in #19600
@Zzz9990 made their first contribution in #18596
@yhtang made their first contribution in #19788
@zsolt-borbely-htec made their first contribution in #19803
@zuxin666 made their first contribution in #17148
@NekoMimiUnagi made their first contribution in #19301
@xzbdmw made their first contribution in #19735
@Xerxes-cn made their first contribution in #19860
@nie3e made their first contribution in #19663
@vladmihailescu made their first contribution in #18777
@rabinadk1 made their first contribution in #19910
@amitm02 made their first contribution in #19057
@jinqinn made their first contribution in #19544
@Flink-ddd made their first contribution in #19643
@Jun-Howie made their first contribution in #19395
@seemethere made their first contribution in #20032
@h-avsha made their first contribution in #19984
@max-wittig made their first contribution in #19695
@lsz05 made their first contribution in #20067
@kyolebu made their first contribution in #20135
@lihaoyang-amd made their first contribution in #19744
@Yazan-Sharaya made their first contribution in #19946
@ilyal-cerebras made their first contribution in #20065
@fabiendupont made their first contribution in #18064
@SHA-4096 made their first contribution in #19700
@1195343015 made their first contribution in #20185
@redmoe-moutain made their first contribution in #18254
@noiji made their first contribution in #19598
@chewong made their first contribution in #15897
@sakogan made their first contribution in #18768
@czhu-cohere made their first contribution in #20268
@aiyiwang2025 made their first contribution in #20114
@okdshin made their first contribution in #17177
@zhoutianzi666 made their first contribution in #20236
@yyzxw made their first contribution in #20315
@Kwai-Keye made their first contribution in #20126
@CSWYF3634076 made their first contribution in #20220
@kaln27 made their first contribution in #17280
@huaqiangwang made their first contribution in #20322
@zichongli5 made their first contribution in #20286
@cronoik-inceptionai made their first contribution in #20373
@sangbumlikeagod made their first contribution in https://github.com/vllm-project/vllm/pull/18809
@djmmoss made their first contribution in https://github.com/vllm-project/vllm/pull/19757
@GuyStone made their first contribution in https://github.com/vllm-project/vllm/pull/20497
@bottler made their first contribution in https://github.com/vllm-project/vllm/pull/20487

Full Changelog: v0.9.1...v0.9.2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

v0.9.2

Highlights

Engine Core

Model Support

Large‑Scale Serving & Engine Improvements

Hardware & Performance

Quantization

API · CLI · Frontend

Platform & Deployment

What's Changed

New Contributors

Contributors

Uh oh!