Releases: vllm-project/vllm-omni
v0.12.0rc1
vLLM-Omni v0.12.0rc1 Pre-Release Notes
Highlights
This release features 187 commits from 45 contributors (34 new contributors)!
vLLM-Omni v0.12.0rc1 is a major RC milestone focused on maturing the diffusion stack, strengthening OpenAI-compatible serving, expanding omni-model coverage, and improving stability across platforms (GPU/NPU/ROCm). It also rebases on vLLM v0.12.0 for better alignment with upstream (#335).
Breaking / Notable Changes
- Unified diffusion stage naming & structure: cleaned up legacy `Diffusion*` paths and aligned on `Generation*`-style stages to reduce duplication (#211, #163).
- Safer serialization: switched `OmniSerializer` from `pickle` to MsgPack (#310).
- Dependency & packaging updates: e.g., bumped `diffusers` to 0.36.0 (#313) and refreshed Python/formatting baselines for the v0.12 release (#126).
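The move away from `pickle` matters because unpickling untrusted bytes can execute arbitrary code, while a data-only format like MsgPack can only reconstruct plain values. A minimal stdlib-only sketch of the difference (illustrative only; it uses JSON as the stand-in data-only format, and the actual `OmniSerializer` MsgPack wiring lives in #310):

```python
import json
import pickle

# Any class can smuggle a callable through pickle's __reduce__ hook:
# unpickling invokes it, so deserializing untrusted bytes runs that code.
class Exploit:
    def __reduce__(self):
        # The returned (callable, args) pair is invoked at pickle.loads() time.
        return (print, ("arbitrary code ran during unpickling!",))

payload = pickle.dumps(Exploit())
pickle.loads(payload)  # executes print() as a side effect of deserialization

# A data-only format can only rebuild plain values, never callables,
# which is why swapping pickle for MsgPack is "safer serialization".
safe_payload = json.dumps({"tokens": [1, 2, 3]})
assert json.loads(safe_payload) == {"tokens": [1, 2, 3]}
```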
Diffusion Engine: Architecture + Performance Upgrades
- Core refactors for extensibility: diffusion model registry refactored to reuse vLLM's `ModelRegistry` (#200), improved diffusion weight loading and stage abstraction (#157, #391).
- Acceleration & parallelism features:
- Cache-DiT with a unified cache backend interface (#250)
- TeaCache integration and registry refactors (#179, #304, #416)
- New/extended attention & parallelism options: Sage Attention (#243), Ulysses Sequence Parallelism (#189), Ring Attention (#273)
- torch.compile optimizations for DiT and RoPE kernels (#317)
Serving: Stronger OpenAI Compatibility & Online Readiness
- DALL·E-compatible image generation endpoint (`/v1/images/generations`) (#292), plus online serving fixes for image generation (#499).
- Added OpenAI create speech endpoint (#305).
- Per-request modality control (output modality selection) (#298) with API usage examples (#411).
- Early support for streaming output (#367), request abort (#486), and request-id propagation in responses (#301).
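A request to the DALL·E-compatible endpoint above could look like the following sketch. The field names (`model`, `prompt`, `n`, `size`) are assumed to mirror the OpenAI Images API, the base URL and model id are placeholders, and the payload is only constructed here, not sent; check the vLLM-Omni docs for the exact supported parameters.

```python
import json

# Hypothetical request against vLLM-Omni's DALL·E-compatible endpoint
# (/v1/images/generations, #292). Field names assume the OpenAI Images
# API shape and are not confirmed against the vLLM-Omni implementation.
BASE_URL = "http://localhost:8000"   # placeholder for a running server
request = {
    "model": "Qwen/Qwen-Image",      # assumed model id, for illustration
    "prompt": "a watercolor fox in a snowy forest",
    "n": 1,
    "size": "1024x1024",
}
body = json.dumps(request)

# To actually send it (requires a live server), something like:
#   curl -X POST $BASE_URL/v1/images/generations \
#        -H "Content-Type: application/json" -d "$body"
print(body)
```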
Omni Pipeline: Multi-stage Orchestration & Observability
- Improved inter-stage plumbing: customizable processing between stages and reduced coupling on `request_ids` in model forward paths (#458).
- Better observability and debugging: torch profiler across omni stages (#553), improved traceback reporting from background workers (#385), and logging refactors (#466).
Expanded Model Support (Selected)
- Qwen-Omni / Qwen-Image family:
- Qwen-Omni offline inference with local files (#167)
  - Qwen-Image-2512 support (#547)
  - Qwen-Image-Edit support (including multi-image input variants and newer releases: Qwen-Image-Edit, Qwen-Image-Edit-2509, Qwen-Image-Edit-2511) (#196, #330, #321)
- Qwen-Image-Layered model support (#381)
- Multiple fixes for Qwen2.5/Qwen3-Omni batching, examples, and OpenAI sampling parameter compatibility (#451, #450, #249)
- Diffusion / video ecosystem:
Platform & CI Coverage
- ROCm / AMD: documented ROCm setup (#144) and added ROCm Dockerfile + AMD CI (#280).
- NPU: added NPU CI workflow (#231) and expanded NPU support for key Omni models (e.g., Qwen3-Omni, Qwen-Image series) (#484, #463, #485), with ongoing cleanup of NPU-specific paths (#597).
- CI and packaging improvements: diffusion CI, wheel compilation, and broader UT/E2E coverage (#174, #288, #216, #168).
What's Changed
- [Misc] Update link in issue template by @ywang96 in #155
- [Misc] Qwen-Omni support offline inference with local files by @SamitHuang in #167
- [diffusion] z-image support by @ZJY0516 in #149
- [Doc] Fix wrong examples URLs by @wjcwjc77 in #166
- [Doc] Update Security Advisory link by @DarkLight1337 in #173
- [Doc] change `vllm_omni` to `vllm-omni` by @princepride in #177
- [Docs] Supplement volunteers and faq docs by @Gaohan123 in #182
- [Bugfix] Init early toch cuda by @knlnguyen1802 in #185
- [Docs] remove Ascend word to make docs general by @gcanlin in #190
- [Doc] Add installation part for pre built docker. by @congw729 in #141
- [CI] add diffusion ci by @ZJY0516 in #174
- [Misc] Add stage config for Qwen3-Omni-30B-A3B-Thinking by @linyueqian in #172
- [Doc]Fixed some spelling errors by @princepride in #199
- [Chore]: Refactor diffusion model registry to reuse vLLM's ModelRegistry by @Isotr0py in #200
- [FixBug]online serving fails for high-resolution videos by @princepride in #198
- [Engine] Remove Diffusion_XX which duplicates with Generation_XX by @tzhouam in #163
- [bugfix] qwen2.5 omni does not support chunked prefill now by @fake0fan in #193
- [NPU][Refactor] Rename Diffusion* to Generation* by @gcanlin in #211
- [Diffusion] Init Attention Backends and Selector for Diffusion by @ZJY0516 in #115
- [E2E] Add Qwen2.5-Omni model test with OmniRunner by @gcanlin in #168
- [Docs]Fix doc wrong link by @princepride in #223
- [Diffusion] Refactor diffusion models weights loading by @Isotr0py in #157
- Fix: Safe handling for multimodal_config to avoid 'NoneType' object h… by @qibaoyuan in #227
- [Bugfix] Fix ci bug for qwen2.5-omni by @Gaohan123 in #230
- [Core] add clean up method for diffusion engine by @ZJY0516 in #219
- [BugFix] Fix qwen3omni thinker batching. by @yinpeiqi in #207
- [Bugfix] Support passing vllm cli args to online serving in vLLM-Omni by @Gaohan123 in #206
- [Docs] Add basic usage examples for diffusion by @SamitHuang in #222
- [Model] Add Qwen-Image-Edit by @SamitHuang in #196
- update docs/readme.md and design folder by @hsliuustc0106 in #234
- [CI] Add Qwen3-omni offline UT by @R2-Y in #216
- [typo] fix doc readme by @hsliuustc0106 in #242
- [Model] Fuse Z-Image's `qkv_proj` and `gate_up_proj` by @Isotr0py in #226
- [bugfix] Fix QwenImageEditPipeline transformer init by @dougbtv in #245
- [Bugfix] Qwen2.5-omni Qwen3-omni online gradio.py example fix by @david6666666 in #249
- [Bugfix] fix issue251, qwen3 omni does not support chunked prefill now by @david6666666 in #256
- [Bugfix]multi-GPU tp scenarios, devices: "0,1" uses physical IDs instead of logical IDs by @david6666666 in #253
- [Bugfix] Remove debug code in AsyncOmni.del to fix resource leak by @princepride in #260
- update arch overview by @hsliuustc0106 in #258
- [Feature] Omni Connector + ray supported by @natureofnature in #215
- [Misc] fix stage config describe and yaml format by @david6666666 in #265
- update desgin docs by @hsliuustc0106 in #269
- [Model] Add Wan2.2 text-to-video support by @linyueqian in #202
- [Doc] [ROCm]: Document the steps to run vLLM Omni on ROCm by @tjtanaa in #144
- [Entrypoints] Minor optimization in the orchestrator's final stage determination logic by @RuixiangMa in #275
- [Doc] update offline inference doc and offline_inference examples by @david6666666 in #274
- [Feature] teacache integration by @LawJarp-A in #179
- [CI] Qwen3-Omni online test by @R2-Y in #257
- [Doc] fix docs Feature Design and Module Design by @hsliuustc0106 in #283
- [CI] Test ready label by @ywang96 in #299
- [Doc] fix offline inference and online serving describe by @david6666666 in #285
- [CI] Adjust folder by @congw729 in #300
- [Diffusion][Attention] sage attention backend by @ZJY0516 in https://github.com/vllm-project/vllm-omni/...
0.11.0rc1
Initial (Pre)-release of the vLLM-Omni Project
vLLM-Omni is a framework that extends vLLM with support for omni-modality model inference and serving. This pre-release is built on top of vllm==0.11.0, and the same version number is used for ease of tracking the dependency.
Please check out our documentation, and we welcome any feedback & contributions!
What's Changed
- init the folder directories for vLLM-omni by @hsliuustc0106 in #1
- init main repo structure and demonstrate the AR + DiT demo for omni models by @hsliuustc0106 in #6
- Add PR and issue templates from vLLM project by @hsliuustc0106 in #8
- update RFC template by @hsliuustc0106 in #9
- [Model]Add Qwen2.5-Omni model components by @tzhouam in #12
- [Engine] Add entrypoint class and stage management by @Gaohan123 in #13
- [Model] Add end2end example and documentation for qwen2.5-omni by @Gaohan123 in #14
- [Worker]Feat/ar gpu worker and model runner by @tzhouam in #15
- [Worker]Refactor GPU diffusion model runner and worker by @tzhouam in #16
- [Worker]Add OmniGPUModelRunner and OmniModelInputForGPU classes by @tzhouam in #17
- [Engine]Refactor output processing for multimodal capabilities in vLLM-omni by @tzhouam in #20
- [Inputs, Engine]Add Omni model components and input processing for hidden states support by @tzhouam in #18
- [Core]Add scheduling components for vLLM-omni by @tzhouam in #19
- add precommit by @Gaohan123 in #32
- End2end fixup by @tzhouam in #35
- Remove unused files and fix some bugs by @Gaohan123 in #36
- [bugfix] fix problem of installation by @Gaohan123 in #44
- [Bugfix] Further supplement installation guide by @Gaohan123 in #46
- [Bugfix] fix huggingface download problem for spk_dict.pt by @Gaohan123 in #47
- [Refractor] Dependency refractored to vLLM v0.11.0 by @Gaohan123 in #48
- [fix] Add support for loading model from a local path by @qibaoyuan in #52
- [Feature] Multi Request Stream for Sync Mode by @tzhouam in #51
- [Docs] Setup Documentation System and Re-organize Dependencies by @SamitHuang in #49
- [fix] adapt hidden state device for multi-hardware support by @qibaoyuan in #61
- [Feature] Support online inference by @Gaohan123 in #64
- CI Workflows. by @congw729 in #50
- [CI] fix ci and format existing code by @ZJY0516 in #71
- [CI] disable unnecessary ci and update pre-commit by @ZJY0516 in #80
- update readme for v0.11.0rc1 release by @hsliuustc0106 in #69
- [CI] Add script for building wheel. by @congw729 in #75
- [Feature] support multimodal inputs with multiple requests by @Gaohan123 in #76
- [Feature] Add Gradio Demo for Qwen2.5Omni by @SamitHuang in #60
- [CI] Buildkite setup by @ywang96 in #83
- [CI]Add version number. by @congw729 in #87
- [fix] Remove redundant parameter passing by @qibaoyuan in #90
- [Docs] optimize and supplement docs system by @Gaohan123 in #86
- [Diffusion] Qwen image support by @ZJY0516 in #82
- [fix] add scheduler.py by @ZJY0516 in #94
- Update gradio docs by @SamitHuang in #95
- [Bugfix] Fix removal of old logs when stats are enabled by @syedmba in #84
- [diffusion] add doc and fix qwen-image by @ZJY0516 in #96
- Simple test from PR#88 on Buildkite by @ywang96 in #93
- [Diffusion] Support Multi-image Generation and Add Web UI Demo for QwenImage by @SamitHuang in #97
- [Doc] Misc documentation polishing by @ywang96 in #98
- [Feature] add support for Qwen3-omni by @R2-Y in #55
- [Bugfix] Fix special token `nothink` naming. by @ywang96 in #107
- [Fix] fix qwen3-omni example by @ZJY0516 in #109
- [CI] Fix ci by @ZJY0516 in #110
- [Docs] Add qwen image missing doc in user guide by @SamitHuang in #111
- [Bug-fix] Fix Bugs in Qwen3/Qwen2.5 Omni Rebased Support by @tzhouam in #114
- [Bugfix] Remove mandatory flash-attn dependency and optimzie docs by @Gaohan123 in #113
- [Feat] Add NPU Backend support for vLLM-Omni by @gcanlin in #89
- [Feature] Support Gradio Demo for Qwen3-Omni by @SamitHuang in #116
- [Feat] Enable loading local Qwen-Image model by @gcanlin in #117
- [Bugfix] Fix bug of online serving for qwen2.5-omni by @Gaohan123 in #118
- [Doc] Fix readme typos by @hsliuustc0106 in #108
- [Feat] Rename AsyncOmniLLM -> AsyncOmni by @congw729 in #103
- [Bugfix] Fix Qwen-omni Online Inference Bug caused by check_stop and long sequence by @SamitHuang in #112
- [Fix] Resolve comments & update vLLM-Omni name usages. by @congw729 in #122
- Refresh supported models and address nits in doc by @Yikun in #119
- [Doc] Cleanup non-english comments by @ywang96 in #125
- [Doc] Fix outdated CONTRIBUTING link by @DarkLight1337 in #127
- [Misc] Update default stage config for qwen3-omni by @ywang96 in #124
- [Doc] Cleanup reference to deleted files by @ywang96 in #134
- [Doc] Fix arch pic reference by @ywang96 in #136
- [Bugfix] Fix redundant shm broadcast warnings in diffusion workers by @SamitHuang in #133
- Update README with vllm-omni blogpost link by @youkaichao in #137
- [Bugfix] Fix the curl bug of qwen3-omni and doc errors by @Gaohan123 in #135
- [Doc] Update developer & user channel by @ywang96 in #138
- [Misc][WIP] Support qwen-omni online inference with local video/audio/image path by @SamitHuang in #131
- [Doc] Logo by @ywang96 in #143
- [Misc] Misc description updates by @ywang96 in #146
- [Bugfix] Fix Qwen3-Omni gradio audio input bug by @SamitHuang in #147
- [Bugfix] Add Fake VllmConfig on NPU and add slicing/tiling args in Qwen-Image by @gcanlin in #145
- [Misc] Temporarily support downloading models from ModelScope by snapshot download by @MengqingCao in #132
- [Misc] update image reference for PyPI by @ywang96 in #150
New Contributors
- @tzhouam made their first contribution in #12
- @qibaoyuan made their first contribution in #52
- @SamitHuang made their first contribution in #49
- @ZJY0516 made their first contribution in #71
- @ywang96 made their first contribution in #83
- @syedmba made their first contribution in #84
- @R2-Y made their first contribution in #55
- @gcanlin made their first contribution in #89
- @Yikun made their fir...