
Releases: vllm-project/vllm-omni

v0.12.0rc1 (Pre-release)

05 Jan 11:17 · e7eeb54

vLLM-Omni v0.12.0rc1 Pre-Release Notes

Highlights

This release features 187 commits from 45 contributors (34 new contributors)!

vLLM-Omni v0.12.0rc1 is a major RC milestone focused on maturing the diffusion stack, strengthening OpenAI-compatible serving, expanding omni-model coverage, and improving stability across platforms (GPU/NPU/ROCm). It also rebases on vLLM v0.12.0 for better alignment with upstream (#335).

Breaking / Notable Changes

  • Unified diffusion stage naming & structure: cleaned up legacy Diffusion* paths and aligned on Generation*-style stages to reduce duplication (#211, #163).
  • Safer serialization: switched OmniSerializer from pickle to MsgPack (#310); a minimal sketch of the pattern follows this list.
  • Dependency & packaging updates: e.g., bumped diffusers to 0.36.0 (#313) and refreshed Python/formatting baselines for the v0.12 release (#126).

Diffusion Engine: Architecture + Performance Upgrades

  • Core refactors for extensibility: diffusion model registry refactored to reuse vLLM’s ModelRegistry (#200), improved diffusion weight loading and stage abstraction (#157, #391).

  • Acceleration & parallelism features:

    • Cache-DiT with a unified cache backend interface (#250)
    • TeaCache integration and registry refactors (#179, #304, #416)
    • New/extended attention & parallelism options: Sage Attention (#243), Ulysses Sequence Parallelism (#189), Ring Attention (#273)
    • torch.compile optimizations for DiT and RoPE kernels (#317); a generic torch.compile sketch follows this list
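The specifics of #317 live in vllm-omni's kernels, but the mechanism is stock torch.compile: fusing elementwise ops (such as the RoPE-style rotations in a real DiT) and cutting Python overhead. A minimal sketch, assuming a toy DiT-style block (TinyDiTBlock is an illustrative name, not a project class):

```python
import torch
import torch.nn as nn

class TinyDiTBlock(nn.Module):
    """Hypothetical stand-in for a DiT transformer block."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

block = TinyDiTBlock()
# First call triggers compilation; later calls reuse the optimized graph.
compiled_block = torch.compile(block)

x = torch.randn(2, 128, 256)
with torch.no_grad():
    out = compiled_block(x)
print(out.shape)  # torch.Size([2, 128, 256])
```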

Serving: Stronger OpenAI Compatibility & Online Readiness

  • DALL·E-compatible image generation endpoint (/v1/images/generations) (#292), plus online serving fixes for image generation (#499).
  • Added OpenAI create speech endpoint (#305).
  • Per-request modality control (output modality selection) (#298) with API usage examples (#411).
  • Early support for streaming output (#367), request abort (#486), and request-id propagation in responses (#301); a request sketch using the OpenAI client follows this list.
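Because these endpoints follow the OpenAI API schema, the standard openai Python client can target a vllm-omni server. A hedged sketch; the base_url, api_key, and model ids below are placeholder assumptions, not values documented by this release:

```python
from openai import OpenAI

# Placeholder endpoint and credentials; point these at your vllm-omni server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# DALL·E-compatible image generation (/v1/images/generations, #292).
image = client.images.generate(
    model="Qwen/Qwen-Image",  # hypothetical model id
    prompt="a watercolor fox in a bamboo forest",
    n=1,
    size="1024x1024",
)
print(image.data[0].url or image.data[0].b64_json[:32])

# OpenAI create-speech endpoint (/v1/audio/speech, #305).
speech = client.audio.speech.create(
    model="Qwen/Qwen3-Omni",  # hypothetical model id
    voice="alloy",
    input="Hello from vLLM-Omni.",
)
speech.write_to_file("hello.mp3")
```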

Omni Pipeline: Multi-stage Orchestration & Observability

  • Improved inter-stage plumbing: customizable processing between stages and reduced coupling on request_ids in model forward paths (#458).
  • Better observability and debugging: torch profiler support across omni stages (#553, sketched below), improved traceback reporting from background workers (#385), and logging refactors (#466).
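How #553 threads the profiler through the omni stages is internal to the project, but the underlying torch.profiler API is standard PyTorch. A minimal sketch of per-stage labeling, with run_stage as a hypothetical stand-in for one stage's forward pass:

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

def run_stage(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for one omni stage's forward pass.
    return torch.relu(x @ x.T)

x = torch.randn(512, 512)
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    # record_function labels appear as named spans in the trace,
    # roughly how per-stage profiling can be organized.
    with record_function("stage:generation"):
        run_stage(x)
    with record_function("stage:diffusion"):
        run_stage(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
prof.export_chrome_trace("omni_stages_trace.json")  # view in chrome://tracing
```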

Expanded Model Support (Selected)

  • Qwen-Omni / Qwen-Image family:

    • Qwen-Omni offline inference with local files (#167)
    • Qwen-Image-2512 support (#547)
    • Qwen-Image-Edit support, including multi-image input variants and newer releases (Qwen-Image-Edit, Qwen-Image-Edit-2509, Qwen-Image-Edit-2511) (#196, #330, #321)
    • Qwen-Image-Layered model support (#381)
    • Multiple fixes for Qwen2.5/Qwen3-Omni batching, examples, and OpenAI sampling parameter compatibility (#451, #450, #249)
  • Diffusion / video ecosystem:

    • Z-Image support and kernel fusions (#149, #226)
    • Stable Diffusion 3 support (#439)
    • Wan2.2 T2V plus I2V/TI2V pipelines (#202, #329)
    • LongCat-Image and LongCat-Image-Edit support (#291, #392)
    • Ovis Image model addition (#263)
    • Bagel (diffusion-only) and image-edit support (#319, #588)

Platform & CI Coverage

  • ROCm / AMD: documented ROCm setup (#144) and added ROCm Dockerfile + AMD CI (#280).
  • NPU: added NPU CI workflow (#231) and expanded NPU support for key Omni models (e.g., Qwen3-Omni, Qwen-Image series) (#484, #463, #485), with ongoing cleanup of NPU-specific paths (#597).
  • CI and packaging improvements: diffusion CI, wheel compilation, and broader UT/E2E coverage (#174, #288, #216, #168).


0.11.0rc1 (Pre-release)

01 Dec 18:14 · 9fe730a

Initial (Pre)-release of the vLLM-Omni Project

vLLM-Omni is a framework that extends vLLM to support omni-modality model inference and serving. This pre-release is built on top of vllm==0.11.0, and the same version number is used to make the dependency easy to track.

Please check out our documentation; we welcome any feedback & contributions!

What's Changed

  • init the folder directories for vLLM-omni by @hsliuustc0106 in #1
  • init main repo structure and demonstrate the AR + DiT demo for omni models by @hsliuustc0106 in #6
  • Add PR and issue templates from vLLM project by @hsliuustc0106 in #8
  • update RFC template by @hsliuustc0106 in #9
  • [Model]Add Qwen2.5-Omni model components by @tzhouam in #12
  • [Engine] Add entrypoint class and stage management by @Gaohan123 in #13
  • [Model] Add end2end example and documentation for qwen2.5-omni by @Gaohan123 in #14
  • [Worker]Feat/ar gpu worker and model runner by @tzhouam in #15
  • [Worker]Refactor GPU diffusion model runner and worker by @tzhouam in #16
  • [Worker]Add OmniGPUModelRunner and OmniModelInputForGPU classes by @tzhouam in #17
  • [Engine]Refactor output processing for multimodal capabilities in vLLM-omni by @tzhouam in #20
  • [Inputs, Engine]Add Omni model components and input processing for hidden states support by @tzhouam in #18
  • [Core]Add scheduling components for vLLM-omni by @tzhouam in #19
  • add precommit by @Gaohan123 in #32
  • End2end fixup by @tzhouam in #35
  • Remove unused files and fix some bugs by @Gaohan123 in #36
  • [bugfix] fix problem of installation by @Gaohan123 in #44
  • [Bugfix] Further supplement installation guide by @Gaohan123 in #46
  • [Bugfix] fix huggingface download problem for spk_dict.pt by @Gaohan123 in #47
  • [Refactor] Dependency refactored to vLLM v0.11.0 by @Gaohan123 in #48
  • [fix] Add support for loading model from a local path by @qibaoyuan in #52
  • [Feature] Multi Request Stream for Sync Mode by @tzhouam in #51
  • [Docs] Setup Documentation System and Re-organize Dependencies by @SamitHuang in #49
  • [fix] adapt hidden state device for multi-hardware support by @qibaoyuan in #61
  • [Feature] Support online inference by @Gaohan123 in #64
  • CI Workflows. by @congw729 in #50
  • [CI] fix ci and format existing code by @ZJY0516 in #71
  • [CI] disable unnecessary ci and update pre-commit by @ZJY0516 in #80
  • update readme for v0.11.0rc1 release by @hsliuustc0106 in #69
  • [CI] Add script for building wheel. by @congw729 in #75
  • [Feature] support multimodal inputs with multiple requests by @Gaohan123 in #76
  • [Feature] Add Gradio Demo for Qwen2.5Omni by @SamitHuang in #60
  • [CI] Buildkite setup by @ywang96 in #83
  • [CI]Add version number. by @congw729 in #87
  • [fix] Remove redundant parameter passing by @qibaoyuan in #90
  • [Docs] optimize and supplement docs system by @Gaohan123 in #86
  • [Diffusion] Qwen image support by @ZJY0516 in #82
  • [fix] add scheduler.py by @ZJY0516 in #94
  • Update gradio docs by @SamitHuang in #95
  • [Bugfix] Fix removal of old logs when stats are enabled by @syedmba in #84
  • [diffusion] add doc and fix qwen-image by @ZJY0516 in #96
  • Simple test from PR#88 on Buildkite by @ywang96 in #93
  • [Diffusion] Support Multi-image Generation and Add Web UI Demo for QwenImage by @SamitHuang in #97
  • [Doc] Misc documentation polishing by @ywang96 in #98
  • [Feature] add support for Qwen3-omni by @R2-Y in #55
  • [Bugfix] Fix special token nothink naming. by @ywang96 in #107
  • [Fix] fix qwen3-omni example by @ZJY0516 in #109
  • [CI] Fix ci by @ZJY0516 in #110
  • [Docs] Add qwen image missing doc in user guide by @SamitHuang in #111
  • [Bug-fix] Fix Bugs in Qwen3/Qwen2.5 Omni Rebased Support by @tzhouam in #114
  • [Bugfix] Remove mandatory flash-attn dependency and optimize docs by @Gaohan123 in #113
  • [Feat] Add NPU Backend support for vLLM-Omni by @gcanlin in #89
  • [Feature] Support Gradio Demo for Qwen3-Omni by @SamitHuang in #116
  • [Feat] Enable loading local Qwen-Image model by @gcanlin in #117
  • [Bugfix] Fix bug of online serving for qwen2.5-omni by @Gaohan123 in #118
  • [Doc] Fix readme typos by @hsliuustc0106 in #108
  • [Feat] Rename AsyncOmniLLM -> AsyncOmni by @congw729 in #103
  • [Bugfix] Fix Qwen-omni Online Inference Bug caused by check_stop and long sequence by @SamitHuang in #112
  • [Fix] Resolve comments & update vLLM-Omni name usages. by @congw729 in #122
  • Refresh supported models and address nits in doc by @Yikun in #119
  • [Doc] Cleanup non-english comments by @ywang96 in #125
  • [Doc] Fix outdated CONTRIBUTING link by @DarkLight1337 in #127
  • [Misc] Update default stage config for qwen3-omni by @ywang96 in #124
  • [Doc] Cleanup reference to deleted files by @ywang96 in #134
  • [Doc] Fix arch pic reference by @ywang96 in #136
  • [Bugfix] Fix redundant shm broadcast warnings in diffusion workers by @SamitHuang in #133
  • Update README with vllm-omni blogpost link by @youkaichao in #137
  • [Bugfix] Fix the curl bug of qwen3-omni and doc errors by @Gaohan123 in #135
  • [Doc] Update developer & user channel by @ywang96 in #138
  • [Misc][WIP] Support qwen-omni online inference with local video/audio/image path by @SamitHuang in #131
  • [Doc] Logo by @ywang96 in #143
  • [Misc] Misc description updates by @ywang96 in #146
  • [Bugfix] Fix Qwen3-Omni gradio audio input bug by @SamitHuang in #147
  • [Bugfix] Add Fake VllmConfig on NPU and add slicing/tiling args in Qwen-Image by @gcanlin in #145
  • [Misc] Temporarily support downloading models from ModelScope by snapshot download by @MengqingCao in #132
  • [Misc] update image reference for PyPI by @ywang96 in #150
