
Releases: sgl-project/sglang

v0.5.9

24 Feb 01:14
bbe9c7e


Highlights

  • LoRA Weight Loading Overlap with Computation: Overlap LoRA weight loading with computation during inference, reducing TTFT by ~78% and TPOT by ~34.88% on large adapters: #15512

  • TRT-LLM NSA Kernel Integration for DeepSeek V3.2: Integrate the TRT-LLM NSA kernels for Native Sparse Attention, boosting DeepSeek V3.2 performance by 3x-5x on Blackwell platforms when trtllm is selected for both --nsa-prefill-backend and --nsa-decode-backend (with a minor accuracy drop): #16758, #17662, #18389

  • Flashinfer All-to-All MoE Dispatcher: Add the Flashinfer all-to-all MoE dispatcher for efficient expert parallelism communication, enabling optimized routing in MoE models: #14668

  • FA4 (FP4 Attention) Support for Multimodal Encoder: Introduce FP4 attention backend and variable-length attention function for multimodal encoders, enabling lower-precision inference for vision-language models: #13539

  • Anthropic Compatible API Endpoint: Add native Anthropic API compatibility to SGLang, allowing direct integration with tools and clients built for the Anthropic API format: #18630

  • SGLang-Diffusion Advanced Optimizations: Production-ready improvements including token-level sequence sharding, parallel VAE decoding, fused kernels, Nunchaku and FP8 support, and multiple new models in the ComfyUI plugin: blog

  • Spec V2 Critical Bug Fix: Fix an out-of-index bug caused by torch garbage collection in speculative decoding v2, improving the reliability of speculative verification: #18958

  • Deploying DeepSeek on GB300 NVL72: Optimization work for long-context inference using prefill-decode disaggregation and other SGLang features on NVIDIA's latest GB300 platform: blog

  • Bump AITER Version to 0.1.10.post3: Support FP8 prefill, decode, and KV cache

  • Commit-to-Version Lookup in docs.sglang.io: Easily find the earliest official version that includes a given PR or commit, streamlining release tracking for users and developers: #18450, link
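
Among the highlights above, the new Anthropic-compatible endpoint means clients can send requests in the Anthropic Messages API shape. The sketch below only constructs such a payload; the exact route SGLang serves and the fields it accepts are assumptions mirroring Anthropic's public schema, not SGLang's documented surface.

```python
import json

def build_anthropic_request(model: str, user_text: str, max_tokens: int = 256) -> dict:
    """Construct an Anthropic Messages API-style payload (illustrative only)."""
    return {
        "model": model,
        "max_tokens": max_tokens,  # required by the Anthropic schema
        "messages": [
            {"role": "user", "content": user_text},
        ],
    }

# Hypothetical model name; a real deployment would use its served model id.
payload = build_anthropic_request("deepseek-v3.2", "Hello!")
print(json.dumps(payload, indent=2))
```

A client built for Anthropic's API would POST this body unchanged to the SGLang server instead of api.anthropic.com.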

New Model Support

SGLang-Diffusion

  • Support multiple new models in ComfyUI Plugin
  • Parallel Folding and Parallel VAE Decoding for faster image/video generation
  • Nunchaku and FP8 support for diffusion models
  • Sequence Sharding (token-level) replacing Frame Sharding for improved efficiency
  • LTX-2 support: #17495, #17496
  • MOVA model support: #17704
  • Cache-DiT optimizations and fused kernel improvements
  • Numerous bug fixes and refactors across the diffusion pipeline

Performance

  • Integrate TRT-LLM NSA kernels with up to 3-5x speedup on Blackwell: #16758, #17662, #18389
  • LoRA weight loading overlap reducing TTFT by ~78%: #15512
  • Flashinfer all-to-all MoE dispatcher: #14668
  • FA4 for multimodal encoder: #13539
  • Optimize GDN decode for Qwen3 Next: #17094
  • Tune fused MoE kernels for Llama-4-Scout, MiniMax M2: #17891, #18851, #18833
  • Symmetric memory pre-allocation to avoid fragmentation: #17089
  • Optimize fused_moe triton kernel TMA: #18782
  • Fused triton kernel for Ernie4.5-VL rotary embedding: #18856
  • Support MxINT4 Flashinfer TRT-LLM MoE GEMM: #16892
  • AITER bias MoE support for GPT-OSS MxFP4: #17735

Prefill-Decode Disaggregation

  • Support KV transfer with MORI-IO: #14626
  • Mooncake intra-node NVLink KV transfer: #17866
  • Improve KV offset calculation for MHA model with different TP size: #18163
  • Document SGLANG_MOONCAKE_CUSTOM_MEM_POOL: #18259

Diffusion LLM (dLLM)

  • Remove cuda graph batch size limitation: #17458
  • JointThreshold algorithm for joint M2T and T2T decoding: #18171
  • Basic dLLM scheduling strategy and implementation: #17484

Speculative Decoding

  • Fix out-of-index bug caused by torch garbage collection in Spec V2: #18958
  • Move forward timeout before verify to fix Eagle v1 filter mismatch: #18760

Dependencies

  • Flashinfer updated to 0.6.3: #17700
  • AITER updated to 0.1.10.post3: #18741
  • Mooncake transfer engine updated to 0.3.9: #18316

AMD Hardware

  • AITER updated to v0.1.10.post3 with FP8 Prefill, FP8 Decode, FP8 KV Cache support
  • ROCm 7 standardization and ROCm 6.3 deprecation: #17785
  • Kimi K2.5 Day 0 ROCm support: #17863
  • FP8 prefill attention kernel integration: #18528
  • Two-batch overlapping for MORI EP: #17953
  • DeepSeek V3.2 and Kimi-K2 nightly CI tests: #17523

NPU/Ascend

  • Support for MiniCPM3-4B: #16866
  • Qwen 3.5 support on Ascend: #18544
  • Accuracy improvements for StableLM-2: #17470
  • Bug fixes for DeepSeek V3.2 and DeepSeek-VL2: #17007

CPU Backend

  • Optimize Qwen3-Next model on CPU: #12525
  • Optimize flash_attn_varlen_func: #15708
  • Add INT4 kernels for CPU: #8226

Kernel Slimming

  • Migrate GPTQ-Marlin repack kernel to JIT: #18543
  • Migrate AWQ Marlin repack kernel to JIT: #18949

Documentation

  • Add RL documentation: #17663
  • Update torch compile description: #17819
  • Refine spec decode docs for SpecV2/STANDALONE/NGRAM: #18321
  • Consolidate diffusion documentation: #18095

What's Changed


v0.5.8

23 Jan 22:09


Highlights

New Model Support

DeepSeek V3.2 Optimization

  • Context Parallelism Optimization with support for fused MoE, multi-batch, and FP8 KV cache: #13959

Flash Attention 4

  • Support for Flash Attention 4 decoding kernels: #16034

SGLang-Diffusion

  • Run sglang-diffusion with diffusers backend
  • Features: Multi-LoRA inference, SLA attention backends, warmup switch in CLI, ComfyUI Plugin
  • Performance improvements for all models

Dependencies

  • sgl-kernel updated to 0.3.21: #17075
  • Cutedsl updated to 4.3.4: #17075
  • Added dependencies for tvm-ffi and quack-kernels: #17075
  • Flashinfer updated to 0.6.1: #15551
  • Mooncake transfer engine updated to 0.3.8.post1: #16792

Security

  • Fixed urllib and gpgv vulnerabilities: #17439

What's Changed


Release Gateway-v0.3.1

09 Jan 06:18
7460240


🚀 SMG v0.3.1 Released!

We're excited to announce SMG v0.3.1 – a game-changing release with 10-12x performance improvement and 99% memory reduction in cache-aware routing, plus enterprise-grade security!

🌲 Radix Tree / Cache-Aware Routing: 10-12x Faster + 99% Less Memory ⚡

Complete optimization overhaul of our cache-aware routing engine with stunning performance and memory gains:

Performance Improvements

  • Our cache-aware routing can now handle over 216,000 cache insertions per second (up from 18,900), with latency dropping from 52.9 microseconds to just 4.6 microseconds per operation.
  • For prefix matching across 10,000 tree entries, throughput jumped from 41,000 to 124,000 operations per second.
  • Under concurrent load with 64 threads, the system processes 474,000 operations per second – a 7.9x improvement over the previous 59,000 ops/sec.
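
These throughput and latency figures are mutually consistent, since per-operation latency is the reciprocal of single-threaded throughput; a quick back-of-the-envelope check:

```python
# Reported: 216,000 insertions/sec at 4.6 us/op; 59,000 -> 474,000 ops/sec at 64 threads.
ops_per_sec = 216_000
latency_us = 1e6 / ops_per_sec          # microseconds per operation
concurrent_speedup = 474_000 / 59_000

print(f"{latency_us:.1f} us/op")        # ~4.6 us, matching the reported latency
print(f"{concurrent_speedup:.1f}x")     # ~8.0x, in line with the stated 7.9x
```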

Data Processing

  • INSERT operations now process 440 MB/s (up from 38 MB/s).
  • MATCH operations handle 253 MB/s (up from 83 MB/s).

Memory Improvements

  • ~99% memory reduction per tree node:
  • Before: ~180 KB per node (DashMap default config on 170-core machines)
  • After: ~1.4 KB per node

Result: Deploy 100x more cache entries in the same memory footprint. For a typical deployment with 10,000 cached prefixes, memory usage drops from ~1.8 GB to just ~14 MB – freeing up resources for actual inference workloads.

Impact: Cache-aware routing is now 10-12x faster and uses 99% less memory, which is critical for large-scale multi-tenant deployments.

🔐 JWT/OIDC Authentication

Production-grade security for control plane APIs with native support for industry-standard OIDC providers: Google, Azure, Oracle, GitHub, and more. Protect tokenizer management, worker registration, and admin endpoints with enterprise authentication infrastructure you already use. Critical for enterprise deployments – seamlessly integrate SMG into your existing identity and access management systems.

📊 Classification API Support

Native support for classification workloads! Deploy and serve classification models alongside your existing inference fleet with dedicated pipeline stages and protocol types.

✨ Additional Features

  • PrefixHash Load Balancing: New KV cache-aware load balancing policy using prefix hashing for improved cache hit rates in multi-tenant environments.
  • Nemotron Nano V3 Parser
  • In-Flight Request Age Metrics: Track request age in-flight for better observability and SLA monitoring.
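
The idea behind PrefixHash load balancing can be sketched as follows: requests sharing a prompt prefix hash to the same worker, so that worker's KV cache is likely to already hold the shared prefix. The prefix length and hash function below are illustrative assumptions, not SMG's actual Rust implementation.

```python
import hashlib

def pick_worker(prompt: str, workers: list[str], prefix_len: int = 64) -> str:
    # Hash only the leading prefix so requests sharing it co-locate on one worker.
    digest = hashlib.sha256(prompt[:prefix_len].encode()).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]

workers = ["worker-0", "worker-1", "worker-2"]
system = "You are a helpful assistant. " * 4   # shared system prompt, > 64 chars
w1 = pick_worker(system + "Summarize this.", workers)
w2 = pick_worker(system + "Translate this.", workers)
print(w1 == w2)  # True: identical 64-char prefix routes to the same worker
```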

🛠️ Enhancements

Developer Experience:

  • Organized CLI arguments into logical groups
  • Shortened logging targets (sgl_model_gateway → smg)
  • Comprehensive embedding correctness tests against HuggingFace
  • Auto-generate protobuf files during wheel build

Reliability:

  • Fix IGW routing for external OpenAI workers
  • Work around orphan process problems
  • Prevent potential hangs in subprocess handling
  • Use 504 Gateway Timeout for upstream timeouts (proper HTTP semantics)

🐛 Bug Fixes

  • Fixed embedding worker health check crash
  • Fixed tokenizer to match transformers special token handling
  • Fixed age bucket rendering issue
  • Fixed non-PD router HTTP header whitelist
  • Fixed duplicate classify prefix in response ID
  • Fixed WASM test errors on machines with many cores

⚡ Built for speed. Engineered for scale. Production-proven.

Gateway Changes (120 commits)


v0.5.7

01 Jan 10:01
232982a


Highlights

What's Changed


Release Gateway-v0.3.0

24 Dec 22:00
5454d2a


🚀 SGLang Model Gateway v0.3.0 Released!

We're thrilled to announce SGLang Model Gateway v0.3.0 – a major release with powerful new features, architectural improvements, and important breaking changes!

⚠️ Breaking Changes

📊 Metrics Architecture Redesigned

Complete overhaul with new 6-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics with unified error codes.
Action Required: Update your Prometheus dashboards and alerting rules. Metric names and structure have changed.

🔧 UUID-Based Worker Resource Management

Workers are now identified by UUIDs instead of endpoints for cleaner resource management.
Action Required: Update any tooling or scripts that interact with the worker API.

✨ New Features

🌐 Unified Inference Gateway Mode (IGW)

Single gateway, entire fleet. IGW now supports ALL router types in a single deployment with Kubernetes service discovery:

  • gRPC router (PD and regular mode)
  • HTTP router (PD and regular mode)
  • OpenAI router

Auto-enabled with service discovery. Deploy once, route everything – handle all traffic patterns across your entire inference fleet from a single gateway instance.

🔤 Tokenize/Detokenize HTTP Endpoints

  • Direct HTTP endpoints for tokenization operations
  • Dynamic tokenizer control plane: add, list, get, and remove tokenizers on-the-fly
  • TokenizerRegistry for efficient dynamic loading
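
The control-plane operations above can be pictured with a minimal registry sketch; the class and method names here are illustrative assumptions, not the gateway's actual implementation.

```python
class TokenizerRegistry:
    """Toy dynamic registry supporting add, list, get, and remove."""

    def __init__(self):
        self._tokenizers = {}

    def add(self, name, tokenizer):
        self._tokenizers[name] = tokenizer

    def list(self):
        return sorted(self._tokenizers)

    def get(self, name):
        return self._tokenizers.get(name)

    def remove(self, name):
        return self._tokenizers.pop(name, None)

reg = TokenizerRegistry()
reg.add("llama-3", object())       # stand-ins for loaded tokenizer objects
reg.add("qwen-2.5", object())
print(reg.list())                  # ['llama-3', 'qwen-2.5']
reg.remove("llama-3")
print(reg.list())                  # ['qwen-2.5']
```

In the gateway these operations are exposed over HTTP, so tokenizers can be swapped without a restart.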

🧠 Parser Endpoints

  • /parse/reasoning - Parse reasoning outputs
  • /parse/function_call - Parse function call responses
  • GLM-4 function call parser - Contributed directly by the GLM team for latest GLM models
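
As a rough sketch of what a reasoning parse produces, assuming a DeepSeek-R1-style <think>...</think> delimiter (the server-side parsers are model-specific, and the field names below are assumptions for illustration):

```python
import re

def parse_reasoning(text: str) -> dict:
    """Split a model response into its reasoning trace and final answer."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return {"reasoning_content": reasoning, "content": answer}

out = parse_reasoning("<think>2+2 is 4.</think>The answer is 4.")
print(out)  # {'reasoning_content': '2+2 is 4.', 'content': 'The answer is 4.'}
```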

📊 Embeddings Support

Native embeddings endpoint for gRPC router - expand beyond text generation to embedding workloads.

🔐 Server-Side TLS Support

Secure your gateway deployments with native TLS support.

🌐 Go Implementation (contributed by the iFlytek MaaS team)

A complete Go implementation of the SGLang Model Gateway with an OpenAI-compatible API server – bringing SGLang to the Go ecosystem!

⚡ Major Enhancements

Control Plane - Workflow Engine

Intelligent lifecycle orchestration with:

  • DAG-based parallel execution with pre-computed dependency graphs
  • Concurrent event processing for maximum throughput
  • Modular add/remove/update workflows

Performance Optimization

  • Lock-free data structures: DashMap for policy lookups, lock-free router snapshots
  • Reduced CPU overhead: Optimized worker registry, gRPC client fetch, and worker selection
  • Optimized router management: Improved selection algorithms and state management

Resilience & Reliability:

  • Retry and circuit breaker support for OpenAI and gRPC routers
  • Enhanced circuit breaker with better state management
  • Graceful shutdown for TLS and non-TLS servers
  • Unified error responses with error codes and X-SMG-Error-Code headers

Infrastructure:

  • Multi-architecture Docker builds (Linux, macOS, Windows, ARM)
  • Custom Prometheus duration buckets
  • Improved logging across all modules

🐛 Bug Fixes & Stability

  • Fixed cache-aware routing in gRPC mode
  • Resolved load metric tracking and double-decrease issues for cache-aware load balancing
  • Improved backward compatibility for GET endpoints
  • Fixed gRPC scheduler launcher issues
  • Fixed token bucket negative duration panics
  • Resolved MCP server initialization issues

📚 Documentation

Major documentation update with comprehensive guides, examples, and best practices for SGLang Model Gateway.

⚠️ Migration checklist:

  • Update Prometheus dashboards for new metrics
  • Update worker API integrations for UUID-based management
  • Review new error response format

⚡ Built for speed. Engineered for scale. Production-proven.

Gateway Changes (108 commits)


Release Gateway-v0.2.4

10 Dec 01:09
390406c


🚀 SGLang Model Gateway v0.2.4 Released!

We're excited to announce SGLang Model Gateway v0.2.4 – a massive release focused on performance, security, and production-ready observability!

✨ Headline Features

⚡ Major Performance Optimizations

We've invested heavily in performance across the entire stack:

  • Optimized radix tree for cache-aware load balancing – Smarter routing decisions with lower overhead
  • Tokenizer optimization – Dramatically reduced CPU and memory footprint during tokenization
  • Core module optimization – HTTP and gRPC routers now run leaner and faster
  • Efficient OTEL implementation – Production-grade observability with minimal performance impact

🔌 Industry-First WASM Middleware Support

Programmable middleware using WebAssembly! Extend your gateway with safe, isolated plugins. Build custom routing logic, transform requests/responses, or integrate proprietary systems – all without touching core code. Your gateway, your rules.

📊 Production-Grade Observability

Full OpenTelemetry integration with distributed tracing for both HTTP and gRPC. Track requests across your entire inference stack with native trace context propagation. Finally, real visibility into your LLM infrastructure.

⚡ Built for speed. Hardened for security. Ready for production.

Gateway Changes (98 commits)


Release v0.5.6

03 Dec 05:11
7ae368e


Highlights

  • Support for DeepSeek V3.2/V3.2 Speciale #14249
  • Blockwise diffusion language model support #12588
  • Support for new diffusion models (Flux2 #14000, Z-image #14067)
  • Introduce JIT Kernels #13453
  • Upgrade to Torch 2.9 #12969
  • Kimi-K2-Thinking model enhancement #12882
  • Memory management/Overlap spec compatibility #12224 #12839
  • More performance optimization: DeepSeek-v3-fp4/GLM-4.6/Kimi-K2/DeepSeek-V3.2...
  • CI/CD Enhancement

What's Changed


Release Gateway-v0.2.3

17 Nov 11:23
172c71a


🚀 SGLang Model Gateway - New Release!

We're excited to announce another powerful update to SGLang Model Gateway with performance improvements and expanded database support!

Headline Features

⚡ Bucket Mode Routing - 20-30% Performance Boost
Introducing our new bucket-based routing algorithm that dramatically improves performance in PD mode, delivering up to 20-30% improvements in TTFT (Time To First Token) and overall throughput.

💾 PostgreSQL Support for Chat History Management
Flexibility in data storage! We now support PostgreSQL alongside OracleDB and in-memory storage for chat history management.

🛠️ Enhanced Model Tool & Structured Output Support

  • MiniMax M2 model support!
  • Structured model output for OpenAI and gRPC router
  • Streaming parsing with Tool Choice in chat completions API
  • tool_choice support for the Responses API
  • OutputItemDone events with output item array storage for better observability

🐛 Stability & Quality Improvements

Multiple bug fixes for model validation, streaming logic, reasoning content indexing, and CI stability enhancements.

🔧 Code Quality Enhancements

Refactored builders for chat and responses, restructured modules for better maintainability, and consolidated error handling.

Try the latest version: pip install sglang-router --upgrade

What's Changed in Gateway

Gateway Changes (45 commits)

New Contributors

Paths Included

  • sgl-router
  • python/sglang/srt/grpc
  • python/sglang/srt/entrypoints/grpc_server.py

Full Changelog: gateway-v0.2.2...gateway-v0.2.3

Release v0.5.5

06 Nov 17:54
0c006b8


Highlights

What's Changed


Release Gateway-v0.2.2

17 Nov 11:19
6237754


🚀 SGLang Model Gateway v0.2.2 Released!

Features

🎯 Industry-First Responses API for All Models
We're bringing OpenAI's Responses API to the entire open-source ecosystem! Now enjoy native support for Llama, DeepSeek, Qwen, and more – with built-in chat history management, multi-turn conversations, and seamless MCP integration. This is the first solution to democratize advanced conversation management across all OSS models.

☸️ Production-Ready Kubernetes Operations
Taking large-scale deployments seriously! We now support native gRPC health check endpoints, making it effortless to deploy and operate SGLang at scale on Kubernetes with proper health monitoring and orchestration.

🔐 Your Network, Your Control

  • mTLS Support: Secure gateway-to-SGLang communication whether you're running on edge, remote cloud, multi-cloud, or hybrid environments – we've got you covered
  • MCP Proxy Enhancements: Configure proxies globally or per-individual MCP server – complete network control in your hands

🤖 Harmony Pipeline
Introducing our unified OpenAI-native architecture with GPT OSS model support for both Responses API and Chat Completion – fully integrated with MCP and intelligent storage management.

🌍 Universal Platform Support
A major leap in accessibility! SGLang Model Gateway now runs on nearly every operating system and architecture: Linux, Windows, Mac, x86, and ARM. Even better – we support all Python versions from 3.8 to 3.14 in a single wheel file, while reducing wheel size by more than 40%. Deploy anywhere, on any Python version, with unprecedented efficiency!

⚡ Additional Enhancements

  • Multi-worker URL support for better load distribution
  • Connection pooling and tool inventory for MCP
  • Native OpenAI web search tool support and function calling for OpenAI router

🐛 Stability Improvements

We've squashed numerous bugs including background task handling, tool call IDs, conversation management, and installation dependencies.

Try it now: pip install sglang-router==0.2.2


What's Changed in Gateway

Gateway Changes (48 commits)

New Contributors

Paths Included

  • sgl-router
  • python/sglang/srt/grpc
  • python/sglang/srt/entrypoints/grpc_server.py

Full Changelog: gateway-v0.2.1...gateway-v0.2.2