Releases: sgl-project/sglang
Release v0.5.4
Highlights
- AMD AI Dev Day 2025 SGLang (slide), PyTorch Conference 2025 SGLang (slide)
- Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
- [beta] Overlap scheduler for speculative decoding: #11762
- [beta] Piecewise CUDA graph for prefill: #11490
- Prefix cache for qwen3 next and GDN/mamba models: #11214
- Fullset optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling) (https://docs.sglang.ai/basic_usage/deepseek_v32.html, #11989)
- Various Blackwell kernel optimizations
- DGX Spark Support: https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
- KTransformer integration: https://lmsys.org/blog/2025-10-22-KTransformers/
- New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
- Native ModelOpt quantization support
What's Changed
- [router] add ipv6 support across all components by @slin1237 in #11219
- Remove env var warnings for release by @merrymercy in #11262
- Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in #7149
- [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` by @CatherineSue in #11270
- disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in #11274
- docker: add manifest to versioned docker releases by @ishandhanani in #11268
- [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in #11182
- [router][grpc] Refine streaming processes by @CatherineSue in #11277
- Fix code sync scripts by @merrymercy in #11276
- [Auto Sync] Update test_utils.py (20251006) by @merrymercy in #11280
- Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in #11279
- Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in #11238
- Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in #11261
- fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in #11282
- docs: update sgl-kernel README by @zhyncs in #11286
- chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in #11281
- [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in #11283
- convert test_deterministic into unit tests by @skyzh in #11095
- Feature/longbench v2 evaluation utils by @alhridoy in #10949
- [ci] fix pp test by @hnyls2002 in #11294
- EAGLE cache fix for SWARadixCache by @ispobock in #11231
- Remove overlap thread by @hnyls2002 in #11210
- [router] add reasoning and tool parser argument in router by @slin1237 in #11290
- Remove sampling info events and overlap thread file by @hnyls2002 in #11300
- Introduce future indices by @hnyls2002 in #11301
- [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in #11068
- [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in #11302
- [router] add get server info and get model info in grpc server by @slin1237 in #11303
- [router][grpc] Refactor chat template content format detection by @CatherineSue in #11288
- [Doc] HiCache Design Documents by @ykwd in #11027
- [Doc]: Best Practice for HICache by @hzh0425 in #11001
- [router] fix grpc connection conversion and add optimization by @slin1237 in #11305
- [router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in #11306
- Update tool parser and related documentation by @JustinTong0323 in #11223
- [router][grpc] Fix error message format in grpc chat handler by @CatherineSue in #11307
- [quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in #11205
- [router] support Openai router conversation API CRUD by @key4ng in #11297
- [router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in #11311
- [router] cleanup worker health check to return early by @slin1237 in #11310
- [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults by @CatherineSue in #11304
- Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in #11200
- ci: unify the model launch method of nightly ci by @mickqian in #11230
- [Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in #10710
- update sampling_params documentation with defaults by @JustinTong0323 in #11315
- Optimize copy_kv_cache for spec decoding by @YAMY1234 in #11126
- Rename `ngram_utils` -> `ngram_info` by @hnyls2002 in #11316
- [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in #11314
- [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in #9545
- [8/N] MoE Refactor: deprecate `EPMoE` by @ch-wan in #11211
- Skip weight loading in deepgemm compilation by @ch-wan in #11312
- [2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in #10937
- [Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in #11321
- fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in #11007
- Support LoRA in bench_serving oai interface by @lifuhuang in #11318
- benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in #9812
- [CI] improve disaggregation CI. by @hnyls2002 in #11264
- model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in #10909
- [router] refactor generate to use new pipeline arch by @slin1237 in #11323
- [router] improve reasoning parser lock and reduce req cloning by @slin1237 in #11336
- [router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in #11340
- [router] Fix all unused_qualifications by @CatherineSue in #11341
- [router] Support history management using conversation by @key4ng in #11339
- [router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in #11342
- fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in #11327
- [Auto Sync] Update scheduler.py (20251009) by @zhyncs in #11350
- [Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in #10979
- [router][grpc] disable health check generation and increase timeout by @slin1237 in #11353
- [router] Refactor OpenAI router: split monolithic file and move location by @key4ng in #11359
- [router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in #11366
- [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in #11309
- [router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in https://github.c...
Release Gateway-v0.2.1
🚀 SGLang Model Gateway v0.2.1 Released!
This release focuses on stability, cleanup, and two big new performance features.
🧾 Docs & CI
- Updated router documentation to reflect recent feature additions
🧹 Code Cleanup
- Refactored StopSequenceDecoder for cleaner incremental decoding
- Added spec.rs test harness under spec/ for structured unit tests
🐞 Bug Fixes
- Fixed UTF-8 boundary in stop-sequence decoding
- Fixed gRPC timeout configuration
- Fixed worker filtering, tool-choice normalization, and bootstrap-port handling
- Additional gRPC server warm-up and concurrency fixes
🌟 New Features
- Two-Level Tokenizer Caching (L0 + L1)
- L0: exact-match cache for repeated prompts
- L1: prefix-aware cache at special-token boundaries
- OpenAI-Style Classification API → new /v1/classifications endpoint, shout out to yanbo for the contribution
- Worker Management Workflow Engine → improved async registration, worker self discovery, and health orchestration
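The two-level tokenizer cache above is implemented in the router's Rust core; as a rough illustration of the idea only (invented names, not the gateway's actual code), an L0 exact-match layer in front of an L1 prefix layer could look like:

```python
# Conceptual sketch of two-level tokenizer caching (illustrative, not the
# gateway's Rust implementation).
class TwoLevelTokenizerCache:
    def __init__(self, tokenize_fn, special_token_ids):
        self.tokenize = tokenize_fn            # underlying (slow) tokenizer
        self.l0 = {}                           # L0: exact prompt -> token ids
        self.l1 = {}                           # L1: cached prefix -> token ids
        self.special_token_ids = set(special_token_ids)

    def encode(self, prompt: str):
        # L0: exact-match hit for repeated prompts.
        if prompt in self.l0:
            return self.l0[prompt]
        # L1: reuse tokens of the longest cached prefix, tokenize the rest.
        best = ""
        for prefix in self.l1:
            if prompt.startswith(prefix) and len(prefix) > len(best):
                best = prefix
        if best:
            ids = self.l1[best] + self.tokenize(prompt[len(best):])
        else:
            ids = self.tokenize(prompt)
        self.l0[prompt] = ids
        # Only cache prefixes ending on a special-token boundary, so a cached
        # prefix's tokenization cannot change when more text is appended.
        if ids and ids[-1] in self.special_token_ids:
            self.l1[prompt] = ids
        return ids
```

The special-token-boundary restriction on L1 is what keeps prefix reuse safe: BPE-style tokenizers can merge across arbitrary text boundaries, but not across special tokens.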
What's Changed in Gateway
Gateway Changes (26 commits)
- [router] release router 0.2.1 (#11885) by @slin1237 in #11885
- [router][grpc] Fix warm-up random token ids for small models (#11887) by @CatherineSue in #11887
- [router] clean up workflow logs to debug for implementation details logs (#11886) by @slin1237 in #11886
- fix(sql-router): fix conflict port in test (#11826) by @htiennv in #11826
- [router][grpc] Remove `continue_final_message` in `ChatTemplateParams` and add `minijinja-contrib` (#11882) by @CatherineSue in #11882
- [router] remove encoding header for oai router (#11881) by @slin1237 in #11881
- [router] Worker Management Workflow Engine (#11868) by @slin1237 in #11868
- [2/2] [feature] support openai like classification api in router (#11670) by @whybeyoung in #11670
- [router] Add Configurable L0 and L1 Tokenizer Caching (#11688) by @slin1237 in #11688
- [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client (#11798) by @CatherineSue in #11798
- [Lint] Add `python/sglang` to ruff F401 checks and remove unused imports in files (#11685) by @CatherineSue in #11685
- [router][grpc] Remove timeout for connections and remove `max_tokens` deprecation warning log (#11775) by @CatherineSue in #11775
- [doc] update router document (#11767) by @key4ng in #11767
- [router] fix grpc client time out to 1h (#11768) by @slin1237 in #11768
- [router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder (#11766) by @slin1237 in #11766
- Revert "[router] fix get_models endpoint for openai router (#11687)" (#11740) by @key4ng in #11687
- [router] Add rustfmt and set group imports by default (#11732) by @CatherineSue in #11732
- [router] add spec.rs to enable tests under spec folder (#11734) by @key4ng in #11734
- [router] Fix tool_choice normalization in ChatCompletionRequest and fix ut (#11731) by @CatherineSue in #11731
- [router][grpc] add dissag info to warm up in grpc server (#11727) by @slin1237 in #11727
- [router] fix p and d worker filtering and bootstrap port handling (#11729) by @slin1237 in #11729
- [Router] Refactor protocol definitions: split spec.rs into modular files (#11677) by @key4ng in #11677
- [router] fix get_models endpoint for openai router (#11687) by @key4ng in #11687
- [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676) by @slin1237 in #11676
- [router][grpc] Simplify model_id determination (#11684) by @CatherineSue in #11684
- [router] Fix response api related spec (#11621) by @key4ng in #11621
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.2.0...gateway-v0.2.1
Release Gateway-v0.2.0
🚀 Release: SGLang Model Gateway v0.2.0 (formerly “SGLang Router”)
🔥 What’s new
🧠 Multi-Model Inference Gateway (IGW) Mode
IGW turns one router into many — letting you manage multiple models at once, each with its own routing policy, priorities, and metadata. Think of it as running several routers under one roof, with shared reliability, observability, and API surface.
You can dynamically register models via /workers, assign labels like tier or policy, and let the gateway handle routing, health checks, and load balancing.
Whether you’re mixing Llama, Mistral, and DeepSeek, or orchestrating per-tenant routing in enterprise setups, IGW gives you total control.
Your fleet, your rules. ⚡
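As a hedged sketch of what dynamic registration could look like: the `/workers` endpoint and the `tier`/`policy` labels come from the notes above, but the payload field names (`url`, `model`, `labels`) are illustrative assumptions, not the gateway's exact schema.

```python
import json

def make_register_payload(worker_url: str, model: str, labels: dict) -> str:
    """Build a JSON body for registering a worker (field names are assumed)."""
    return json.dumps({"url": worker_url, "model": model, "labels": labels})

payload = make_register_payload(
    "http://10.0.0.5:30000",
    "meta-llama/Llama-3.1-8B-Instruct",
    {"tier": "premium", "policy": "cache_aware"},
)
# The body would then be POSTed to the gateway's /workers endpoint,
# e.g. with curl or an HTTP client of your choice.
```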
⚡ gRPC Mode: Rust-Powered, Built for Throughput
This is the heart of 0.2.0. The new gRPC data plane runs entirely in Rust — tokenizer, reasoning parser, and tool parser included — giving you native-speed performance and lower latency.
You can connect to gRPC-based SGLang workers, stream tokens in real time, and serve OpenAI-compatible APIs.
🌐 OpenAI-Compatible Gateway
Seamlessly proxy requests to OpenAI, while keeping data control local.
Conversation history, responses, and background jobs all flow through the gateway — same API, enterprise privacy.
💾 Pluggable History Storage
Choose between `memory`, `none`, or `oracle` for conversation and /v1/responses data.
- `memory`: fastest for ephemeral runs.
- `none`: zero persistence, zero latency overhead.
- `oracle`: full persistence via Oracle ATP with connection pooling and credentials support.
🧩 Pluggable MCP Integration
The gateway now natively speaks MCP across all transports (STDIO, HTTP, SSE, Streamable), so your tools can plug directly into reasoning and response loops — perfect for agentic workflows and cross-model orchestration.
🛡️ Reliability & Observability Upgrades
Built-in:
- Retries with exponential backoff + jitter
- Per-worker circuit breakers
- Token-bucket rate limiting & FIFO queuing
- Prometheus metrics for latency, load, queue depth, PD pipelines, tokenizer speed, and MCP activity
- Structured tracing & request-ID propagation
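The retry pattern listed above (exponential backoff plus jitter) can be sketched in a few lines; the constants here are illustrative, not the router's actual defaults.

```python
import random

# "Full jitter" exponential backoff: the wait ceiling doubles per attempt
# (capped), and the actual delay is drawn uniformly from [0, ceiling) to
# de-synchronize retrying clients.
def backoff_delays(attempts, base=0.1, cap=10.0, rng=random.random):
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng() * ceiling)             # jittered delay in [0, ceiling)
    return delays
```

A caller would sleep for each delay between attempts, giving up after the list is exhausted or the circuit breaker for that worker opens.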
✨ SGLang Model Gateway v0.2.0 — built in Rust, designed for scale, ready for reasoning.
What's Changed in Gateway
Gateway Changes (238 commits)
- [router] upgrade to 0.2.0 (#11642) by @slin1237 in #11642
- [router] add worker self discovery for metadata (#11638) by @slin1237 in #11638
- [router][grpc] add warm up to grpc server (#11627) by @slin1237 in #11627
- [router] update router readme to latest features (#11619) by @slin1237 in #11619
- [router] add chang and keyang to sgl router author (#11620) by @slin1237 in #11620
- [router] cleanup app context and move to startup (#11617) by @slin1237 in #11617
- [router] add py binding and readme for openai router and history backend (#11453) by @key4ng in #11453
- [router] when given both local tokenizer and chat template, log all (#11601) by @slin1237 in #11601
- [router] allow router launch server to use grpc mode (#11600) by @slin1237 in #11600
- [router] delete useless table content comment in spec (#11597) by @slin1237 in #11597
- [router] change worker api to async instead of sync (#11566) by @slin1237 in #11566
- [router] update generate spec to align with sgl io struct (#11591) by @slin1237 in #11591
- [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588) by @CatherineSue in #11588
- [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564) by @CatherineSue in #11564
- [router][grpc] Add error handling to `generate_tool_constraints` (#11562) by @CatherineSue in #11562
- [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483) by @Jonahcb in #11483
- [router] allow user to specify chat template path (#11549) by @slin1237 in #11549
- [router][grpc] Further delegate non-stream processing to `processing.rs` (#11553) by @CatherineSue in #11553
- [router][Fix] Include grpc reflection runtime dependency (#11419) by @ai-jz in #11419
- [router] allow tokenizer path to be dir (#11530) by @slin1237 in #11530
- [router] openai router: support grok model (#11511) by @key4ng in #11511
- Fix the GPT function calling regex to allow dash in the name (#10577) by @antoine-roux in #10577
- [Router]: Small Typo in a comment within tree.rs (#11489) by @xuwenyihust in #11489
- Super tiny delete unused openai router in sgl-router (#11448) by @fzyzcjy in #11448
- [router][grpc] Consolidate parser checks for chat completions (#11439) by @CatherineSue in #11439
- [router] leverage RAII to actively cancel request during client disconnect (#11399) by @slin1237 in #11399
- [router] disable rate limiter by default (#11435) by @slin1237 in #11435
- [router] Fix ci nvcc not found error (#11411) by @key4ng in #11411
- move more files under srt/utils (#11285) by @merrymercy in #11285
- [router] conversation item API: create, retrieve and delete (#11369) by @key4ng in #11369
- [router] change grpc client from mutable to clone (#11394) by @slin1237 in #11394
- [router][grpc] Replace fake health check with correct ones (#11387) by @CatherineSue in #11387
- [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (#11373) by @CatherineSue in #11373
- [router][lint] Add unused_qualifications to cargo lint warnings (#11366) by @CatherineSue in #11366
- [router] Refactor OpenAI router: split monolithic file and move location (#11359) by @key4ng in #11359
- [router][grpc] disable health check generation and increase timeout (#11353) by @slin1237 in #11353
- [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (#11342) by @CatherineSue in #11342
- [router] Support history management using conversation (#11339) by @key4ng in #11339
- [router] Fix all unused_qualifications (#11341) by @CatherineSue in #11341
- [router][grpc] Cleanup debug logs in grpc_server and grpc_router (#11340) by @CatherineSue in #11340
- [router] improve reasoning parser lock and reduce req cloning (#11336) by @slin1237 in #11336
- [router] refactor generate to use new pipeline arch (#11323) by @slin1237 in #11323
- [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (#11314) by @CatherineSue in #11314
- [router] cleanup worker health check to return early (#11310) by @slin1237 in #11310
- [router] support Openai router conversation API CRUD (#11297) by @key4ng in #11297
- [router][grpc] Fix error message format in grpc chat handler (#11307) by @CatherineSue in #11307
- [router][grpc] Fix sampling_params.stop_strs is None (#11306) by @CatherineSue in #11306
- [router] fix grpc connection conversion and add optimization (#11305) by @slin1237 in #11305
- [router][grpc] Refactor chat template content format detection (#11288) by @CatherineSue in #11288
- [router] add get server info and get model info in grpc server (#11303) by @slin1237 in #11303
- [router] add reasoning and tool parser argument in router (#11290) by @slin1237 in #11290
- [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (#11283) by @CatherineSue in #11283
- [router][grpc] Refine streaming processes (#11277) by @CatherineSue in #11277
- [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` (#11270) by @CatherineSue in #11270
- [router] add ipv6 support across all components (#11219) by @slin1237 in #11219
- [router] add grpc router pd mode for chat and generate (#11140) by @slin1237 in #11140
- [router] fix get load response parsin...
Release v0.5.3
Highlights
- Day 0 Support for DeepSeek-V3.2 with Sparse Attention: https://lmsys.org/blog/2025-09-29-deepseek-V32/
- Deterministic inference on multiple attention backends: https://lmsys.org/blog/2025-09-22-sglang-deterministic/
- Integration of FlashAttention 4 prefill kernels
- Enhancing support of Qwen3-Next with MTP, DP, optimized kernels and multiple hardware platforms
- Support models including Qwen3-VL series, dots.vlm1, Ling-V2, Apertus, SOLAR
What's Changed
- [Auto Sync] Update server_args.py (20250912) by @merrymercy in #10347
- [CPU][doc] add torch.compile param in example commands by @ZailiWang in #10349
- [router][ci] Add gpu utilization analyze with nvml by @key4ng in #10345
- [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked by @wenscarl in #9199
- fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale by @trevor-m in #10296
- model: support Apertus by @EduardDurech in #9774
- fix dual stream bug by @yizhang2077 in #10352
- [router] Basic OAI Response api by @key4ng in #10346
- Implement Standalone gRPC Server for SGLang Python Scheduler by @CatherineSue in #10283
- support memory_pool_host page first direct layout by @huangtingwei9988 in #10031
- fix the break in FlashInferFusedMoE by @chenqianfzh in #10356
- fix: resolve transfer_kv_all_layer_direct_lf_pf import error by @zhyncs in #10360
- Support LingV2 model by @strgrb in #10359
- Fix Bailing MoE model bugs by @yuan-luo in #10362
- Revert add mainprocess's proctitle by @whybeyoung in #10351
- model: support dots.vlm1 model by @yonghenglh6 in #8778
- Support loading weights from remote instance by @amysaq2023 in #8215
- add qwen3-next ut by @yizhang2077 in #10355
- Fix chunked prefix cache for nvfp4 by @wenscarl in #10180
- Fix FA4 import cause moe_fused_gate output be illegal memory by @fzyzcjy in #10368
- Fix global input scale incompatible with CuTe DSL moe by @fzyzcjy in #10370
- [router] Add Rerank Routing Logic in Regular Router by @fangjian601 in #10219
- [router] enable sccache in ci and local build by @slin1237 in #10099
- fix: add fast path for function call by @yizhang2077 in #9023
- [Auto Sync] Update base_grammar_backend.py, llguidance_back... (20250911) by @merrymercy in #10333
- fix: resolve gb200 image link by @zhyncs in #10343
- fix: exclude protobuf generated code by @zhyncs in #10388
- [bug] fix ci syntax by @slin1237 in #10390
- Fix GPU fault issue when run dsv3 with dp mode and enable torch-compile by @kkHuang-amd in #10361
- feat: add deepseek v3 fp4 ut by @zhyncs in #10391
- Add sentencepiece to project dependencies by @mmangkad in #10386
- [router] allow one router to support different model families and serving mode by @slin1237 in #10244
- [router] Add get and cancel method for response api by @key4ng in #10387
- Benchmark: Support API_KEY without 'bearer' by @Muqi1029 in #10380
- Support Qwen3-Next on Ascend NPU by @iforgetmyname in #10379
- [HiCache] fix mooncake config in different tp size by @stmatengss in #10377
- [HiCache] doc: update deployment in readme by @stmatengss in #10332
- [router] add not implemented functions for multi model trait by @slin1237 in #10394
- [Auto Sync] Update xgrammar_backend.py (20250913) by @merrymercy in #10395
- fix probs name which without temp scaling name by @narutolhy in #9984
- Fix the style of sgl kernel by @merrymercy in #10398
- fix: tool parse in large streaming chunk beginning with normal content by @JustinTong0323 in #10397
- [Fix] Init mamba related memory pools with torch.zeros by @byjiang1996 in #10400
- support qwen3_next blackwell by @yizhang2077 in #10403
- [Fix] Support qwen3-next MTP+DP by @byjiang1996 in #10392
- Update ROCm docker image to add sgl-router support by @kkHuang-amd in #10406
- [Performance] Dynamic Batch Tokenizer by @sundar24295s in #9382
- [Generative Score API] Scoring(Prefill-only) optimizations. by @sundar24295s in #9748
- Remove repeated list appends in `init_incremental_detokenization` by @hnyls2002 in #10412
- [Hack] Add pd-disaggregation decode polling interval by @hnyls2002 in #10411
- fix duplicated logger in eager_utils by @lj970926 in #10410
- Fix cutlass moe accuracy drop caused by attention UB from DP padding mode by @fzyzcjy in #10414
- Add self.capture_aux_hidden_states For GLM-4.5V by @zRzRzRzRzRzRzR in #10228
- Add h200 fused moe config for Qwen3-Next by @Ximingwang-09 in #10404
- Auto determine sgl kernel version in blackwell CI by @fzyzcjy in #10318
- Fix the global scale fix does not support EPLB and improve enabling condition by @fzyzcjy in #10369
- Let sgl-kernel changes be tested on srt by @fzyzcjy in #10313
- [2/2] Speed up prefill mla attention concat by @fzyzcjy in #10157
- Support offloading in fp8 by @fzyzcjy in #9948
- Support global scale in addition to per expert scale for cutedsl moe by @fzyzcjy in #10270
- Support profile args in Engine API by @fzyzcjy in #6539
- Fix sgl-kernel + srt CI by @fzyzcjy in #10419
- [PD metrics] Fix some uncompleted PD related metrics by @acelyc111 in #8627
- Typo: in `--enable-custom-logit-processor`: agree with cli arg by @thalahors in #10076
- [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 by @sufeng-buaa in #9962
- fix: use latest flashinfer by @zhyncs in #10428
- fix: enable cu124 and cu128 build on main push by @zhyncs in #10431
- [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path by @ch-wan in #10429
- Add split tile size for Triton attention by @ispobock in #10425
- Fix correction bias undefined behavior for nvfp4 models by @fzyzcjy in #10426
- feat: add dsv3 fp4 cutlass moe etp ut by @zhyncs in #10433
- router: Add Embedding routing logic by @tao12345666333 in #10129
- Revert "Fix FA4 import cause moe_fused_gate output be illegal memory" by @fzyzcjy in #10432
- [4/N] DP refactor: support watching mode `get_load` and shortest queue strategy by @hnyls2002 in #10201
- automatically label pr for ci by @merrymercy in #10435
- Refactor TopK to ensure readability and extensibility by @ch-wan in #9338
- Tiny fix wrong naming by @fzyzcjy in #10437
- Fix label pr for ci by @merrymercy in #10441
- metrics: support customer labels specified in request header by @acelyc111 in #10143
- [docs / oneliner] update mmmu docs instruction by @vincentzed in #9768
- Add reasoning examples for GPT-OSS in Markdown examples by @vincentzed in #9626
- Fix label PR by @merrymercy in #10445
- Update permissions in label-...
Release v0.5.2
Highlights
- SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends: https://lmsys.org/blog/2025-09-10-sglang-hicache/
What's Changed
- feat: allow use local branch to build image by @gongwei-130 in #9546
- [readme] Include additional resources for the SGLang x AMD SF Meetup event by @wisclmy0611 in #9547
- [doc] deepseekv31 support by @XiaotongJiang in #9544
- fix(grok): remove duplicate replicate_lm_head configuration by @vincentzed in #9549
- chore: update configurer by @zhyncs in #9557
- chore: bump v0.5.1.post1 by @zhyncs in #9558
- [router] add right rustls dependency in sgl-router cargo.toml by @Bruce-x-1997 in #9498
- fix: use sgl-kernel 0.3.5 by @zhyncs in #9565
- Add target module validation for init adapters by @Beichen-Ma in #9429
- fix: Update OpenAI client base URL in documentation by @JustinTong0323 in #9576
- [PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats by @SCDESPERTATE in #7317
- remove redundant rank0_log function. by @miter6 in #9560
- Update CUTLASS 4.2 & Enable K-Major Scale Factor for SM90 FP8 Blockwise Group GEMM by @HydraQYH in #9559
- Reintroduce memory usage fix by @fzyzcjy in #9535
- Offload tensors by sharding on GPU by @fzyzcjy in #9536
- bugfix for undefined logging functions in HarmonyBrowserTool & HarmonyPythonTool by @CiaranZhou in #9229
- chore: upgrade flashinfer 0.2.14.post1 by @zhyncs in #9578
- fix: revert #8593 by @zhyncs in #9581
- fix: resolve tuning fused moe issue by @zhyncs in #9587
- Tiny fix wrong comments by @fzyzcjy in #9589
- chore: update config by @zhyncs in #9591
- chore: bump v0.5.1.post2 by @zhyncs in #9592
- [Doc] add LWS(LeaderWorkerSet) use case in sgl-router README by @Bruce-x-1997 in #9568
- [Performance] Batch Send from Tokenizer Manager. by @sundar24295s in #9436
- Fix GLM45 tool call multi-turn bug by @byjiang1996 in #9500
- Fix GLM45v launch server cuda torch compile bug by @byjiang1996 in #9554
- Fix Harmony reasoning parser for and auto-separation for gpt-oss models by @jonaslsaa in #9190
- [docs] Refactor, remove compiled results and add gpt-oss by @zhaochenyang20 in #9613
- [Fix] HiCache Bugfix & Mooncake Error Handling Enhance by @ykwd in #8901
- Improve bench_one_batch_server script by @hnyls2002 in #9608
- [router] add mistral tool parser by @slin1237 in #9622
- [router] add qwen tool parser by @slin1237 in #9623
- [router] add pythonic parser by @slin1237 in #9628
- [router] add llama tool parser by @slin1237 in #9629
- [router] add ut for mistral, llama, pythonic, and streaming tool parser by @slin1237 in #9632
- [new feat] ascend backend support fia fusion kernel by @ZhengdQin in #8328
- model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 by @netanel-haber in #9301
- Fix lint for router by @hebiao064 in #9636
- [docs] Update README with additional highlights and resources for SGLang x AMD SF Meetup by @wisclmy0611 in #9640
- Add reasoning_effort param in TiktokenTokenizer.apply_chat_template by @lshmouse in #9630
- fix: allow user to specify function as role by @GavinZhu-GMI in #9635
- Fix kimi k2 function calling format by @XiaotongJiang in #9606
- [router] address worker load tracking consistency by @slin1237 in #9523
- [router] add token bucket rate limiter by @CatherineSue in #9656
- [doc] add kimik2 --tool-call-parser by @XiaotongJiang in #9647
- Install py-spy by default for containers for easier debugging by @fzyzcjy in #9649
- BugFix(hicache): Fix host indices out of bound error by @hzh0425 in #9637
- HiCache Storage fix host memory leak by @xiezhq-hermann in #9648
- add `response_format` support for `completion` API by @cicirori in #9665
- Fix FA3 swa spec verify topk>1 by @ispobock in #9658
- [RL] fix register the same ops multiple times by @hebiao064 in #9564
- chore: enhance bench_serving for vlms with a new dataset of configurable image count and resolution by @mickqian in #9583
- refactor(hicache): Introduce generic HiCacheStorageConfig for improved configuration management by @hzh0425 in #9555
- feat: (chat-template matching) enhance multimodal model detection with config.json by @KEVINTUAN12 in #9597
- [docs] Instructions for bench_serving.py by @yhyang201 in #9071
- Support DeepSeek-V3.1 tool call by @Xu-Wenqing in #9446
- Add A100 fused MoE kernel configs for Dpsk by @ehuaa in #9677
- support cuda 13.0 and trtllm kernel by @rainj-me in #9495
- fix: HiRadixCache: fix prefetch completion race by @pabloiyu in #9397
- fix mooncake store mla zero copy meta by @huangtingwei9988 in #9678
- move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py by @merrymercy in #9679
- [router] restructure tool parser module folder by @slin1237 in #9693
- [router] add deepseek tool parser by @slin1237 in #9694
- Quick fix for loading processor for supporting internvl3_5 series by @yilian49 in #9676
- Fix get_ip when no external network by @whybeyoung in #9700
- Sets default model name in request classes by @JustinTong0323 in #9683
- [router] add step3 tool parser by @slin1237 in #9695
- [router] add kimi-k2 tool parser by @slin1237 in #9702
- [router] add gpt-oss and glm4 tool parser by @slin1237 in #9703
- [sgl-kernel] misc: update deepgemm version for sgl-kernel by @FlamingoPg in #9340
- chore: upgrade sgl-kernel 0.3.7 by @zhyncs in #9708
- chore: bump v0.5.1.post3 by @zhyncs in #9716
- [router] upgrade kernel version in pd ci by @CatherineSue in #9720
- [Sync] Update mxfp4.py (20250827) by @merrymercy in #9724
- [router] fix error response in pd_router by @Bruce-x-1997 in #9505
- [router] Add MCP Tool Handler by @key4ng in #9615
- gpt-oss blog reproduction document by @hnyls2002 in #9728
- [router] additional pythonic parser unit test by @slin1237 in #9730
- [router] additional llama32 parser unit test and multi json support by @slin1237 in #9732
- support mooncake store dp attention by @huangtingwei9988 in #9684
- add support for nvidia/gpt-oss-120b-Eagle3 by @zyksir in #9739
- Move git clone command up from README by @JustinTong0323 in #9740
- [feat] Reduce GPU memory overhead by using weakref by @yhyang201 in #9673
- Support speculative decoding in hybrid attention backend by @Qiaolin-Yu in #9573
- [router] add llama3.2 multi json streaming parser by @slin1237 in #9735
- Support compile sgl-kernel on cuda 13.0 by @rainj-me in https://github.co...
Release v0.5.1
What's Changed
- [PD] Use batch transfer for rdma transport and add notes for mnnvl usage by @ShangmingCai in #8595
- [bugfix] QWen-1M context support[2/3]: using current cuda stream in the DCA's kernel for bugfix. by @sighingnow in #8611
- Fix hf3fs_fuse import error by @ispobock in #8623
- Update step3v default config by @ispobock in #8626
- [ci] fix genai-bench execution cmd by @slin1237 in #8629
- [router] update router pypi version by @slin1237 in #8628
- [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x by @b8zhong in #8577
- Fix typos in py_test/test_launch_server.py by @windsonsea in #6227
- misc: Remove debug print to logger.info by @CatherineSue in #8633
- SGLang HiCache NIXL Connector by @vvenkates27 in #8488
- [bug] remove pdlb from minilb since it's no longer available by @slin1237 in #8634
- [bugfix] Fix flashinfer cutlass EP moe after MoE refactor by @trevor-m in #8630
- Conditionally import HiCacheHF3FS by @pansicheng in #8598
- TRTLLM Gen MLA Decode Kernel Integration (same as #7938) by @farazkh80 in #8632
- Fix nan value generated after custom all reduce by @kkHuang-amd in #8532
- Revert "Fix nan value generated after custom all reduce (#8532)" by @zhyncs in #8642
- Feature/modelscope model download by @yrk111222 in #8083
- chore: speedup NPU CI by cache by @pkking in #8270
- [Bugfix] fix w8a8_int8 load issue by @iforgetmyname in #8308
- [bugfix] fix router python parser for pd urls by @slin1237 in #8644
- [router] add basic usage doc by @slin1237 in #8640
- [router] upgrade router version to 0.1.8 by @slin1237 in #8645
- [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE by @kaixih in #8450
- HiCache, fixing hash value indexing by @xiezhq-hermann in #8636
- Interface change for kvcache io to support page first layout by @xiezhq-hermann in #8318
- Update batch size limitation of dsv3_router_gemm kernel to 16 by @Fridge003 in #8051
- chore: bump v0.4.10.post1 by @ispobock in #8652
- Add hf3fs_utils.cpp to package-data by @pansicheng in #8653
- Fix chat template handling for OpenAI serving by @JustinTong0323 in #8635
- Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… by @byjiang1996 in #8511
- [5/N] MoE Refactor: Update MoE parallelism arguments by @ch-wan in #8658
- Increase tolerance to address CI failures by @lifuhuang in #8643
- [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 by @panpan0000 in #8013
- [Doc] fix: Update README for cu126 sgl-kernel compile problem by @Hongbosherlock in #8665
- fix per token cuda kernel hidden dim cannot divide by 16 by @hebiao064 in #8543
- fix arg typo for --disaggregation-transfer-backend by @ZacWang in #8664
- [fix] fix pd disagg error of vlms by @ccw1996 in #8094
- Disable tp for shared experts under expert parallelism for GLM4.5 model (#8647) by @zminglei in #8647
- [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla by @trevor-m in #8685
- [bug] limit bootstrap room to [0, 2^63 - 1] by @slin1237 in #8684
- Update CODEOWNERS by @merrymercy in #8686
- Fix deepgemm masked grouped gemm jit compile by @ispobock in #8679
- Fix FP8 block quantization when N or K is not multiples of 128 by @yanbing-j in #8648
- bugfix(hicache): Fix 'MooncakeStore' not defined error. by @hzh0425 in #8668
- upgrade xgrammar 0.1.22 by @Swipe4057 in #8522
- [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually by @lbh2001 in #8618
- Add support for NCCL symmetric memory for TP allreduces by @nvcastet in #8238
- [1/2] sgl-kernel: Fuse routed scaling factor into select_experts by @trevor-m in #8364
- chore(gb200): update dockerfile to handle fp4 disaggregation by @ishandhanani in #8694
- [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 by @trevor-m in #8688
- Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled by @GaoYusong in #7434
- model: adapt mllama4 to VisionAttention by @wenchen76 in #8512
- Add tensor.detach() back to update weight util by @hebiao064 in #8691
- [Doc] Polish sgl-kernel readme for cu126 build error by @FlamingoPg in #8704
- Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" by @hnyls2002 in #8706
- [router] minor code clean up and refactoring by @slin1237 in #8711
- [Bug] fix green context's incompatibility with `cuda < 12.4` by @hnyls2002 in #8701
- chore: bump sgl-kernel v0.2.9 by @zhyncs in #8713
- Remove assertions about per group quant fp8 by @fzyzcjy in #8717
- [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 by @merrymercy in #8693
- Fix triton moe error caused by TopK refactor by @fzyzcjy in #8705
- [router] Implement HTTP Dependency Injection Pattern for Router System by @slin1237 in #8714
- [Feature] Radix Tree in C++ by @DarkSharpness in #7369
- [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm by @HydraQYH in #8722
- Fix fused MoE when `routed_scaling_factor` is None by @hnyls2002 in #8709
- Tiny fix CI pytest error by @fzyzcjy in #8524
- [hotfix] fix mixtral with tensor-level compressed-tensor quantization by @ch-wan in #8721
- Support limiting max loaded loras in CPU. by @lifuhuang in #8650
- Reduce memory accumulation in long-running server by @Edenzzzz in #8306
- HiCache storage, style change and bug fix by @xiezhq-hermann in #8719
- [feat] support minimum token load balance in dp attention by @WANG-GH in #7379
- Do layernorm before allgather for DP attention by @trevor-m in #8631
- [fix] Fix divide by zero error for llama4. by @shenoyvvarun in #8683
- feat: Add new moe triton for NVIDIA RTX 6000 Ada by @17Reset in #8547
- [Improvements] Merge health check route by @whybeyoung in #8444
- chore: bump sgl-kernel 0.3.0 with torch 2.8.0 by @zhyncs in #8718
- Save cuda graph memory for fa3 by @ch-wan in #8567
- [CUDA Graph] save cuda graph memory by using next_token_logits_buffer by @ch-wan in #8579
- [DP] fix the compatibility issue between DP attention and `--attention-backend triton` by @ch-wan in #8723
- chore: bump v0.4.10.post2 by @zhyncs in #8727
- feat: Support DP Attention for step3_vl by @yhyang201 in #8699
- [RL] fix update weight for FusedMoE with EP by @zhuzilin in #8676
- use fp32 for e_score_correction_bias in GLM-4.5 by @zRzRzRzRzRzRzR in #8729
- Fix triton kernels topk with keyword arguments by @ispobock in https://github.com/sgl-project/sglang/pull/...
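Among the fixes above, #8577 reports up to a 3x speedup from pausing Python's garbage collector during CUDA graph capture. The general pattern can be sketched as a small context manager (an illustration of the technique, not the SGLang implementation; the actual `torch.cuda.graph` capture region is elided):

```python
import contextlib
import gc

@contextlib.contextmanager
def gc_disabled():
    """Temporarily pause Python's cyclic garbage collector.

    A GC pass during CUDA graph capture can trigger frees/allocations at
    unpredictable points, which slows capture down considerably; pausing
    collection for the duration of the capture avoids that.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()  # restore only if GC was on to begin with

with gc_disabled():
    pass  # the torch.cuda.graph(...) capture region would go here
```

Restoring the previous state (rather than unconditionally calling `gc.enable()`) keeps the context manager safe to nest.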
Release Gateway-v0.1.9
What's Changed in Gateway
Gateway Changes (10 commits)
- [router] upgrade router version to 0.1.9 (#8844) by @slin1237 in #8844
- refactor(sgl-router): Replace `once_cell` with `LazyLock` in worker.rs and remove once_cell dependency from Cargo.toml (#8698) by @htiennv in #8698
- [router] fix req handling order, improve serialization, remove retry (#8888) by @slin1237 in #8888
- [router] PD Router Simplification and Reorganization (#8838) by @slin1237 in #8838
- [router] complete router oai spec (#8828) by @slin1237 in #8828
- [pd-router] Add Configurable Retry Logic for reduce backend pressure (#8744) by @slin1237 in #8744
- [router] introduce dp worker abstraction (#8639) by @slin1237 in #8639
- [router] Implement HTTP Dependency Injection Pattern for Router System (#8714) by @slin1237 in #8714
- [router] minor code clean up and refactoring (#8711) by @slin1237 in #8711
- [bug] limit bootstrap room to [0, 2^63 - 1] (#8684) by @slin1237 in #8684
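The bootstrap-room bound in #8684 matters because room ids cross a boundary where they are read as signed 64-bit integers, so a value with the top bit set would be reinterpreted as negative. A hypothetical sketch of the constraint (names are illustrative, not the router's actual code):

```python
import random

MAX_I64 = 2**63 - 1  # largest value a signed 64-bit integer can hold

def new_bootstrap_room() -> int:
    """Draw a random room id constrained to [0, 2^63 - 1].

    Staying within the signed-64-bit range keeps the id valid for any
    consumer that parses it as an i64 rather than a u64.
    """
    return random.randint(0, MAX_I64)

room = new_bootstrap_room()
assert 0 <= room <= MAX_I64
```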
New Contributors
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.1.8...gateway-v0.1.9
Release Gateway-v0.1.8
What's Changed in Gateway
Gateway Changes (4 commits)
- [router] upgrade router version to 0.1.8 (#8645) by @slin1237 in #8645
- [router] add basic usage doc (#8640) by @slin1237 in #8640
- [bugfix] fix router python parser for pd urls (#8644) by @slin1237 in #8644
- Fix typos in py_test/test_launch_server.py (#6227) by @windsonsea in #6227
New Contributors
- @windsonsea made their first contribution in 061c8959f
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.1.7...gateway-v0.1.8
v0.4.10
Highlights
This is a regular release with many new optimizations, features, and fixes. Please check out the following roadmaps and blog posts:
- Please check the 2025 H2 roadmap #7736
- GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities https://lmsys.org/blog/2025-07-31-glm4-5/
- SpecForge: Accelerating Speculative Decoding Training for SGLang https://lmsys.org/blog/2025-07-25-spec-forge/
- Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/
- Accelerating SGLang with Multiple Token Prediction https://lmsys.org/blog/2025-07-17-mtp/
- How to support new VLMs into SGLang: A Case Study with NVILA https://lmsys.org/blog/2025-07-16-nvila/
- Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/
- slime: An SGLang-Native Post-Training Framework for RL Scaling https://lmsys.org/blog/2025-07-09-slime/
What's Changed
- [AMD] add aiter fused moe in DeepEP path by @alexsun07 in #7268
- enable aiter_biased_grouped_topk kernel by @valarLip in #7423
- [PD Disaggregation] replace transfer with batch transfer for better performance by @ssssnow in #7236
- Remove cumsum_buffer initialization by @ispobock in #7439
- [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm by @BBuf in #7422
- Support multi-thread model weight loading by @xianzhiT in #7277
- [PD] NIXL: Register kv args in advance and cleanup finished requests by @trevor-m in #6717
- fix: Add `--model` as an alias for `--model-path` in server_args by @CatherineSue in #7505
- misc: Improvements to serving_chat.py and add more unit tests by @CatherineSue in #7489
- Fuse sorted_token_ids padding to moe_align_block_size kernel by @ispobock in #7437
- [OAI] patch origin request_id logic by @whybeyoung in #7508
- [PD][Spec] Fix hidden state transfer for spec decode by @ShangmingCai in #7516
- EPLB support for MTP by @yilian49 in #7510
- clean duplicate code by @habaohaba in #7512
- [ci] add router benchmark script and CI by @slin1237 in #7498
- fix: force synchronization between TP workers when update_weights by @dangkai4u in #6626
- [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model by @chunyuan-w in #6641
- [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug by @ShangmingCai in #7522
- npu fused op by @ll819214 in #7386
- feat: send kvmetrics from sglang scheduler by @zixuanzhang226 in #6721
- [PD] Add different TP sizes support for no-MLA models by @Hongbosherlock in #6793
- enable aiter fp8 blockscale quant by @valarLip in #7520
- take aiter get_rope back by @valarLip in #7521
- Fix typo of flash_cache by @hebiao064 in #7513
- feat: add return hidden_states at async generation by @yyihuang in #7507
- minor: 'role' must be system/assistant/tool, but case insensitive for now by @minleminzui in #7499
- Fix FP8 KV Cache Support in FA3 Backend by @guoyuhong in #7148
- Fix gathered_buffer issues in tbo by @Qiaolin-Yu in #7531
- [PD] Raise error for incompatible mooncake version and some minor fixes by @ShangmingCai in #7527
- [CMake] Fix sgl-kernel CMakeLists for Blackwell by @MasterJH5574 in #7543
- Add Tencent HunYuanMoEV1 model support by @mpjlu in #7549
- Update seed in CPU UTs to avoid flaky failure with single test by @yanbing-j in #7544
- chore: improve ci bug reporting by @mickqian in #7542
- chore: remove vlm unnecessary import by @JustinTong0323 in #7541
- chore: bump v0.4.8.post1 by @zhyncs in #7559
- [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND by @trevor-m in #7330
- [Fix] incorrect assert in EPLB by @ch-wan in #7575
- Updates Gemma3n MLP layer to adapt latest transformers version by @JustinTong0323 in #7573
- Fix MTP error when enabling two-batch overlap by @fzyzcjy in #7569
- Add e2e test for multi instance multi stage memory release/resume occupation by @MrAta in #7208
- [CI] Add CI Testing for Prefill-Decode Disaggregation with Router by @key4ng in #7540
- Updates transformers and timm dependencies by @JustinTong0323 in #7577
- feat: support compatibility between MTP and two-batch-overlap by @Qiaolin-Yu in #7225
- Move multimodal processors into a separate folder by @merrymercy in #7581
- Fix broken CI TestVILAServer by @lifuhuang in #7610
- [router] add centralized configuration module for sgl-router by @slin1237 in #7588
- Fix: Minicpm by @JustinTong0323 in #7612
- Hybrid kv cache for LLaMA4 by @tarinkk in #6563
- [CPU] add optimizations for INT8 and FP8 DeepSeek by @chunyuan-w in #6769
- Tiny add logs for expert location updater by @fzyzcjy in #7308
- Fix flakiness in LoRA batch test. by @lifuhuang in #7552
- [BUG] fix local_rank in initialize_dp_attention by @TomQuartz in #7584
- Support dynamic LoRA loading / unloading in engine/server API by @lifuhuang in #7446
- [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated by @ShangmingCai in #7598
- fix unit tests by @zhyncs in #7618
- Let ep_scatter support arbitrary strides / ue8m0 format by @fzyzcjy in #7309
- Let EP prefill support new DeepGEMM by @fzyzcjy in #7310
- docs: add gb200 nvl72 and a16z grant by @zhyncs in #7620
- Adds support for OpenAI chat completions API in bench_serving by @JustinTong0323 in #7036
- [bugfix] Remove PR comment posting from Rust benchmark workflow by @slin1237 in #7625
- [Minor] clean up multimodal processor and tokenizer manager by @merrymercy in #7624
- Add dsv3 fused a gemm to sgl-kernel by @ispobock in #7630
- Add @mickqian as the CODEOWNERS of multimodal by @merrymercy in #7636
- Fix stream reasoning parser and Adds Kimi reasoning parser by @JustinTong0323 in #7432
- Fix sgl-router startup crash by @finetunej in #7619
- [bugfix] fix runtime dropping panic in editable by @slin1237 in #7628
- Move files related to EPLB by @fzyzcjy in #7580
- [misc] reduce weird rope_scaling_factor warning by @Alcanderian in #7176
- [AMD] Add unit-test-sgl-kernel-amd to AMD CI by @hubertlu-tw in #7539
- Update CODEOWNERS by @merrymercy in #7640
- [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py by @merrymercy in #7643
- [CPU] add c++ kernel to bind CPU cores and memory node by @chunyuan-w in #7524
- Improve streaming, log_level, memory report, weight loading, and benchmark script by @merrymercy in #7632
- Add dsv3 router gemm kernel by @Fridge003 in #7627
- chore: upgrade flashinfer v0.2.7 jit by @zhyncs in #7663
- [doc] update lws doc for pd by @whybeyoung in #7318
- Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes by @narutolhy in https://github.com/sgl-project...
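A recurring theme in the kernel fixes above (#7437 on padding `sorted_token_ids` for `moe_align_block_size`, #8648 on FP8 block quantization when N or K is not a multiple of 128) is handling dimensions that do not divide evenly by a kernel's block size: block counts must round up, never truncate. The arithmetic is plain ceiling division, sketched here as a standalone illustration rather than the kernels' actual code:

```python
def ceil_div(n: int, block: int) -> int:
    """Number of blocks needed to cover n elements (rounds up)."""
    return (n + block - 1) // block

def padded_len(n: int, block: int) -> int:
    """Smallest multiple of `block` that is >= n."""
    return ceil_div(n, block) * block

# A 300-wide dimension with 128-wide quantization blocks needs 3 blocks,
# the last one only partially filled; truncating division would give 2
# and silently drop the tail.
assert ceil_div(300, 128) == 3
assert padded_len(300, 128) == 384
assert padded_len(256, 128) == 256  # exact multiples are unchanged
```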
Release Gateway-v0.1.7
What's Changed in Gateway
Gateway/Router Changes (11 commits)
- [router] update router pypi version (#8628) by @slin1237 in #8628
- [router] migrate router from actix to axum (#8479) by @slin1237 in #8479
- [feature] [sgl-router] Add a dp-aware routing strategy (#6869) by @oldsharp in #6869
- [router] improve router logs and request id header (#8415) by @slin1237 in #8415
- [router] add different policies for p node and d node (#8395) by @slin1237 in #8395
- [router] add request format unit test (#8300) by @slin1237 in #8300
- [router] add streaming unit test (#8299) by @slin1237 in #8299
- [router] add endpoint unit test (#8298) by @slin1237 in #8298
- [router] fix pd model completion request (#8303) by @slin1237 in #8303
- [router] add common ut infra to mock worker and app (#8295) by @slin1237 in #8295
- fix: sgl-router remove dead code (#8257) by @oldsharp in #8257
New Contributors
- @oldsharp made their first contribution in a730ce816
- @oldsharp made their first contribution in c33499a67
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.1.6...gateway-v0.1.7