Replies: 4 comments 2 replies
Finally found out why there's such a huge GPU memory usage difference lol. The counterpart of … Once I changed to …
Great benchmark comparison! At RevolutionAI (https://revolutionai.io), we have done similar vLLM vs SGLang comparisons. A few factors explain the typical differences:

Where vLLM often wins: …

Where SGLang may win: …

Benchmark methodology tips: …

Key insight: the "winner" depends heavily on your workload pattern. Batch inference? vLLM. Interactive chat with shared context? SGLang might edge ahead. What was your test setup: concurrent users, prompt lengths, model size?
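The "shared context" point can be made concrete with a toy sketch. This is not SGLang's actual RadixAttention code, just the counting idea: previously seen token prefixes live in a trie, and a new request only needs fresh KV computation for the suffix it has not seen before.

```python
# Toy model of prefix caching: store seen token sequences in a trie; a new
# request reuses cached work for its longest already-seen prefix.
# (Illustrative only -- not SGLang's real implementation.)

class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Insert a token sequence; return how many leading tokens were
        already cached (i.e., whose KV values could be reused)."""
        node, reused, matching = self.root, 0, True
        for tok in tokens:
            if matching and tok in node:
                reused += 1
            else:
                matching = False
            node = node.setdefault(tok, {})
        return reused


cache = PrefixCache()
system = [1, 2, 3, 4]                    # shared system-prompt tokens
print(cache.insert(system + [10, 11]))   # 0 reused: cold cache
print(cache.insert(system + [20, 21]))   # 4 reused: system prompt is shared
```

With many concurrent chat users sharing the same system prompt, the reused fraction grows with every request, which is exactly the interactive-chat pattern where prefix caching pays off.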
This vLLM vs SGLang benchmark is really valuable! Let me help explain some of the differences:

Why SGLang might be faster in some cases: …

Why vLLM might be faster in others: …

Key factors in your benchmark, and what would explain your specific results: …

We've benchmarked both extensively at RevolutionAI for production deployments. Generally vLLM wins on throughput and SGLang on latency, but your workload matters most. Can you share your test config?
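Whatever the config turns out to be, latency consistency is easy to quantify. A minimal Python sketch that times requests against either server's OpenAI-compatible endpoint and reports percentiles; the endpoint URL and model name are placeholders for whichever server is under test:

```python
import json
import statistics
import time
import urllib.request


def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=0.99 for p99 latency."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, max(0, int(round(p * len(ranked))) - 1))
    return ranked[idx]


def time_one_request(url, payload):
    """Send one completion request; return wall-clock latency in seconds."""
    start = time.perf_counter()
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder endpoint and model; point at your vLLM or SGLang server.
    url = "http://localhost:8000/v1/completions"
    payload = {"model": "qwen2.5-7b", "prompt": "Hello", "max_tokens": 64}
    lat = [time_one_request(url, payload) for _ in range(50)]
    print(f"mean={statistics.mean(lat):.3f}s "
          f"p50={percentile(lat, 0.50):.3f}s "
          f"p99={percentile(lat, 0.99):.3f}s")
```

The gap between p50 and p99 is what "consistent response times" means in practice: a tight gap indicates stable scheduling, a wide one indicates queuing or recomputation spikes.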
Great benchmarking work! The memory and performance differences come from architectural choices.

Why SGLang uses less memory (7 GB vs 21 GB): …

To make vLLM use less memory:

```shell
vllm serve qwen2.5-7b \
  --gpu-memory-utilization 0.3 \
  --max-model-len 2048 \
  --max-num-seqs 8
```

Why SGLang has more consistent latency: …

Fair comparison settings:

```shell
# vLLM
vllm serve --gpu-memory-utilization 0.3

# SGLang
python -m sglang.launch_server --mem-fraction-static 0.3
```

What to benchmark: …

We benchmark inference engines at RevolutionAI; memory utilization config is the biggest factor in these comparisons.
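The 21 GB figure lines up with vLLM's default preallocation behavior: vLLM reserves a fixed fraction of total GPU memory at startup (`--gpu-memory-utilization`, default 0.9) for weights plus KV-cache blocks, so observed usage reflects the reservation, not demand. A quick arithmetic sanity check on a 24 GB A10:

```python
# vLLM preallocates a fixed fraction of GPU memory at startup, so what
# nvidia-smi shows is the reservation, not actual demand.
A10_TOTAL_GB = 24
DEFAULT_UTIL = 0.9   # vLLM's default --gpu-memory-utilization

reserved_gb = A10_TOTAL_GB * DEFAULT_UTIL
print(f"default reservation: {reserved_gb:.1f} GB")   # 21.6 GB, matching the ~21 GB observed

# Lowering the fraction shrinks the reservation proportionally:
tuned_gb = A10_TOTAL_GB * 0.3
print(f"with --gpu-memory-utilization 0.3: {tuned_gb:.1f} GB")   # 7.2 GB
```

In other words, most of the 21 GB vs 7 GB gap is configuration, not efficiency: with both engines pinned to the same memory fraction, the comparison becomes much fairer.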
Hi, I am using vLLM for all my projects, but I had been thinking I should give SGLang a try, so I ran a performance test comparing them. Before the test I had no idea what result I would get, as I had no bias at all, and I was very surprised by the outcome!
I used one A10 GPU to test Qwen 2.5-7B, since I have a specific, focused goal: to evaluate how vLLM and SGLang perform when running a small LLM on a mid-range NVIDIA GPU like the A10.
I found that SGLang uses only 7 GB of GPU memory compared with vLLM's 21 GB (the A10 has 24 GB in total) and delivers much better results, especially more consistent response times.
But why is there such a big difference? Can someone help explain it? This is my project: https://github.com/qiulang/vllm-sglang-perf
Thanks a lot.