Your current environment
Is there a way to control the default thinking behaviour for models deployed through vllm?
As per https://docs.vllm.ai/en/stable/features/reasoning_outputs.html, IBM Granite 3.2 has reasoning disabled by default, while Qwen3, GLM 4.6, and DeepSeek V3.1 all have reasoning enabled by default.
It would be great if there were a way to control this from vllm.
--override-generation-config lets the user override temperature and other sampling parameters at deployment, but it does not work for reasoning.
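For comparison, a sampling-parameter override does take effect at deployment time. A minimal sketch using the same image, flags, and Qwen3 model as my command below:

```bash
# Overriding a sampling parameter (temperature) at deployment works as documented.
docker run -d --runtime nvidia -p 8000:8000 --ipc=host \
  vllm/vllm-openai:v0.11.0 \
  --reasoning-parser qwen3 \
  --model Qwen/Qwen3-4B \
  --override-generation-config '{"temperature": 0.0}'
```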
I have tried:

```bash
docker run -d --runtime nvidia \
  -e TRANSFORMERS_OFFLINE=1 -e DEBUG="true" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:v0.11.0 \
  --reasoning-parser qwen3 \
  --model Qwen/Qwen3-4B \
  --override-generation-config '{"chat_template_kwargs": {"enable_thinking": false}}'
```
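The only approach that seems to work so far is per request rather than server-wide: the OpenAI-compatible /v1/chat/completions endpoint accepts a chat_template_kwargs field in the request body, which is the mechanism the reasoning docs show for Qwen3. A sketch, assuming the server above is running on localhost:8000:

```bash
# Per-request workaround: disable thinking via chat_template_kwargs in the request body.
# This only helps if every client remembers to send it; it is not a server-side default.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

What I am asking for is a way to set this (or an equivalent default) once at server start.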
How would you like to use vllm
I want to deploy reasoning models (e.g. Qwen3) through vllm with thinking disabled by default, without requiring every client to pass a per-request flag.