Structured Generation with Reasoning Parser in offline mode. #17638
Replies: 6 comments
-
Qwen3 uses the tag to control whether it outputs its reasoning process. I guess the chat template that vLLM uses when running Qwen3 doesn't automatically add the tag. You could try using the …
-
Is there any update on this on your end? @psych0v0yager
-
Loading an offline model with reasoning_parser works; checked on v0.10.1.1 and v0.11.0. I also hit the same problem that …
-
Good question! The constraint is that structured generation applies to the entire output, which conflicts with freeform thinking.

Workaround: two-stage generation. Stage 1: generate the thinking freeform, with a stop token at the end of the think block. This works because vLLM caches the KV state, so stage 2 reuses the thinking context.

Alternative: post-process extraction. Let the model generate freely, then extract the JSON portion from the output with a regex.

What would need backend changes: …

The two-stage approach adds one extra forward pass but works reliably. We use similar patterns for synthetic data generation at Revolution AI.
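The post-process extraction route mentioned above can be done with the standard library alone. A minimal sketch, assuming the model emits its reasoning first (optionally inside `<think>...</think>` tags) followed by a single JSON object; the helper name and delimiter convention are illustrative assumptions, not vLLM API:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Strip the freeform reasoning from a completion and parse the
    trailing JSON object. Hypothetical helper, not part of vLLM."""
    # Drop an explicit think block if present.
    body = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Grab the {...} span; a greedy match is good enough for one
    # top-level object, even with nested braces inside it.
    match = re.search(r"\{.*\}", body, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

raw = '<think>Texas... the capital is Austin.</think>\n{"output": "Austin"}'
print(extract_json(raw))  # {'output': 'Austin'}
```

This trades the guarantee of grammar-constrained decoding for simplicity: the JSON is only validated after the fact, so malformed outputs need a retry loop.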
-
Structured generation with reasoning is powerful! At RevolutionAI (https://revolutionai.io) we use this pattern.

Offline mode approach:

```python
from vllm import LLM, SamplingParams
from pydantic import BaseModel

class ReasonedOutput(BaseModel):
    reasoning: str
    answer: str
    confidence: float

llm = LLM(model="...")
params = SamplingParams(
    temperature=0.7,
    max_tokens=1000,
)

# Two-stage: reason, then structure
prompt = """Think step by step, then provide structured output.
Question: {question}
Reasoning:"""

output = llm.generate(prompt, params)
# Parse reasoning, then generate the structured answer
```

Alternative: Outlines integration:

```python
from outlines import models, generate

model = models.VLLM("...")
gen = generate.json(model, ReasonedOutput)
```

The key is separating reasoning from structured output!
-
From my point of view, the clean mental model is that reasoning and constrained JSON generation are two different decoding regimes. Once you ask for both in one offline pass, the real requirement becomes grammar switching or staged decoding rather than a small configuration tweak. A two-phase path that preserves cached context between freeform reasoning and structured output feels like the practical workaround today, while native support would likely require backend changes around mid-generation control.
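The staged-decoding idea above can be sketched as a control-flow skeleton. Everything here is a stub: `run_freeform` and `run_constrained` are hypothetical placeholders standing in for real generate calls (one with a stop string, one with a JSON grammar), not vLLM API. The point is only that the stage-2 prompt is a strict prefix extension of stage 1, which is what lets a KV cache be reused:

```python
def run_freeform(prompt: str, stop: str) -> str:
    # Placeholder for a generate() call that stops at `stop`.
    # Here we fabricate a reasoning trace for illustration.
    return "The capital of Texas is Austin."

def run_constrained(prompt: str) -> str:
    # Placeholder for a generate() call decoded under a JSON grammar.
    return '{"output": "Austin"}'

def two_stage(question: str) -> str:
    # Stage 1: freeform reasoning, stopped at the end of the think block.
    prompt = f"Question: {question}\n<think>"
    thinking = run_freeform(prompt, stop="</think>")
    # Stage 2: append the reasoning, close the block, then decode the
    # structured answer. The prompt only grows, so cached context from
    # stage 1 remains valid.
    prompt2 = prompt + thinking + "</think>\n"
    return run_constrained(prompt2)

print(two_stage("What is the capital of Texas?"))  # {"output": "Austin"}
```

Native support would amount to the backend doing this switch internally: decode unconstrained until a trigger token, then attach the grammar for the remainder of the sequence.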
-
According to the Qwen docs
https://qwen.readthedocs.io/en/latest/deployment/vllm.html
and the vLLM docs
https://docs.vllm.ai/en/latest/features/reasoning_outputs.html
it is currently not possible to use the reasoning parser and structured generation together in offline mode.
What is currently blocking this feature? I would like to use the latest Qwen 3 to generate some synthetic data. Ideally, Qwen 3 would reason about the request, then output its response as structured JSON. Currently, when I apply structured JSON in offline mode, it does not generate any thinking. Likewise, there is currently no reasoning parser in vLLM's offline generation.
It would be nice to do the following:
Question: What is the capital of Texas?
Raw Response:
generated thinking
{"output": "Austin"}
TL;DR: apply freeform generation for the thinking phase, then structured generation for the final response. Can this be implemented with clever workarounds in the current version of vLLM, or will it require some backend modification?