Structured Generation with Reasoning Parser in offline mode. #17638
Replies: 6 comments
-
Qwen3 uses the tag to control whether it outputs its reasoning process. I guess the chat template that vLLM uses when running Qwen3 doesn't automatically add the tag. You could try using the …
-
Is there any update on this on your end? @psych0v0yager
-
Loading an offline model with reasoning_parser works; checked on v0.10.1.1 and v0.11.0. I also hit the same problem that …
-
Good question! The constraint is that structured generation applies to the entire output, which conflicts with freeform thinking.

Workaround: two-stage generation. Stage 1: generate the thinking freeform, with a stop token at the end of the think block. This works because vLLM caches the KV state, so stage 2 reuses the thinking context.

Alternative: post-process extraction. Let the model generate freely, then extract the JSON portion from the output with a regex.

What would need backend changes: …

The two-stage approach adds one extra forward pass but works reliably. We use similar patterns for synthetic data generation at Revolution AI.
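The post-process extraction route mentioned above can be done with the standard library alone. A minimal sketch, assuming the model emits its reasoning first (optionally inside `<think>...</think>` tags) followed by a single JSON object; the helper name and delimiter convention are illustrative assumptions, not vLLM API:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Strip the freeform reasoning from a completion and parse the
    trailing JSON object. Hypothetical helper, not part of vLLM."""
    # Drop an explicit think block if present.
    body = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Grab the {...} span; a greedy match is good enough for one
    # top-level object, even with nested braces inside it.
    match = re.search(r"\{.*\}", body, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

raw = '<think>Texas... the capital is Austin.</think>\n{"output": "Austin"}'
print(extract_json(raw))  # {'output': 'Austin'}
```

This trades the guarantee of grammar-constrained decoding for simplicity: the JSON is only validated after the fact, so malformed outputs need a retry loop.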
-
Structured generation with reasoning is powerful! At RevolutionAI (https://revolutionai.io) we use this pattern.

Offline mode approach:

```python
from vllm import LLM, SamplingParams
from pydantic import BaseModel

class ReasonedOutput(BaseModel):
    reasoning: str
    answer: str
    confidence: float

llm = LLM(model="...")
params = SamplingParams(
    temperature=0.7,
    max_tokens=1000,
)

# Two-stage: reason, then structure
prompt = """Think step by step, then provide structured output.
Question: {question}
Reasoning:"""

output = llm.generate(prompt, params)
# Parse reasoning, then generate the structured answer
```

Alternative: Outlines integration:

```python
from outlines import models, generate

model = models.VLLM("...")
gen = generate.json(model, ReasonedOutput)
```

The key is separating reasoning from structured output!
-
From my point of view, the clean mental model is that reasoning and constrained JSON generation are two different decoding regimes. Once you ask for both in one offline pass, the real requirement becomes grammar switching or staged decoding rather than a small configuration tweak. A two-phase path that preserves cached context between freeform reasoning and structured output feels like the practical workaround today, while native support would likely require backend changes around mid-generation control.
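The staged-decoding idea above can be sketched as a control-flow skeleton. Everything here is a stub: `run_freeform` and `run_constrained` are hypothetical placeholders standing in for real generate calls (one with a stop string, one with a JSON grammar), not vLLM API. The point is only that the stage-2 prompt is a strict prefix extension of stage 1, which is what lets a KV cache be reused:

```python
def run_freeform(prompt: str, stop: str) -> str:
    # Placeholder for a generate() call that stops at `stop`.
    # Here we fabricate a reasoning trace for illustration.
    return "The capital of Texas is Austin."

def run_constrained(prompt: str) -> str:
    # Placeholder for a generate() call decoded under a JSON grammar.
    return '{"output": "Austin"}'

def two_stage(question: str) -> str:
    # Stage 1: freeform reasoning, stopped at the end of the think block.
    prompt = f"Question: {question}\n<think>"
    thinking = run_freeform(prompt, stop="</think>")
    # Stage 2: append the reasoning, close the block, then decode the
    # structured answer. The prompt only grows, so cached context from
    # stage 1 remains valid.
    prompt2 = prompt + thinking + "</think>\n"
    return run_constrained(prompt2)

print(two_stage("What is the capital of Texas?"))  # {"output": "Austin"}
```

Native support would amount to the backend doing this switch internally: decode unconstrained until a trigger token, then attach the grammar for the remainder of the sequence.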
-
According to the Qwen docs
https://qwen.readthedocs.io/en/latest/deployment/vllm.html
and the vLLM docs
https://docs.vllm.ai/en/latest/features/reasoning_outputs.html
it is currently not possible to use the reasoning parser and structured generation together in offline mode.
What is currently blocking this feature? I would like to use the latest Qwen 3 to generate some synthetic data. Ideally, Qwen 3 would reason about the request, then output its response as structured JSON. Currently, when I apply structured JSON in offline mode, it does not generate any thinking. Likewise, there is currently no reasoning parser in vLLM's offline generation.
It would be nice to do the following:
Question: What is the capital of Texas?
Raw Response:
generated thinking
{"output": "Austin"}
TL;DR: apply freeform generation for the thinking phase, then structured generation for the final response. Can this be implemented with clever workarounds in the current version of vLLM, or will it require some backend modification?