[SB] relax constraint on min number of new tokens #322
base: main
Conversation
Signed-off-by: Yannick Schnider <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
To my knowledge we previously didn't have a test for this case.
LGTM, thanks for fixing this.
Edit 1: check the code change suggestion before merging.
Edit 2: feel free to add such a test, it wouldn't hurt to have one. I guess it would belong in test_spyre_warmup_shapes.py?
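If it helps, here is a minimal sketch of what such a test could look like. This is an illustration under assumptions, not the repo's actual fixtures: the model name, the `VLLM_SPYRE_WARMUP_NEW_TOKENS` environment variable, and the test structure are hypothetical and would need to match the existing setup in test_spyre_warmup_shapes.py.

```python
import os

from vllm import LLM, SamplingParams


def test_warmup_with_two_new_tokens():
    # Hypothetical env var for the warmup shape; check the repo's
    # existing fixtures for the actual configuration mechanism.
    os.environ["VLLM_SPYRE_WARMUP_NEW_TOKENS"] = "2"

    # Illustrative model; the real test would reuse the repo's
    # standard small test model.
    llm = LLM(model="ibm-granite/granite-3.0-2b-instruct")

    # Request exactly 2 new tokens: the new minimum this PR allows
    # (1 would run prefill only during warmup and crash the compiler).
    params = SamplingParams(max_tokens=2,
                            min_tokens=2,
                            temperature=0,
                            ignore_eos=True,
                            logprobs=0)
    outputs = llm.generate(["Hello"], params)
    assert len(outputs[0].outputs[0].token_ids) == 2
```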
```python
SamplingParams(max_tokens=max_new_tokens[i],
               min_tokens=max_new_tokens[i],
               temperature=0,
               ignore_eos=True,
               logprobs=0) for i in range(len(max_new_tokens))
```
Suggested change:

```diff
-SamplingParams(max_tokens=max_new_tokens[i],
-               min_tokens=max_new_tokens[i],
-               temperature=0,
-               ignore_eos=True,
-               logprobs=0) for i in range(len(max_new_tokens))
+SamplingParams(max_tokens=max_tokens_i,
+               min_tokens=max_tokens_i,
+               temperature=0,
+               ignore_eos=True,
+               logprobs=0) for max_tokens_i in max_new_tokens
```
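Iterating over max_new_tokens directly is the more idiomatic Python here: it drops the range(len(...)) indirection and the repeated max_new_tokens[i] lookups without changing behavior.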
lgtm!
This was already confusing to at least one user, who thought it meant you also had to request at least 3 tokens in each API call. But I don't think we should focus too much on this anyway, since continuous batching is almost ready.
[SB] relax constraint on min number of new tokens
This relaxes an old constraint that required the number of requested new tokens to be at least 3. It turns out the only real requirement is that warmup runs at least one decode forward pass. Requesting 1 token runs prefill only during warmup, and the compiler crashes (presumably because it expects two graphs, prefill and decode). Requesting 2+ tokens executes at least one decode during warmup and thus produces a decode graph too, so things run smoothly for 2 or more tokens.
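For illustration, the relaxed constraint boils down to something like the following. This is a hedged sketch, not the actual vllm-spyre code; the function name and error message are hypothetical:

```python
def validate_warmup_new_tokens(warmup_new_tokens: int) -> None:
    """Hypothetical sketch of the relaxed warmup check.

    The first generated token comes from the prefill pass; every
    subsequent token is a decode pass. Warmup must run at least one
    decode so the compiler produces both a prefill and a decode graph,
    hence warmup_new_tokens >= 2 suffices (the old code required 3).
    """
    if warmup_new_tokens < 2:  # previously: < 3
        raise ValueError(
            "warmup must request at least 2 new tokens so that at "
            "least one decode forward pass gets compiled")
```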