Export NaNs in logits to scheduler_stats if output is corrupted #18777

vladmihailescu · 2025-05-27T19:04:16Z

Signed-off-by: Vlad Mihailescu [email protected]

Summary:
Report NaNs in logits in scheduler_stats. This can later be exported as a Prometheus counter, but for now it is required so we can export it in our internal counter infra.

This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits.

It's a common metric we expose.

Reviewed By: Adolfo-Karim

Differential Revision: D75423285

NO ENV VAR (NaN counting off):
I0618 22:24:27.611000 180881 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.56s
Success rate:        100.00%
QPS:                 7.14
Avg latency:         11.361s
Avg TTFT (client):   8245.31ms
P50 TTFT (client):   8262.13ms
P99 TTFT (client):   8283.14ms
Avg TTIT (client):   311.60ms
P50 TTIT (client):   323.78ms
P99 TTIT (client):   334.87ms
Avg TTFT (server):   6748.17ms
Avg TTIT (server):   357.83ms
Avg prefill len:     2587.54 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2596.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
I0618 22:22:13.811000 141492 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.30s
Success rate:        100.00%
QPS:                 7.18
Avg latency:         11.312s
Avg TTFT (client):   8209.60ms
P50 TTFT (client):   8214.03ms
P99 TTFT (client):   8273.95ms
Avg TTIT (client):   310.22ms
P50 TTIT (client):   322.10ms
P99 TTIT (client):   325.09ms
Avg TTFT (server):   6639.55ms
Avg TTIT (server):   355.36ms
Avg prefill len:     2587.49 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2595.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens

WITH ENVVAR (NaN counting on):
I0618 22:44:09.103000 508486 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.18s
Success rate:        100.00%
QPS:                 7.19
Avg latency:         11.291s
Avg TTFT (client):   8193.95ms
P50 TTFT (client):   8208.52ms
P99 TTFT (client):   8232.08ms
Avg TTIT (client):   309.75ms
P50 TTIT (client):   321.92ms
P99 TTIT (client):   333.03ms
Avg TTFT (server):   8574.45ms
Avg TTIT (server):   355.91ms
Avg prefill len:     2587.90 tokens
P50 prefill len:     2588.00 tokens
P99 prefill len:     2596.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
I0618 22:47:21.294000 566871 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 57.23s
Success rate:        100.00%
QPS:                 7.06
Avg latency:         11.532s
Avg TTFT (client):   8365.60ms
P50 TTFT (client):   8229.34ms
P99 TTFT (client):   9156.80ms
Avg TTIT (client):   316.59ms
P50 TTIT (client):   322.71ms
P99 TTIT (client):   414.20ms
Avg TTFT (server):   6794.25ms
Avg TTIT (server):   358.08ms
Avg prefill len:     2587.78 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2597.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
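
One rough way to read the four runs above is to average QPS per configuration and compare the two averages. The sketch below does only that arithmetic on the QPS figures reported above; the helper name `relative_change` is illustrative, and two runs per configuration are too few to support any claim about statistical significance.

```python
from statistics import mean

# QPS values copied from the benchmark output above.
qps_nan_counting_off = [7.14, 7.18]
qps_nan_counting_on = [7.19, 7.06]


def relative_change(baseline: float, candidate: float) -> float:
    """Return (candidate - baseline) / baseline as a fraction."""
    return (candidate - baseline) / baseline


mean_off = mean(qps_nan_counting_off)
mean_on = mean(qps_nan_counting_on)
change = relative_change(mean_off, mean_on)

print(f"mean QPS off: {mean_off:.3f}")
print(f"mean QPS on:  {mean_on:.3f}")
print(f"relative change: {change:+.2%}")
```

With these numbers the difference between the averages is well under one percent and smaller than the spread between the two runs with counting enabled (7.19 vs 7.06), so it sits within run-to-run noise.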

And unit test

pytest -s -v tests/v1/worker/test_gpu_model_runner.py

tests/v1/worker/test_gpu_model_runner.py::test_get_nans_in_logits INFO 06-19 02:06:26 [config.py:1444] Using max model len 2048
WARNING 06-19 02:06:26 [config.py:4758] Current vLLM config is not set.
WARNING 06-19 02:06:26 [config.py:4758] Current vLLM config is not set.
WARNING 06-19 02:06:26 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
PASSED

FIX #17123

github-actions · 2025-05-27T19:04:27Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

facebook-github-bot · 2025-05-27T19:04:30Z

This pull request was exported from Phabricator. Differential Revision: D75423285

simon-mo · 2025-05-27T20:03:31Z

Keep in mind that the logits would need to be serialized from the model runner back to the scheduler via RPC/collectives, so doing the counting in the model runner will be better.

markmc · 2025-05-27T20:22:46Z

It doesn't seem like we actually need an exact count of nans - we just want a signal that corruption is spiking?

What happens to the request when this happens? Is the request aborted, or ..?

Could we do something similar to the vllm:request_success_total metric - maybe add another FinishReason like CORRUPTED?
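
For illustration, the idea could look roughly like the sketch below: a finish-reason enum with an extra CORRUPTED member, and a counter keyed by finish reason in the spirit of vllm:request_success_total. Everything here is hypothetical: the enum members, `record_finished_request`, and the plain `Counter` are stand-ins, not vLLM's actual types or metrics code.

```python
import enum
from collections import Counter


class FinishReason(enum.Enum):
    """Hypothetical finish reasons; CORRUPTED is the proposed addition."""
    STOP = "stop"
    LENGTH = "length"
    ABORT = "abort"
    CORRUPTED = "corrupted"


# Stand-in for a labelled counter such as vllm:request_success_total.
finished_requests: Counter = Counter()


def record_finished_request(reason: FinishReason) -> None:
    """Increment the counter for the given finish reason."""
    finished_requests[reason.value] += 1


record_finished_request(FinishReason.STOP)
record_finished_request(FinishReason.CORRUPTED)
record_finished_request(FinishReason.STOP)
```

A dashboard could then watch the rate of the corrupted label for spikes without needing an exact NaN count.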

markmc · 2025-05-28T13:48:56Z

Some overlap with #18765 ... except I'm not sure this NaN case results in the request explicitly failing?

vladmihailescu · 2025-05-28T19:03:31Z

@markmc with NaNs the requests won't fail by default. With the existing behaviour the output of those requests will just be gibberish, which is why I'm not sure we should add a new finish reason: NaNs do not force the request to fail.

Internally we observe 2 behaviours in which NaNs appear:

1. Bad GPU - this causes all the requests landing on that GPU to have NaNs. We fail warmup if this happens during warmup (so the container restarts and retries warmup), but if it happens while the task is healthy, the container won't stop, because there can be random flaky NaNs (see point 2).
2. Flaky - very rarely, a request will have NaNs in its logits even though it was processed on a healthy GPU.

I am now trying to add per-request NaN counts, and maybe a per-request bool indicating whether the request is corrupted, but I'm not sure we should make it a finish reason.
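
A minimal sketch of the per-request flag described above, assuming the model runner already produces a mapping from request ID to NaN count. The names `RequestNanStats` and `summarize_nans` are made up for this example and are not part of the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestNanStats:
    """Hypothetical per-request record: NaN count plus a corruption flag."""
    num_nans: int

    @property
    def is_corrupted(self) -> bool:
        # Any NaN in the logits marks the request output as corrupted.
        return self.num_nans > 0


def summarize_nans(num_nans_in_logits: dict[str, int]) -> dict[str, RequestNanStats]:
    """Wrap raw per-request NaN counts in RequestNanStats records."""
    return {
        req_id: RequestNanStats(num_nans=count)
        for req_id, count in num_nans_in_logits.items()
    }


stats = summarize_nans({"req-0": 0, "req-1": 3})
num_corrupted = sum(s.is_corrupted for s in stats.values())
print(num_corrupted)
```

The request still finishes normally in this sketch; the flag only records that its output should not be trusted.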

houseroad

Looks good to me. Thanks for adding this metric.

vladmihailescu · 2025-06-20T04:17:51Z

V1 test timed out. Rebasing after the fix (#19872)

yeqcharlotte · 2025-06-20T05:12:44Z

vllm/v1/worker/gpu_model_runner.py

is logits optionally None here or not?

yeqcharlotte · 2025-06-20T05:17:47Z

vllm/v1/worker/gpu_model_runner.py

nit: you have two "if logits is not None" checks here and one more in the loop. For better readability it might make sense to do the following:

if logits is None:
    return {req_id: 0 for ...}
num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()  # count nans
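
Spelled out, the suggested shape is an early return followed by a single counting pass. The sketch below uses plain Python lists in place of tensors so it runs without PyTorch; `count_nans_per_request` and its parameters are illustrative, and the real code would count with logits.isnan().sum(dim=-1) as in the comment above.

```python
import math
from typing import Optional, Sequence


def count_nans_per_request(
    req_ids: Sequence[str],
    logits: Optional[Sequence[Sequence[float]]],
) -> dict[str, int]:
    """Return the number of NaN logits for each request.

    Row i of `logits` is assumed to belong to req_ids[i].
    """
    # Guard clause: with no logits, every request reports zero NaNs.
    if logits is None:
        return {req_id: 0 for req_id in req_ids}

    # Single pass over the rows; no further None checks are needed.
    return {
        req_id: sum(1 for value in row if math.isnan(value))
        for req_id, row in zip(req_ids, logits)
    }


nan = float("nan")
print(count_nans_per_request(["a", "b"], [[0.1, nan, nan], [1.0, 2.0, 3.0]]))
print(count_nans_per_request(["a", "b"], None))
```

Handling the None case once at the top keeps the counting path free of repeated checks.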

yeqcharlotte

thanks for adding this! leaving some small nits only.

vladmihailescu · 2025-06-20T07:43:47Z

Addressed nits

…-project#18777)

Summary: Pull Request resolved: vllm-project#18777

Signed-off-by: Vlad Mihailescu <[email protected]>

Report NaNs in logits in scheduler_stats. This can later be used to bump a Prometheus counter, but for now it is required so we can export it in our internal counter infra.

This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits during model forward passes.

It's a common metric we expose internally.

Reviewed By: Adolfo-Karim

Differential Revision: D75423285

Signed-off-by: Vlad Mihailescu <[email protected]>

vladmihailescu requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners May 27, 2025 19:04

mergify bot added v1 tpu Related to Google TPUs labels May 27, 2025

vladmihailescu force-pushed the export-D75423285 branch from ec80866 to f1f3d22 Compare May 27, 2025 20:02

vladmihailescu force-pushed the export-D75423285 branch from f1f3d22 to b985bba Compare May 27, 2025 20:06

vladmihailescu force-pushed the export-D75423285 branch from b985bba to 15b58d4 Compare May 27, 2025 20:08

vladmihailescu force-pushed the export-D75423285 branch from 15b58d4 to fc7d538 Compare May 27, 2025 20:12

vladmihailescu force-pushed the export-D75423285 branch from fc7d538 to 957944f Compare May 27, 2025 20:12

vladmihailescu force-pushed the export-D75423285 branch from 957944f to 7282ec4 Compare May 27, 2025 20:16

vladmihailescu force-pushed the export-D75423285 branch from 7282ec4 to eed0d23 Compare May 29, 2025 09:08

vladmihailescu requested a review from yeqcharlotte June 19, 2025 09:19

houseroad approved these changes Jun 19, 2025

houseroad added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 19, 2025

vladmihailescu force-pushed the export-D75423285 branch from 7b3c0b1 to 8c7244c Compare June 20, 2025 04:01

yeqcharlotte reviewed Jun 20, 2025

yeqcharlotte reviewed Jun 20, 2025

yeqcharlotte approved these changes Jun 20, 2025

vladmihailescu force-pushed the export-D75423285 branch from 8c7244c to 451048e Compare June 20, 2025 07:43

vladmihailescu force-pushed the export-D75423285 branch from 451048e to 3bb49c3 Compare June 20, 2025 07:47

houseroad merged commit 2e3e3c8 into vllm-project:main Jun 20, 2025
67 checks passed

juncheoll pushed a commit to juncheoll/vllm that referenced this pull request Jun 23, 2025

Export NaNs in logits to scheduler_stats if output is corrupted (vllm…

03a8c9a

…-project#18777) Signed-off-by: Vlad Mihailescu <[email protected]> Signed-off-by: juncheoll <[email protected]>

fhl2000 pushed a commit to fhl2000/vllm that referenced this pull request Jun 25, 2025

Export NaNs in logits to scheduler_stats if output is corrupted (vllm…

c90f70b

…-project#18777) Signed-off-by: Vlad Mihailescu <[email protected]> Signed-off-by: fhl <[email protected]>

gnovack mentioned this pull request Jul 28, 2025

[Bug]: IndexError: list index out of range on chunked prefill with speculative decoding #20531

Open

1 task

This was referenced Oct 22, 2025

[Feature]: Add num_corrupted_request metric to V1 metrics system. #27301

Open

[Feature]: Add corrupted request metric to V1 metrics system. #27306

Open

DarkLight1337 mentioned this pull request Nov 5, 2025

[Feature]: Automatically detect numerical issues #17123

Closed

1 task

7 participants
