-
-
Notifications
You must be signed in to change notification settings - Fork 11k
Export NaNs in logits to scheduler_stats if output is corrupted #18777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
|
This pull request was exported from Phabricator. Differential Revision: D75423285 |
ec80866 to
f1f3d22
Compare
|
Keep in mind that the logits needs to be serialized from model runner back to the scheduler via RPC/collectives. So doing the counting in model runner will be better. |
|
This pull request was exported from Phabricator. Differential Revision: D75423285 |
f1f3d22 to
b985bba
Compare
b985bba to
15b58d4
Compare
|
This pull request was exported from Phabricator. Differential Revision: D75423285 |
15b58d4 to
fc7d538
Compare
fc7d538 to
957944f
Compare
|
This pull request was exported from Phabricator. Differential Revision: D75423285 |
957944f to
7282ec4
Compare
|
It doesn't seem like we actually need an exact count of nans - we just want a signal that corruption is spiking? What happens the request when this happens? Is the request aborted, or ..? Could we do something similar to the |
|
Some overlap with #18765 ... except I'm not sure this NaN case results in the request explicitly failing? |
|
@markmc with NaNs the requests won't fail by default. With the existing behaviour the output of the requests will just be gibberish and that's why I'm not sure if we should add a new finish reason because NaNs will not forcefully fail the request. Internally we observe 2 behaviours in which NaNs appear:
I am trying to add now the per request NaNs and maybe have a bool per request telling if it's corrupted or not, but not sure if we should make it a finish reason. |
7282ec4 to
eed0d23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for adding this metrics.
7b3c0b1 to
8c7244c
Compare
|
V1 test timed out. Rebasing after the fix (#19872) |
vllm/v1/worker/gpu_model_runner.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is logits optionally None here or not?
vllm/v1/worker/gpu_model_runner.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you got 2 if logits is not none check here and one in the loop. for better readability it might make sense to do the following:
if logits is None:
return {req_id: 0 for ...}
num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()
# count nans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this! leaving some small nits only.
8c7244c to
451048e
Compare
|
Addressed nits |
…-project#18777) Summary: Pull Request resolved: vllm-project#18777 Signed-off-by: Vlad Mihailescu <[email protected]> Report nan in logits in scheduler_stats. This can be used later to bump Phrometeus counter but for now this is required so we can export it in our internal counter infra. This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits during model forward passes. It's a common metric we expose internally. Reviewed By: Adolfo-Karim Differential Revision: D75423285 Signed-off-by: Vlad Mihailescu <[email protected]>
451048e to
3bb49c3
Compare
Signed-off-by: nie3e <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> added notebooks to playground updates remoted verbatim HF secrets from all files updates [custom_op][vllm-plugin] update custom_op class to use op_registry (vllm-project#19164) Signed-off-by: Chendi.Xue <[email protected]> Export NaNs in logits to scheduler_stats if output is corrupted (vllm-project#18777) Signed-off-by: Vlad Mihailescu <[email protected]> [CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests (vllm-project#19901) Signed-off-by: jiang1.li <[email protected]> [Kernel] mark TorchSDPABackend swap_blocks NotImplementedError (vllm-project#19749)
…-project#18777) Signed-off-by: Vlad Mihailescu <[email protected]> Signed-off-by: juncheoll <[email protected]>
…-project#18777) Signed-off-by: Vlad Mihailescu <[email protected]> Signed-off-by: fhl <[email protected]>
Signed-off-by: Vlad Mihailescu [email protected]
Summary:
Report nan in logits in scheduler_stats. This can be used later exported as Phrometeus counter but for now this is required so we can export it in our internal counter infra.
This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits.
It's a common metric we expose.
Reviewed By: Adolfo-Karim
Differential Revision: D75423285
And unit test
FIX #17123