Bug: Gaps under load in scheduler's node resource metrics #1276

sharnoff · 2025-02-21T13:53:03Z

Environment

Production

Steps to reproduce

Put the scheduler under load (e.g., sustaining >60 reconcile operations per second).

Expected result

The scheduler plugin's node resource metrics should have no gaps.

Actual result

There are occasional, small gaps in the metrics. For example:

Other logs, links

...

Fixes #1276. Currently, we update these node metrics by removing all the old ones and then adding back the current values. We do it that way so that old values get cleaned up when labels change. However, if metrics are scraped in between removing the old and adding the new, we can end up with single-datapoint gaps for one node at a time. To fix this, we should skip removing the old metrics if and only if the labels are unchanged, which we can check just by storing the previous labels we used.
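The idea can be sketched roughly as follows. This is a minimal illustration, not the code from #1277: the `labels` struct, the `nodeMetrics` type, and its fields are all hypothetical, and a plain map stands in for the real metrics registry so that the example runs with only the standard library.

```go
package main

import "fmt"

// labels is a hypothetical label set for one node's metric series.
type labels struct {
	node string
	zone string
}

// nodeMetrics is a hypothetical stand-in for the plugin's metrics state.
type nodeMetrics struct {
	series  map[labels]float64 // what a scrape would observe
	prev    map[string]labels  // node name -> labels used on the last update
	deletes int                // number of series removed, for illustration
}

func newNodeMetrics() *nodeMetrics {
	return &nodeMetrics{
		series: make(map[labels]float64),
		prev:   make(map[string]labels),
	}
}

// update writes the current value for a node. If the labels are unchanged,
// the existing series is overwritten in place and is never absent, so a
// scrape cannot land in a gap. Only when the labels differ from the
// previous update is the stale series removed.
func (m *nodeMetrics) update(node string, l labels, value float64) {
	old, seen := m.prev[node]

	// Write the new value first so the node always has a series present.
	m.series[l] = value
	m.prev[node] = l

	// Clean up the old series only if the labels actually changed.
	if seen && old != l {
		delete(m.series, old)
		m.deletes++
	}
}

func main() {
	m := newNodeMetrics()
	m.update("node-a", labels{node: "node-a", zone: "az-1"}, 4)
	m.update("node-a", labels{node: "node-a", zone: "az-1"}, 6) // same labels: in-place
	m.update("node-a", labels{node: "node-a", zone: "az-2"}, 6) // labels changed: cleanup
	fmt.Println(len(m.series), m.deletes)
}
```

Comparing against the stored previous labels keeps the cleanup behavior for label changes while making the common case (unchanged labels) a plain overwrite.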

sharnoff added the c/autoscaling/scheduler (Component: autoscaling: k8s scheduler) and t/bug (Issue Type: Bug) labels Feb 21, 2025

sharnoff self-assigned this Feb 21, 2025

sharnoff changed the title ~~Bug:~~ Bug: Gaps under load in scheduler's node resource metrics Feb 21, 2025

sharnoff mentioned this issue Feb 21, 2025

plugin/metrics: Update metrics in-place when possible #1277

Merged

sharnoff closed this as completed in #1277 Feb 26, 2025
