Bug: Gaps under load in scheduler's node resource metrics #1276

Closed
sharnoff opened this issue Feb 21, 2025 · 0 comments · Fixed by #1277
Assignees
Labels
c/autoscaling/scheduler Component: autoscaling: k8s scheduler t/bug Issue Type: Bug

Comments

@sharnoff
Member

Environment

Production

Steps to reproduce

Put the scheduler under load (e.g., sustaining >60 reconcile operations per second).

Expected result

The scheduler plugin's node resource metrics should have no gaps.

Actual result

There are occasional, small gaps in the metrics. For example:

[Screenshot: Grafana panel titled "percent CPU reserved", showing a single series with a gap in the middle where one datapoint is missing]

Other logs, links

  • ...
@sharnoff sharnoff added c/autoscaling/scheduler Component: autoscaling: k8s scheduler t/bug Issue Type: Bug labels Feb 21, 2025
@sharnoff sharnoff self-assigned this Feb 21, 2025
@sharnoff sharnoff changed the title from "Bug:" to "Bug: Gaps under load in scheduler's node resource metrics" Feb 21, 2025
sharnoff added a commit that referenced this issue Feb 21, 2025
Fixes #1276.

Currently, the way we update these node metrics is by removing all the
old ones, then adding back the current values.

If metrics are scraped in between removing the old and adding the new,
we can end up with single-datapoint gaps for one node at a time.

So to fix this, we should avoid removing the old metrics if and only if
the labels are unchanged -- which we can check just by storing the
previous labels we used.
sharnoff added a commit that referenced this issue Feb 26, 2025
Fixes #1276.

Currently, the way we update these node metrics is by removing all the
old ones, then adding back the current values. We do it that way so that
the old values can be cleaned up when there are label changes.

However: if metrics are scraped in between removing the old and adding
the new, we can end up with single-datapoint gaps for one node at a
time.

So to fix this, we should avoid removing the old metrics if and only if
the labels are unchanged -- which we can check just by storing the
previous labels we used.
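
The approach described in the commit message can be illustrated with a minimal sketch. This is not the code from #1277: `gaugeStore`, `nodeMetrics`, `nodeLabels`, and the `Zone` label are hypothetical names, and the in-memory map only stands in for a real labeled gauge (with the Prometheus Go client, the equivalent operations would presumably be `GaugeVec.DeleteLabelValues` and `GaugeVec.WithLabelValues(...).Set`). The point is that the old series is deleted only when the labels differ from the ones stored for that node, so a scrape landing mid-update still sees a value in the common case.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeLabels is a hypothetical label set for one node's metric series.
// It is a comparable struct, so it can be used directly as a map key.
type nodeLabels struct {
	Node string
	Zone string
}

// gaugeStore is a stand-in for a labeled gauge collection (assumption:
// a real scheduler would use a metrics library instead of this map).
type gaugeStore struct {
	mu     sync.Mutex
	values map[nodeLabels]float64
}

func newGaugeStore() *gaugeStore {
	return &gaugeStore{values: make(map[nodeLabels]float64)}
}

func (s *gaugeStore) Set(l nodeLabels, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[l] = v
}

func (s *gaugeStore) Delete(l nodeLabels) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.values, l)
}

// Scrape returns a copy of all current series, like a metrics scrape would.
func (s *gaugeStore) Scrape() map[nodeLabels]float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[nodeLabels]float64, len(s.values))
	for l, v := range s.values {
		out[l] = v
	}
	return out
}

// nodeMetrics remembers the labels last used for each node, so that an
// update only removes the old series when the labels actually changed.
type nodeMetrics struct {
	store      *gaugeStore
	prevLabels map[string]nodeLabels // node name -> labels used last time
}

func newNodeMetrics(store *gaugeStore) *nodeMetrics {
	return &nodeMetrics{store: store, prevLabels: make(map[string]nodeLabels)}
}

// Update writes the node's current value. With unchanged labels the series
// is overwritten in place and never removed, so no scrape can miss it.
func (m *nodeMetrics) Update(node string, labels nodeLabels, value float64) {
	if prev, ok := m.prevLabels[node]; ok && prev != labels {
		// Labels changed: the old series is stale and must be cleaned up.
		m.store.Delete(prev)
	}
	m.store.Set(labels, value)
	m.prevLabels[node] = labels
}

func main() {
	store := newGaugeStore()
	metrics := newNodeMetrics(store)

	labels := nodeLabels{Node: "node-1", Zone: "zone-a"}
	metrics.Update("node-1", labels, 0.50)
	metrics.Update("node-1", labels, 0.75) // same labels: overwrite, no removal
	fmt.Println(store.Scrape())

	// A label change still removes the stale series.
	metrics.Update("node-1", nodeLabels{Node: "node-1", Zone: "zone-b"}, 0.80)
	fmt.Println(store.Scrape())
}
```

In this sketch a gap is still possible at the moment a node's labels change, since the stale series is deleted before the new one is set; that matches the trade-off in the commit message, which only avoids the removal when the labels are unchanged.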