ClusterAutoscaler does not back off FAST correctly from ResourceExhausted error in some situations #330

Closed
dongyingbo opened this issue Oct 21, 2024 · 2 comments
Labels
kind/bug Bug status/closed Issue is closed (either delivered or triaged)

Comments

@dongyingbo

What happened:
We saw slow cluster creation when default (m6i.2xlarge) was out of capacity, even though we had other nodegroups that allow system components at the time.

We have default (m6i.2xlarge) enabled in zones eu-central-1a, eu-central-1b and eu-central-1c.
The logs showed it was out of capacity in both zoneA and zoneC.
However, CA quickly marked the nodegroup in zoneC as unhealthy, but had still not marked the nodegroup in zoneA as unhealthy after more than 20 minutes.
What looks suspicious to me is that, for the nodegroup in zoneA, I saw many log lines like:
{"log":"Error while trying to delete nodes from shoot--hc-dev--i502777-2-orc-default-z1: MachineDeployment shoot--hc-dev--myshoot-2-orc-default-z1 is under rolling update , cannot reduce replica count","pid":"1","severity":"WARN","source":"static_autoscaler.go:898"}
But I did not see similar log lines for the nodegroup in zoneC.
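
To make the per-zone comparison above easier to repeat, here is a minimal sketch that counts the "under rolling update" warnings per nodegroup in saved autoscaler logs. It assumes each log line is a JSON object with a `log` field, as in the line quoted above. The function name `count_rolling_update_warnings` and the regular expression are illustrative assumptions derived from that single line, and they are not part of the autoscaler itself.

```python
import json
import re
from collections import Counter

# Assumption: the message shape is taken from the one WARN line quoted above;
# other wordings of the same warning would not be matched.
ROLLING_UPDATE = re.compile(
    r"Error while trying to delete nodes from (?P<nodegroup>\S+): "
    r"MachineDeployment \S+ is under rolling update"
)


def count_rolling_update_warnings(lines):
    """Count 'under rolling update' warnings per nodegroup in JSON log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not log objects
        match = ROLLING_UPDATE.search(str(entry.get("log", "")))
        if match:
            counts[match.group("nodegroup")] += 1
    return counts
```

Passing an open log file (or `sys.stdin` when piping saved logs) as `lines` returns a `Counter` keyed by nodegroup name. A nodegroup that shows many hits while another shows none would match the zoneA versus zoneC difference described here.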

What you expected to happen:
The nodegroup should be backed off quickly on a ResourceExhausted error in every situation.

How to reproduce it (as minimally and precisely as possible):
There is no easy way to simulate a node type being out of capacity.

Anything else we need to know:
N/A

Environment:
N/A

@dongyingbo dongyingbo added the kind/bug Bug label Oct 21, 2024
@dongyingbo
Author

Is this something that can be improved by the new flags planned in #176?

@dongyingbo
Author

Closing, as I cannot provide detailed logs for now.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 22, 2024