ClusterAutoscaler does not back off FAST correctly from ResourceExhausted error in some situation #330

dongyingbo · 2024-10-21T12:50:10Z

What happened:
We saw slow cluster creation when default (m6i.2xlarge) was out of capacity, even though we had other nodegroups that allow system components at the same time.

We have default (m6i.2xlarge) enabled in zones eu-central-1a, eu-central-1b and eu-central-1c.
The log showed that it was out of capacity in both zoneA and zoneC.
However, CA quickly marked the nodegroup in zoneC unhealthy, but it had still not marked the nodegroup in zoneA unhealthy after more than 20 minutes.
What looks suspicious to me is that, for the nodegroup in zoneA, I saw many logs like this:
{"log":"Error while trying to delete nodes from shoot--hc-dev--i502777-2-orc-default-z1: MachineDeployment shoot--hc-dev--myshoot-2-orc-default-z1 is under rolling update , cannot reduce replica count","pid":"1","severity":"WARN","source":"static_autoscaler.go:898"}
But I did not see similar logs for the nodegroup in zoneC.
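
To check whether this warning really shows up only for the nodegroup that is not backed off, the CA log could be tallied per nodegroup. The following is a minimal sketch, assuming the log is available as one JSON object per line with a `log` field, as in the sample above. The function name `count_rolling_update_warnings` and the regular expression are illustrative only (the pattern is derived from that single sample line) and are not part of CA.

```python
import json
import re
from collections import Counter

# Matches the warning text quoted above and captures the nodegroup name.
# Assumption: the message format is inferred from one sample log line.
PATTERN = re.compile(
    r"Error while trying to delete nodes from (\S+): "
    r"MachineDeployment \S+ is under rolling update"
)


def count_rolling_update_warnings(lines):
    """Return a Counter mapping nodegroup name to the number of matching warnings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        match = PATTERN.search(str(entry.get("log", "")))
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to the function (for example `count_rolling_update_warnings(open(path))`, where `path` is wherever the CA log was saved) gives a per-nodegroup count. A high count for the zoneA nodegroup and none for zoneC would support the suspicion above, but it would not by itself prove that the rolling update is what blocks the back-off.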

What you expected to happen:
The nodegroup should be backed off fast on a ResourceExhausted error in every situation.

How to reproduce it (as minimally and precisely as possible):
There is no easy way to simulate a node type being out of capacity.

Anything else we need to know:
N/A

Environment:
N/A

dongyingbo · 2024-10-21T12:51:11Z

Is this something that can be improved by the new flags planned in #176?

dongyingbo · 2024-10-22T03:48:26Z

Closing as I cannot provide detailed logs for now.

dongyingbo added the kind/bug Bug label Oct 21, 2024

dongyingbo closed this as completed Oct 22, 2024

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 22, 2024
