What happened:
We saw slow cluster creation when default(m6i.2xlarge) was out of capacity, even though other nodegroups that allow system components were available at the time.
We have default(m6i.2xlarge) enabled on zones eu-central-1a, eu-central-1b and eu-central-1c.
The logs showed that it was out of capacity in both zoneA and zoneC.
However, CA quickly marked the nodegroup in zoneC as unhealthy, but it had still not marked the nodegroup in zoneA as unhealthy after more than 20 minutes.
What looks suspicious to me is that, for the nodegroup in zoneA, I saw many log entries like this:
{"log":"Error while trying to delete nodes from shoot--hc-dev--i502777-2-orc-default-z1: MachineDeployment shoot--hc-dev--myshoot-2-orc-default-z1 is under rolling update , cannot reduce replica count","pid":"1","severity":"WARN","source":"static_autoscaler.go:898"}
I did not see any similar log entries for the nodegroup in zoneC.
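To make the zoneA/zoneC comparison easier to check, here is a rough sketch that counts these warnings per MachineDeployment in a CA log dump. It assumes each log line is a JSON object with a `log` field, as in the entry quoted above, and that the message wording matches that entry. The function name `count_rolling_update_warnings` is made up for illustration and is not part of CA.

```python
import json
import re
from collections import Counter

# Message pattern copied from the warning quoted above; this wording is an
# assumption and may differ between CA versions.
ROLLING_UPDATE = re.compile(
    r"MachineDeployment (?P<md>\S+) is under rolling update"
)


def count_rolling_update_warnings(lines):
    """Count 'under rolling update' warnings per MachineDeployment name."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON log records
        if not isinstance(record, dict):
            continue
        match = ROLLING_UPDATE.search(str(record.get("log", "")))
        if match:
            counts[match.group("md")] += 1
    return counts
```

Running it over the CA log (for example `count_rolling_update_warnings(open("ca.log"))`, where `ca.log` is a placeholder file name) should show whether only the zoneA MachineDeployment produces these warnings.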
What you expected to happen:
The nodegroup should be backed off quickly on a ResourceExhausted error in every situation.
How to reproduce it (as minimally and precisely as possible):
There is no easy way to simulate an instance type being out of capacity.
Anything else we need to know:
N/A
Environment:
N/A