Proper procedures for recovering an EKS-A cluster from a broken state #9968

@SunghoHong-gif

Description

@SunghoHong-gif

I have a question about recovering clusters provisioned with EKS Anywhere (EKS-A) from a broken state.

Suppose a cluster has failed machines in both the control plane and the worker nodes, and these machines are assumed to be unrecoverable (they are physically broken, so new bare metal machines have to be added to replace them). How should this be handled when we want to run a cluster upgrade?

example@example-admin:~$ kubectl get nodes
NAME                STATUS                        ROLES           AGE    VERSION
example-cp3-26       Ready                         control-plane   191d   v1.28.15
example-cp3-27       NotReady,SchedulingDisabled   control-plane   191d   v1.29.13
example-cp5-26       Ready                         control-plane   191d   v1.28.15
example-gpu-wk3-9    NotReady,SchedulingDisabled   <none>          191d   v1.29.13
example-gpu-wk5-11   Ready                         <none>          191d   v1.28.15
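
For context, the underlying Cluster API and Tinkerbell objects behind those nodes can be listed roughly like this (a sketch only, assuming the default EKS-A bare metal layout where these objects live in the eksa-system namespace):

example@example-admin:~$ kubectl get machines.cluster.x-k8s.io -n eksa-system -o wide
example@example-admin:~$ kubectl get hardware.tinkerbell.org -n eksa-system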

Are we expected to manually add healthy control plane and worker nodes before proceeding with the cluster upgrade (roughly the flow sketched at the end of this question)?
Or are we expected to re-provision the cluster from scratch and restore from backups?

I’m trying to understand the intended recovery path when the cluster is in an unstable state and cannot be restored using the originally provisioned machines.
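
To make the first option concrete, this is roughly the flow I have in mind. It is only a sketch under the assumption that replacement machines can be registered via a new hardware CSV at upgrade time (if I understand correctly, the Tinkerbell provider accepts --hardware-csv on upgrade); the file names below are hypothetical, and I am not sure this is the supported path when the existing machines are unrecoverable:

# Register the replacement bare metal machines while running the upgrade,
# passing the new hardware inventory alongside the updated cluster spec.
eksctl anywhere upgrade cluster -f example-cluster.yaml \
  --hardware-csv new-hardware.csv \
  --kubeconfig example/example-eks-a-cluster.kubeconfig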
