Proper procedures for recovering an EKS-A cluster from a broken state #9968

@SunghoHong-gif

Description

@SunghoHong-gif

I have a question about recovering clusters provisioned with EKS Anywhere (EKS-A) from a broken state.

Suppose a cluster has failed machines in both the control plane and the worker nodes, and these machines are assumed to be unrecoverable (they are physically broken, so new bare metal machines have to be added to replace them). How should this be handled when we want to run a cluster upgrade?

example@example-admin:~$ kubectl get nodes
NAME                STATUS                        ROLES           AGE    VERSION
example-cp3-26       Ready                         control-plane   191d   v1.28.15
example-cp3-27       NotReady,SchedulingDisabled   control-plane   191d   v1.29.13
example-cp5-26       Ready                         control-plane   191d   v1.28.15
example-gpu-wk3-9    NotReady,SchedulingDisabled   <none>          191d   v1.29.13
example-gpu-wk5-11   Ready                         <none>          191d   v1.28.15
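
For context, the underlying Cluster API and Tinkerbell objects behind those nodes can be listed roughly like this (a sketch only, assuming the default EKS-A bare metal layout where these objects live in the eksa-system namespace):

example@example-admin:~$ kubectl get machines.cluster.x-k8s.io -n eksa-system -o wide
example@example-admin:~$ kubectl get hardware.tinkerbell.org -n eksa-system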

Are we expected to manually add healthy control plane and worker nodes before proceeding with the cluster upgrade (roughly the flow sketched at the end of this question)?
Or are we expected to re-provision the cluster from scratch and restore from backups?

I’m trying to understand the intended recovery path when the cluster is in an unstable state and cannot be restored using the originally provisioned machines.
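
To make the first option concrete, this is roughly the flow I have in mind. It is only a sketch under the assumption that replacement machines can be registered via a new hardware CSV at upgrade time (if I understand correctly, the Tinkerbell provider accepts --hardware-csv on upgrade); the file names below are hypothetical, and I am not sure this is the supported path when the existing machines are unrecoverable:

# Register the replacement bare metal machines while running the upgrade,
# passing the new hardware inventory alongside the updated cluster spec.
eksctl anywhere upgrade cluster -f example-cluster.yaml \
  --hardware-csv new-hardware.csv \
  --kubeconfig example/example-eks-a-cluster.kubeconfig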
