Provide capability to hibernate and wake-up etcd clusters and handle it completely via etcd-druid #922

Open
unmarshall opened this issue Nov 13, 2024 · 0 comments
Labels
area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension


How to categorize this issue?

/area control-plane
/kind enhancement

What would you like to be added:

In the Gardener use case, shoots can be hibernated. As part of hibernating a shoot, the following is done:

  1. A full snapshot is taken for clusters where backup is enabled. This ensures that when the etcd cluster wakes up it does not lose any data and can reliably restore from the last known state.
  2. The etcd cluster is scaled down to 0 replicas.

This issue proposes that hibernation and wake-up from hibernation be offered as functionality directly in etcd-druid, so that it can be used both inside and outside the Gardener context. The consumer can mark an etcd cluster to be hibernated and watch Etcd.Status to follow the progress of hibernation. The steps needed to hibernate a cluster are determined from the Etcd resource itself.
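A minimal sketch in Go of how the hibernation steps could be derived from the resource. The names `EtcdSpec`, `Step`, and `HibernationSteps` are hypothetical and used only for illustration; they are not part of the existing etcd-druid API.

```go
package main

import "fmt"

// EtcdSpec is a hypothetical, trimmed-down view of the Etcd resource.
// It is an assumption for this sketch, not the real etcd-druid type.
type EtcdSpec struct {
	Replicas      int32
	BackupEnabled bool
}

// Step names one action performed while hibernating a cluster.
type Step string

const (
	StepFullSnapshot Step = "TakeFullSnapshot"
	StepScaleDown    Step = "ScaleDownToZero"
)

// HibernationSteps returns the ordered steps for hibernating a cluster.
// A full snapshot is only planned when backup is enabled; scaling down
// to zero replicas is always the last step.
func HibernationSteps(spec EtcdSpec) []Step {
	var steps []Step
	if spec.BackupEnabled {
		steps = append(steps, StepFullSnapshot)
	}
	steps = append(steps, StepScaleDown)
	return steps
}

func main() {
	fmt.Println(HibernationSteps(EtcdSpec{Replicas: 3, BackupEnabled: true}))
	fmt.Println(HibernationSteps(EtcdSpec{Replicas: 3, BackupEnabled: false}))
}
```

With backup enabled the plan contains both steps; with backup disabled only the scale-down remains.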

Following points can be considered when defining the API:

  • The API should provide a way to signal both a request to hibernate and a request to wake up.
  • An etcd cluster can be created with backup disabled. For such clusters, hibernation will not include taking a full snapshot, as there is no safety net configured to back up full and delta snapshots.
  • Optimize costs on cluster hibernation #859 discusses how to optimize costs when hibernating etcd clusters. This could be made configurable and used for clusters that do not want a backup but do want to preserve their data solely on the network-attached disks (PVs) that are attached to the node and used by the etcd pods. (This is optional.)
  • Taking and uploading a full snapshot could fail. One could offer two modes for taking a backup before scaling down: preferred | required. If it is required, consider providing a timeout beyond which etcd-druid will no longer retry and will report failure. Manual intervention is then needed to fix whatever is blocking the full snapshot from being taken and uploaded to a bucket, after which the operation can be re-triggered.
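The preferred | required modes and the timeout could be sketched in Go as follows. Every name here (`HibernationSpec`, `SnapshotPolicy`, `SnapshotTimeout`, `DecideAfterSnapshot`, `ErrSnapshotFailed`) is hypothetical and only illustrates one possible shape of the API. In particular, treating "preferred" as "scale down even if the snapshot failed" is an assumption of this sketch, not something the proposal defines.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// SnapshotPolicy is a hypothetical setting describing how strictly a full
// snapshot is needed before the cluster is scaled down.
type SnapshotPolicy string

const (
	SnapshotPreferred SnapshotPolicy = "Preferred"
	SnapshotRequired  SnapshotPolicy = "Required"
)

// HibernationSpec is a hypothetical API fragment for requesting hibernation.
type HibernationSpec struct {
	Enabled         bool           // true requests hibernation, false requests wake-up
	SnapshotPolicy  SnapshotPolicy // preferred | required
	SnapshotTimeout time.Duration  // only relevant when the policy is Required
}

// Decision is what the operator does after a snapshot attempt.
type Decision string

const (
	DecisionScaleDown Decision = "ScaleDown"
	DecisionRetry     Decision = "RetrySnapshot"
	DecisionFail      Decision = "Fail"
)

// ErrSnapshotFailed is a placeholder error used in the usage example.
var ErrSnapshotFailed = errors.New("full snapshot upload failed")

// DecideAfterSnapshot maps the outcome of a snapshot attempt to the next action.
// elapsed is the time spent so far trying to take the snapshot.
func DecideAfterSnapshot(spec HibernationSpec, snapshotErr error, elapsed time.Duration) Decision {
	if snapshotErr == nil {
		return DecisionScaleDown
	}
	if spec.SnapshotPolicy == SnapshotPreferred {
		// Assumption: a failed snapshot does not block hibernation in this mode.
		return DecisionScaleDown
	}
	if elapsed >= spec.SnapshotTimeout {
		// Required mode and timeout exceeded: stop retrying, report failure.
		return DecisionFail
	}
	return DecisionRetry
}

func main() {
	spec := HibernationSpec{Enabled: true, SnapshotPolicy: SnapshotRequired, SnapshotTimeout: 10 * time.Minute}
	fmt.Println(DecideAfterSnapshot(spec, ErrSnapshotFailed, 2*time.Minute))
	fmt.Println(DecideAfterSnapshot(spec, ErrSnapshotFailed, 15*time.Minute))
}
```

In the required mode a failure before the timeout leads to a retry, and a failure after the timeout is reported so that an operator can intervene and re-trigger the operation.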

Why is this needed:

Hibernation and wake-up of etcd clusters are already supported in Gardener via reconcile loops running in gardenlet. However, this is not exposed as functionality to non-Gardener users. A clear and well-defined API to signal hibernation and wake-up of an etcd cluster would ease consumption. It also makes semantic sense for etcd-druid (an etcd operator) to abstract all activities performed as part of hibernation and wake-up, which reduces the burden on consumers to understand the details.

@gardener-robot gardener-robot added area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension labels Nov 13, 2024