Support Ephemeral Storage #887

vlerenc · 2024-10-08T09:33:19Z

What would you like to be added:
Please support the operation of ETCD with ephemeral persistent volumes (sounds like a contradiction), e.g. hostPath or, better/safer yet, local, so that network-attached persistent volumes, which are often a scarce machine resource, can be avoided (e.g. AWS can attach only 26 or 32 volumes for most machine types; Alicloud and Azure even fewer).

Why is this needed:
We observe that machines can rarely be fully utilised because of the high ratio of pods-with-volumes to pods-without-volumes in a Gardener managed shoot cluster control plane. If the ETCD for events could be configured to avoid network attached persistent volumes, we could improve the machine utilisation considerably (at the expense of only limited additional network costs to "catch up" when a pod is moved to another node).

Considerations:

Losing 1 of 1 pods (non-HA) or 2 of 3 pods (HA) will result in an unrecoverable permanent quorum loss. Because this could happen more frequently without network-attached persistent volumes, ETCD druid should detect it and, in the case of ephemeral persistent volumes, discard the statefulset and recreate it from scratch (in the context of events, this seems acceptable in many cases, as the default events TTL is only 1h anyway and events are not a critical/essential resource for the operation of a cluster).
While backup and restore can be added (later), it doesn't have to be added right from the start. Whoever uses ETCD druid should have the liberty to decide for ephemeral persistent volumes.
In order to stick to stateful sets (we don't have to, but it would make things easier), we need to find a PV(C) type that would work for us, e.g. local. So we need to experiment with it and see whether it works as expected, can be dynamically configured (now multiple ETCD pods would need different local paths on the node), and also the cleanup works (data is deleted once the pod is descheduled from the node).
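The quorum arithmetic behind the first consideration can be sketched as follows. This is a minimal illustration, assuming the usual majority rule that etcd's Raft consensus uses; the function names are hypothetical and not part of etcd-druid.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, lost: int) -> bool:
    """True if the surviving members still form a majority."""
    return members - lost >= quorum(members)


if __name__ == "__main__":
    # (cluster size, members lost): the two cases from the issue plus a survivable one
    for members, lost in [(1, 1), (3, 1), (3, 2)]:
        state = "quorum kept" if has_quorum(members, lost) else "quorum lost"
        print(f"{lost} of {members} lost: {state}")
```

With ephemeral storage a lost member also loses its data, so the two quorum-loss cases cannot be repaired by simply restarting the pods.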

renormalize · 2025-01-01T09:58:08Z

Sorry for the delay @vlerenc.
Also would like to loop in @gardener/etcd-druid-maintainers so we're all on the same page.

There are a few strategies that we can take to solve this issue, and I will describe the pros and cons of each. I've also performed tests with each of these strategies, and at the moment I'm not convinced there's a clear winner between two of them; but I've eliminated the rest.

Kubernetes provides the following volume types that are relevant to running etcd pods utilizing storage on the node:

local: The lifecycle of this storage is not coupled to the lifecycle of the pods. These volumes have to be provisioned statically; dynamic provisioning is explicitly not supported.
hostPath: The lifecycle of this storage is not coupled to the lifecycle of the pods. hostPath is very similar to local, but is more involved to use. Kubernetes recommends using local over hostPath wherever possible; thus hostPath will not be considered.
emptyDir: The lifecycle of this storage is coupled to the lifecycle of the pods. These volumes can be specified in the Pod template of a StatefulSet.
Generic Ephemeral Volumes: The lifecycle of this storage is coupled to the lifecycle of the pods. These volumes can be specified in the Pod template of a StatefulSet. Any provisioner can be used to provision these volumes, i.e. CSI, or custom provisioners which enable local volumes to be provisioned dynamically.

`local`

Since static provisioning of local volumes is not realistic at scales at which Gardener operates, evaluation of local volumes was done with the help of https://github.com/rancher/local-path-provisioner which enables dynamic provisioning of local volumes.

rancher/local-path-provisioner creates a StorageClass, typically called local-path, which can be specified for dynamic provisioning of local volumes (volumeClaimTemplates for StatefulSets that are created for an etcd cluster will specify this StorageClass).

Pros:

There is a PersistentVolume, and PersistentVolumeClaim interface for operators of these etcd clusters.

Cons:

PersistentVolumes are created with a nodeAffinity that ties each volume, and thereby the Pod which requested its creation, to a particular node. This behavior obviously makes sense because the intent is to use a node's storage.
However, this is a major problem, since due to this nodeAffinity Pods would be stuck in Pending forever if the node is unavailable or relinquished. The PVCs that claim these PVs have to be manually deleted to ensure a new PV is created, which then enables the Pods to get scheduled.
Thus an external actor has to delete these PVCs every time the etcd pod is to be rescheduled onto another node. This has an extremely strong effect on how etcd pods are to be scheduled.
I am unsure if it is under etcd-druid's scope to keep track of which node an etcd pod is running on, and then delete the (old) PVC to enable the scheduling of this pod onto another node.
local volumes are subject to the availability of the underlying node. If a node becomes unhealthy, the local volume becomes inaccessible to the pod, and the pod cannot run.
(Source)
An external actor is needed to first create the directory on the node that corresponds to each of these "local volumes"; only then can a persistent volume be created. In this case, the external actor is local-path-provisioner. It has to be ensured that the helper pod which local-path-provisioner creates and schedules on each node, to create the directory that will be used as a local volume, has low enough privileges that it can be deployed on nodes of seed clusters.
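To make the nodeAffinity con concrete, here is a minimal sketch of the check such an external actor would need: given PV objects and the set of nodes that still exist, find the claims that can never be scheduled again. It works on plain dictionaries shaped like PersistentVolume manifests rather than on a Kubernetes client; the helper names and the sample data are hypothetical, and the use of the `kubernetes.io/hostname` key in the affinity is an assumption about how the provisioner pins its volumes.

```python
from typing import Dict, List, Set, Tuple

# Assumed label key used in the PV nodeAffinity to pin a volume to one node.
HOSTNAME_KEY = "kubernetes.io/hostname"


def pinned_nodes(pv: Dict) -> Set[str]:
    """Collect the node names a PV is pinned to via spec.nodeAffinity.required."""
    terms = (
        pv.get("spec", {})
        .get("nodeAffinity", {})
        .get("required", {})
        .get("nodeSelectorTerms", [])
    )
    nodes: Set[str] = set()
    for term in terms:
        for expr in term.get("matchExpressions", []):
            if expr.get("key") == HOSTNAME_KEY and expr.get("operator") == "In":
                nodes.update(expr.get("values", []))
    return nodes


def stranded_claims(pvs: List[Dict], live_nodes: Set[str]) -> List[Tuple[str, str]]:
    """Return (namespace, name) of claims whose PV is pinned only to missing nodes."""
    stranded = []
    for pv in pvs:
        nodes = pinned_nodes(pv)
        claim = pv.get("spec", {}).get("claimRef")
        if claim and nodes and not (nodes & live_nodes):
            stranded.append((claim["namespace"], claim["name"]))
    return stranded


if __name__ == "__main__":
    # Hypothetical PV pinned to "node-a", which is no longer part of the cluster.
    pv = {
        "spec": {
            "claimRef": {"namespace": "default", "name": "example-claim"},
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {"key": HOSTNAME_KEY, "operator": "In", "values": ["node-a"]}
                            ]
                        }
                    ]
                }
            },
        }
    }
    print(stranded_claims([pv], live_nodes={"node-b"}))
```

Every claim returned here would have to be deleted by someone before its pod could be scheduled elsewhere, which is the bookkeeping the cons above are concerned about.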

In essence, unlike a typical CSI PV having a nodeAffinity to a particular zone, local PVs will have a nodeAffinity to a particular node. This goes against the spirit of Kubernetes of not relying on a particular instance of a resource, in my opinion.

`emptyDir`

Pros:

The lifecycle of this storage is fully coupled with the lifecycle of the pod. Creation of the directory, cleanup of the used storage when the pod exits, is completely handled by Kubernetes.
Rescheduling the pod on another node is extremely simple as there are no constraints being imposed by this type of volume.

Cons:

The storage is allocated from the node ephemeral storage. If this backing storage gets filled up due to some other source like kubelet logs, or other pods using node storage, the emptyDir may run out of capacity before the limit specified is hit. (Source)
There is no PersistentVolume, and PersistentVolumeClaim interface for operators of these etcd clusters. (However, I am not totally sure if this is to be placed under Cons).

General:

A container crashing does not remove a pod from the node, thus the data in the volume is safe across crashes.
Limits can be set for this type of volume. There is no need to set requests. This is useful to ensure the storage consumed by the etcd pod doesn't balloon.
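As a rough illustration of the emptyDir option, the sketch below builds the relevant fragment of a Pod template as a Python dictionary and prints it as JSON. `emptyDir.sizeLimit` is the Kubernetes field that carries the limit mentioned above; the volume name, container name, mount path and the 8Gi value are placeholders chosen for this example, not values taken from etcd-druid.

```python
import json
from typing import Dict


def empty_dir_volume(name: str, size_limit: str) -> Dict:
    """Build a Pod volume entry backed by emptyDir with a size limit."""
    return {"name": name, "emptyDir": {"sizeLimit": size_limit}}


def pod_template_fragment(volume: Dict, mount_path: str) -> Dict:
    """Minimal Pod template fragment mounting `volume` into one container."""
    return {
        "spec": {
            "containers": [
                {
                    "name": "etcd",  # placeholder container name
                    "volumeMounts": [
                        {"name": volume["name"], "mountPath": mount_path}
                    ],
                }
            ],
            "volumes": [volume],
        }
    }


if __name__ == "__main__":
    volume = empty_dir_volume("etcd-data", "8Gi")
    print(json.dumps(pod_template_fragment(volume, "/var/etcd/data"), indent=2))
```

Because the volume lives entirely in the Pod template, no volumeClaimTemplates entry and no StorageClass are involved.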

`Generic Ephemeral Volumes`

Generic ephemeral volumes can use the provisioner provided by rancher/local-path-provisioner, through the StorageClass local-path, and thereby get a hybrid of the behavior of emptyDir and local volumes.

Pros:

The lifecycle of this storage is fully coupled with the lifecycle of the pod. Creation of the directory, cleanup of the used storage when the pod exits, is completely handled by Kubernetes.
There is a PersistentVolume, and PersistentVolumeClaim interface for operators of these etcd clusters.
There is no need to delete PVCs manually to handle the nodeAffinity con as seen in local volumes, since the volume is ephemeral. If the pod is to be scheduled onto another node, Kubernetes deletes the PVC and thereby the underlying local volume as well.

Cons:

In cases where the pods of the StatefulSet are not deleted but are suspended, like hibernation of a cluster where the StatefulSet is running, the PVCs still remain; but since the nodes which backed the PVs no longer exist, the pods remain stuck in Pending.
A new set of PVCs was created after a certain duration, but the etcd cluster never entered a healthy state even with the newer PV(C)s. (Would this even be a problem, since etcd-druid and the etcd clusters would be running in a seed cluster which will never be hibernated? How would this look in a non-Gardener context? Needs a bit more thought from me.)
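For comparison with emptyDir, a generic ephemeral volume is also declared inline in the Pod template, but it carries a `volumeClaimTemplate` from which Kubernetes creates one PVC per pod. The sketch below builds such a volume entry as a Python dictionary. The StorageClass local-path comes from the discussion above, and the 8Gi size and ReadWriteOnce access mode mirror the PVs listed later in this thread; the function names are hypothetical.

```python
import json
from typing import Dict


def generic_ephemeral_volume(name: str, storage_class: str, size: str) -> Dict:
    """Build a Pod volume entry that makes Kubernetes create a per-pod PVC."""
    return {
        "name": name,
        "ephemeral": {
            "volumeClaimTemplate": {
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,
                    "resources": {"requests": {"storage": size}},
                }
            }
        },
    }


def ephemeral_pvc_name(pod_name: str, volume_name: str) -> str:
    """Kubernetes names the generated PVC '<pod name>-<volume name>'."""
    return f"{pod_name}-{volume_name}"


if __name__ == "__main__":
    volume = generic_ephemeral_volume("etcd-test", "local-path", "8Gi")
    print(json.dumps(volume, indent=2))
    print(ephemeral_pvc_name("etcd-test-0", volume["name"]))
```

The generated PVC is owned by the pod, which is why deleting the pod also removes the claim and, with reclaim policy Delete, the underlying volume.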

emptyDir, and Generic Ephemeral Volumes using the StorageClass created through the provisioning capabilities provided by local-path-provisioner, are two promising solutions.

emptyDir requires the fewest changes in etcd-druid and does not create another dependency; it also leaves little scope for edge cases, given how simple and straightforward it is.

Generic Ephemeral Volumes is a good middle ground between emptyDir and local, but there is the mentioned edge case as a con which will have to be ironed out.

Adding support for both is trivial if support for one is added, since the code changes are agnostic, as the volume is simply passed to the StatefulSet spec from the etcd spec; and will also give consumers of etcd-druid more choice.

We can decide on how the Gardener project uses this feature based on all the above information, and I'm slightly leaning towards emptyDir for its simplicity.

PS:

Strategies to recover from quorum loss for etcds used for events need a bit more tinkering from my side, which I will update on later. If people have any suggestions, please feel free to chime in!

Draft changes to etcd-druid, and sample etcd files for emptyDir and Generic Ephemeral Volumes which were used to validate and test the above can be found at https://github.com/renormalize/etcd-druid/tree/storage

shreyas-s-rao · 2025-01-01T11:34:39Z

@renormalize thank you very much for the very detailed analysis and for penning down your thoughts so clearly.

One small correction from my side regarding the Cons of Generic Ephemeral Volumes:

> In cases where the pods of the StatefulSet are not deleted but are suspended, like hibernation of a cluster, the PVCs still remain; but since the nodes which backed the PVs no longer exist, the pods remain stuck in Pending.
> A new set of PVCs was created after a certain duration, but the etcd cluster never entered a healthy state even with the newer PV(C)s. (Would this even be a problem, since etcd-druid and the etcd clusters would be running in a seed cluster which will never be hibernated? How would this look in a non-Gardener context? Needs a bit more thought from me.)

During hibernation / scale-down of the etcd cluster, i.e., when Etcd.spec.replicas=0, etcd-druid simply sets sts.spec.replicas to 0 and the pods are deleted; they do not go into a Suspended/Pending state. Additionally, once a pod is scheduled onto a node, its lifecycle is from then on tied to the node's lifecycle; if the node is deleted/drained, the pod is deleted and the sts creates a new pod in its place, which is scheduled onto a new node. Only if no new node is available would it go into a Pending state. While in this state, no storage space has been provisioned yet, since there is no underlying node to begin with.

To me, the more pressing question is, how do we scale the etcd cluster back up. With backups disabled, it's fairly straightforward, since there is no restoration of snapshots involved. So the sts can simply be scaled up from 0 to 3 replicas, and the cluster should be back up, in a fresh state, with no data from the previous run. This could make sense for etcd-events clusters for Gardener. But when backups are enabled, then we have the limitation of having to first scale up to 1 replica and allow restoration to succeed, and then follow it up by scaling up to 3 replicas and allowing two new members to join the cluster. So using ephemeral storage is currently not an option for etcd-main clusters for Gardener.
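The two scale-up paths described here can be summarised in a small sketch. It only encodes the ordering stated above (directly to the target without backups; via a single replica when a restoration has to succeed first) and is not etcd-druid's actual scale-out logic; the function name is hypothetical.

```python
from typing import List, Tuple


def scale_out_steps(target: int, backups_enabled: bool) -> List[Tuple[int, int]]:
    """Return the (from, to) replica changes needed to scale up from 0."""
    if target <= 0:
        return []
    if backups_enabled and target > 1:
        # Restore the snapshot on a single member first, then let the rest join.
        return [(0, 1), (1, target)]
    # No restoration involved: start all members at once with a fresh state.
    return [(0, target)]


if __name__ == "__main__":
    print(scale_out_steps(3, backups_enabled=False))
    print(scale_out_steps(3, backups_enabled=True))
```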

If a user chooses to use ephemeral storage for their etcd clusters, then we must assume that they are ok with losing the etcd cluster data, and we should make this clear to them: Kubernetes pods can die at any time (due to evictions, which can be blocked by PDBs, or node failures, which are out of anybody's control), and if the storage is tied to the lifecycle of the pod, then we may lose data from all the etcd cluster members at any given point in time. Recovering from quorum loss then becomes fairly easy, because all we need to do is scale the sts down to 0 replicas and, once all the pods are terminated, scale the sts back to 3 replicas. The etcd cluster is then good as new, which is what we would expect even from a hibernation scenario when using ephemeral storage.
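The recovery sequence described above (scale to 0, wait for all pods to terminate, scale back up) can be sketched against an in-memory stand-in for a StatefulSet. `FakeStatefulSet` and `recover_from_quorum_loss` are hypothetical names for illustration only; the fake reconciles instantly, whereas a real controller would have to poll until the pods are actually gone.

```python
from typing import List


class FakeStatefulSet:
    """In-memory stand-in for a StatefulSet; this is not a Kubernetes client."""

    def __init__(self, name: str, replicas: int) -> None:
        self.name = name
        self.replicas = replicas
        self.pods: List[str] = [f"{name}-{i}" for i in range(replicas)]

    def scale(self, replicas: int) -> None:
        """Set the desired replicas and reconcile the pod list immediately."""
        self.replicas = replicas
        self.pods = [f"{self.name}-{i}" for i in range(replicas)]


def recover_from_quorum_loss(sts: FakeStatefulSet, target: int) -> List[str]:
    """Scale to 0, confirm every pod is gone, then scale back to `target`."""
    log: List[str] = []
    sts.scale(0)
    log.append("scaled to 0")
    if sts.pods:
        # With ephemeral storage, old pods (and their data) must be gone first.
        raise RuntimeError("pods still terminating; retry later")
    log.append("all pods terminated")
    sts.scale(target)
    log.append(f"scaled to {target}")
    return log


if __name__ == "__main__":
    sts = FakeStatefulSet("etcd-test", 3)
    for line in recover_from_quorum_loss(sts, target=3):
        print(line)
    print(sts.pods)
```

The important property is the ordering: no member of the old cluster may survive into the new one, so that the members bootstrap a fresh cluster together.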

renormalize · 2025-01-01T11:58:42Z

@shreyas-s-rao thanks for your feedback!

When I was speaking of hibernation, I was speaking from a Shoot cluster (a managed seed) point of view; I should have clarified that. However, I don't really think what I was talking about is too relevant (apologies for it being unclear), and etcd cluster hibernation is a far more important aspect to be discussed.

What I did which led me to talk about Pending pods was the following:

Run etcd-druid and an etcd cluster in a shoot cluster.
Hibernate the shoot cluster to see how etcd-druid and the corresponding etcd cluster react.

After this, I observed that the etcd cluster's pods were stuck in Pending.
In this case, the etcd cluster was never really scaled to 0 replicas before the shoot cluster was brought down; and this is why I saw the behavior I did. But honestly, I don't think it would ever be the case where a cluster running etcd-druid would be hibernated in that fashion.

So with regards to:

> During hibernation / scale-down of the etcd cluster, i.e., when Etcd.spec.replicas=0, etcd-druid simply sets sts.spec.replicas to 0 and the pods are deleted; they do not go into a Suspended/Pending state.

yes, you're 100% correct.

If a shoot cluster (managed seed) is ever to be hibernated, I don't think it's much of a stretch to think that the etcd cluster is explicitly set Etcd.spec.replicas=0 before it is hibernated.

> To me, the more pressing question is, how do we scale the etcd cluster back up. With backups disabled, it's fairly straightforward, since there is no restoration of snapshots involved. So the sts can simply be scaled up from 0 to 3 replicas, and the cluster should be back up, in a fresh state, with no data from the previous run. This could make sense for etcd-events clusters for Gardener. But when backups are enabled, then we have the limitation of having to first scale up to 1 replica and allow restoration to succeed, and then follow it up by scaling up to 3 replicas and allowing two new members to join the cluster. So using ephemeral storage is currently not an option for etcd-main clusters for Gardener.

Agreed. It should be made extremely clear that etcd-druid does not guarantee etcd cluster state to any extent when using ephemeral storage. This need not be the case in the future if the necessary changes are made to the scale-out logic, after which backups can be used to successfully scale out from 0 -> 1 and then 1 -> 3.

shreyas-s-rao · 2025-01-01T12:36:48Z

@renormalize this is a completely valid scenario, where the underlying nodes can be destroyed at any time. But did you observe that the existing pod went into Pending state, or was the old pod deleted and the new pod that was spun up by the sts went into Pending state because there was no underlying node to schedule it to? Can you please clarify?

renormalize · 2025-01-01T13:09:26Z

@shreyas-s-rao
A currently running etcd cluster:

[I] ~/go/src/github.com/gardener/etcd-druid (storage)
❯ kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
etcd-druid-76d9c9c8b7-j7hnh   1/1     Running   0          3h
etcd-test-0                   2/2     Running   0          2m33s
etcd-test-1                   2/2     Running   0          2m33s
etcd-test-2                   2/2     Running   0          2m33s

[I] ~/go/src/github.com/gardener/etcd-druid (storage)
❯ kubectl get sts
NAME        READY   AGE
etcd-test   3/3     2m43s

[I] ~/go/src/github.com/gardener/etcd-druid (storage)
❯ kubectl get etcd
NAME        READY   QUORATE   ALL MEMBERS READY   BACKUP READY   AGE
etcd-test   true    True      True                               2m48s

The shoot cluster where https://github.com/renormalize/etcd-druid/tree/storage and an etcd cluster are running is then hibernated.

This shoot cluster is then woken up from hibernation.

[N] ~/go/src/github.com/gardener/etcd-druid (storage)
❯ kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
etcd-druid-76d9c9c8b7-7lshd   1/1     Running   0          11m
etcd-test-0                   1/2     Running   0          10m
etcd-test-1                   1/2     Running   0          10m
etcd-test-2                   1/2     Running   0          11m

[I] ~/go/src/github.com/gardener/etcd-druid (storage)
❯ kubectl get sts
NAME        READY   AGE
etcd-test   0/3     17m

[I] ~/go/src/github.com/gardener/etcd-druid (storage)
❯ kubectl get etcd
NAME        READY   QUORATE   ALL MEMBERS READY   BACKUP READY   AGE
etcd-test   false   False     False                              17m

[I] ~/go/src/github.com/gardener/etcd-druid (storage)
❯ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                           STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
pvc-36ebf2ba-eae2-4f55-ad13-fb42eb8d3828   8Gi        RWO            Delete           Bound      default/etcd-test-0-etcd-test   local-path     <unset>                          3m1s
pvc-5d29b5c7-5e8b-4d4c-9b87-65212a820fa9   8Gi        RWO            Delete           Released   default/etcd-test-2-etcd-test   local-path     <unset>                          19m
pvc-ac8100ff-1a25-482f-9b39-3206e5e1e533   8Gi        RWO            Delete           Released   default/etcd-test-0-etcd-test   local-path     <unset>                          19m
pvc-c9a67641-5d0c-4bc8-a89f-ada2a042d50a   8Gi        RWO            Delete           Bound      default/etcd-test-2-etcd-test   local-path     <unset>                          2m58s
pvc-f5a98ae6-5ffb-4cf3-822d-47e7135d9293   8Gi        RWO            Delete           Bound      default/etcd-test-1-etcd-test   local-path     <unset>                          3m
pvc-f5b10c18-9250-4da7-99e0-bf03769e7b59   8Gi        RWO            Delete           Released   default/etcd-test-1-etcd-test   local-path     <unset>                          19m

I'm seeing different behavior now - the existing pods do not go to Pending, but have entered the Running state. The older PVs which were created before hibernating the shoot cluster still exist, and checking their events shows that these PVs could not be deleted (since the backing nodes do not exist anymore):

Events:
  Type     Reason              Age                    From                                                                                               Message
  ----     ------              ----                   ----                                                                                               -------
  Warning  VolumeFailedDelete  2m10s (x5 over 8m30s)  rancher.io/local-path_local-path-provisioner-dbff48958-wvtq9_a06cd4ea-c64e-4b2a-b161-7548008730e3  failed to delete volume pvc-5d29b5c7-5e8b-4d4c-9b87-65212a820fa9: failed to delete volume pvc-5d29b5c7-5e8b-4d4c-9b87-65212a820fa9: pods "helper-pod-delete-pvc-5d29b5c7-5e8b-4d4c-9b87-65212a820fa9" not found

Finally, the etcd cluster does not become healthy and ready even after 10 minutes.
There's a bit more to investigate here. Will update this comment with more details.

Will also update my original comment with the corresponding findings.

shreyas-s-rao · 2025-01-01T13:38:41Z

Sure @renormalize , thanks.

renormalize · 2025-01-07T13:57:40Z

The following changes were required in etcd-backup-restore to handle the case when etcd clusters are scaled in to 0 replicas and scaled back out to 3 replicas:

gardener/etcd-backup-restore@master...renormalize:etcd-backup-restore:storage

In essence, without these changes, when the etcd cluster is scaled back out from 0 to 3 replicas, there is no data directory, which causes etcd-backup-restore to enter a restoration flow.
This involves creating clients to (the non-existent) etcd cluster which causes the initialization flow to get stuck.
The (somewhat hack-y) fix is to always bootstrap etcd clusters and not bother with restoration when backups are disabled.

The reason etcd-backup-restore enters a restoration flow is because member leases for the etcd pods are still present when the cluster is scaled in to 0.
#860 is relevant to this issue, and might become a prerequisite. Based on the decisions made in #860, corresponding changes are to be made in etcd-backup-restore.

renormalize · 2025-01-21T10:45:26Z

After a discussion with @vlerenc and @gardener/etcd-druid-maintainers, it was decided that support for ephemeral storage will be provided with emptyDir, due to its simplicity and because it offers functionality identical to generic ephemeral volumes with local-path-provisioner as the provisioner.

However, there are still a few areas of concern which need to be addressed:

What limits are generally to be recommended for etcd-events?
How many etcd-events pods can be allocated to one node?
How would etcd-events be evenly spread across multiple nodes?
Does cluster-autoscaler respect the limits being set by emptyDir?
How would existing etcd-events clusters move from CSI volumes to emptyDir?

vlerenc added the area/cost (Cost related) label on Oct 8, 2024

renormalize self-assigned this Dec 18, 2024

renormalize added the area/control-plane (Control plane related) and area/storage (Storage related) labels on Jan 1, 2025
