Use Etcd druid to set tolerations, node affinity and TSC policies for HA etcd clusters #899

Open
unmarshall opened this issue Oct 28, 2024 · 0 comments
Labels
area/control-plane Control plane related area/high-availability High availability related kind/enhancement Enhancement, improvement, extension

unmarshall commented Oct 28, 2024

How to categorize this issue?

/area control-plane
/area high-availability
/kind enhancement

What would you like to be added:
Add the capability in etcd-druid to determine and set NodeAffinity, topology spread constraints (TSC) and Tolerations on the etcd StatefulSet pods of an HA etcd cluster.

Why is this needed:

When etcd-druid is used in Gardener today, the gardener-resource-manager HA webhook does the following:

  • Mutates the replicas (see code)
  • Mutates the node affinity (see code)
  • Mutates TSC (see code)
  • Mutates Toleration seconds (see code) - This is done to allow faster recovery for HA clusters in a single zone.

Unfortunately, when mutating the TSC, the webhook builds the LabelSelector from all labels in PodTemplateSpec.Labels of the respective STS (see code). This is problematic because new labels can be added during an upgrade, and not all labels are meant to uniquely identify the pods belonging to a StatefulSet (i.e. an etcd cluster).

So imagine the following scenario:
Starting State:
There is a non-HA etcd cluster (replicas=1). Let's assume that the pods of the StatefulSet provisioned for the etcd cluster have the following labels:

app.kubernetes.io/name: etcd-test
app.kubernetes.io/managed-by: etcd-druid
app.kubernetes.io/part-of: etcd-test
app.kubernetes.io/component: statefulset

Pod etcd-test-0 is currently scheduled in zone-A.

Upgrade etcd cluster to HA

  • The cluster is upgraded to HA (replicas=3) and a new label druid.gardener.cloud/etcd-cluster-size is added to the STS.
  • When etcd-test-1 and etcd-test-2 come up, they have the new label as well.
  • The HA webhook injects the TSC using all labels, including the new one. So while the TSC is evaluated, only etcd-test-1 and etcd-test-2 are visible. etcd-test-0 is not included in the set because its labels differ.
  • It is possible that one of these pods gets scheduled in zone-A; let's assume the other pod gets scheduled in zone-B.
  • After these 2 pods have started, etcd-test-0 is updated and now carries the new label as well. When the TSC is evaluated again, this pod cannot be placed in zone-A because another matching pod is already there and the TSC allows a maxSkew of 1 across zones. It also cannot be scheduled in any other zone because its PV is bound to zone-A, so the pod remains pending.
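
The scenario above can be replayed with a small, self-contained Go sketch. It does not use the real Kubernetes types or scheduler: `Pod`, `matches`, `countPerZone`, `skewIfPlaced` and `withLabel` are hypothetical helpers that only mimic an equality-based label selector and the skew check (matching pods in the target zone, plus the incoming pod, minus the minimum across zones). The third zone `zone-C` and the label value `"3"` are assumptions made for illustration; they do not appear in the issue.

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod (not corev1.Pod).
type Pod struct {
	Name   string
	Zone   string
	Labels map[string]string
}

// matches reports whether p carries every key/value pair of selector,
// mimicking an equality-based (matchLabels) selector.
func matches(selector map[string]string, p Pod) bool {
	for k, v := range selector {
		if got, ok := p.Labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// countPerZone counts the pods selected by selector in each listed zone.
func countPerZone(selector map[string]string, pods []Pod, zones []string) map[string]int {
	counts := make(map[string]int, len(zones))
	for _, z := range zones {
		counts[z] = 0
	}
	for _, p := range pods {
		if matches(selector, p) {
			counts[p.Zone]++
		}
	}
	return counts
}

// skewIfPlaced returns the skew that would result from adding one more
// matching pod to zone: (count in zone + 1) - (minimum count across zones).
func skewIfPlaced(counts map[string]int, zone string) int {
	first := true
	lowest := 0
	for _, c := range counts {
		if first || c < lowest {
			lowest = c
			first = false
		}
	}
	return counts[zone] + 1 - lowest
}

// withLabel returns a copy of base with one extra label; base is not modified.
func withLabel(base map[string]string, k, v string) map[string]string {
	out := make(map[string]string, len(base)+1)
	for key, val := range base {
		out[key] = val
	}
	out[k] = v
	return out
}

func main() {
	zones := []string{"zone-A", "zone-B", "zone-C"} // zone-C is an assumption
	oldLabels := map[string]string{
		"app.kubernetes.io/name":       "etcd-test",
		"app.kubernetes.io/managed-by": "etcd-druid",
		"app.kubernetes.io/part-of":    "etcd-test",
		"app.kubernetes.io/component":  "statefulset",
	}
	// The label value "3" is assumed for illustration.
	newLabels := withLabel(oldLabels, "druid.gardener.cloud/etcd-cluster-size", "3")
	selector := newLabels // the webhook takes all pod template labels

	// While etcd-test-1 is scheduled, etcd-test-0 still has the old labels.
	before := []Pod{{"etcd-test-0", "zone-A", oldLabels}}
	fmt.Println("etcd-test-0 counted:", matches(selector, before[0]))
	fmt.Println("skew if etcd-test-1 lands in zone-A:",
		skewIfPlaced(countPerZone(selector, before, zones), "zone-A"))

	// etcd-test-0 is recreated with the new labels and must return to zone-A.
	after := []Pod{
		{"etcd-test-1", "zone-A", newLabels},
		{"etcd-test-2", "zone-B", newLabels},
	}
	fmt.Println("skew if etcd-test-0 returns to zone-A:",
		skewIfPlaced(countPerZone(selector, after, zones), "zone-A"))
}
```

Under these assumptions the first placement in zone-A gives a skew of 1, because etcd-test-0 is invisible to the selector, while the later return of etcd-test-0 to zone-A gives a skew of 2, which exceeds a maxSkew of 1.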

After discussing with @timuthy, it was agreed that, since there is no generic way to determine which subset of labels uniquely identifies the pods of a StatefulSet, it is prudent to let etcd-druid set the TSC. Since etcd-druid sets the TSC, it can also set the tolerations and node affinity.
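
A minimal sketch of what this could look like on the etcd-druid side, assuming the operator restricts the selector to a fixed set of labels it controls. The types below are local stand-ins that only mirror the shape of the Kubernetes `TopologySpreadConstraint` and `LabelSelector` fields; `stableLabelKeys`, `stableSelector` and `zoneSpreadConstraint` are hypothetical names, and the choice of which keys count as stable is an assumption, not something decided in this issue.

```go
package main

import "fmt"

// LabelSelector and TopologySpreadConstraint are simplified local stand-ins
// that mirror the shape of the Kubernetes API types; they are not the real ones.
type LabelSelector struct {
	MatchLabels map[string]string
}

type TopologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable string
	LabelSelector     *LabelSelector
}

// stableLabelKeys is an assumed set of labels that identify an etcd cluster
// and do not change across upgrades. The actual set is up to etcd-druid.
var stableLabelKeys = []string{
	"app.kubernetes.io/name",
	"app.kubernetes.io/managed-by",
	"app.kubernetes.io/part-of",
}

// stableSelector builds a selector from the stable keys only, ignoring any
// other labels that happen to be present on the pod template.
func stableSelector(podLabels map[string]string) *LabelSelector {
	m := make(map[string]string, len(stableLabelKeys))
	for _, k := range stableLabelKeys {
		if v, ok := podLabels[k]; ok {
			m[k] = v
		}
	}
	return &LabelSelector{MatchLabels: m}
}

// zoneSpreadConstraint returns a zone-level constraint with maxSkew 1 whose
// selector does not depend on labels added later.
func zoneSpreadConstraint(podLabels map[string]string) TopologySpreadConstraint {
	return TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: "DoNotSchedule",
		LabelSelector:     stableSelector(podLabels),
	}
}

func main() {
	labels := map[string]string{
		"app.kubernetes.io/name":                 "etcd-test",
		"app.kubernetes.io/managed-by":           "etcd-druid",
		"app.kubernetes.io/part-of":              "etcd-test",
		"app.kubernetes.io/component":            "statefulset",
		"druid.gardener.cloud/etcd-cluster-size": "3", // value assumed
	}
	tsc := zoneSpreadConstraint(labels)
	fmt.Println(tsc.TopologyKey, tsc.MaxSkew, tsc.LabelSelector.MatchLabels)
}
```

Because the selector is the same before and after the new label is added, etcd-test-0 would be counted in zone-A while etcd-test-1 and etcd-test-2 are being scheduled.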

Gardener should still mutate the replicas to 3.

@gardener-robot gardener-robot added area/control-plane Control plane related area/high-availability High availability related kind/enhancement Enhancement, improvement, extension labels Oct 28, 2024