| title | authors | reviewers | creation-date | last-updated | status |
|---|---|---|---|---|---|
| MachineDrainRules | | | 2024-09-30 | 2024-09-30 | implementable |
Today, when Cluster API deletes a Machine it drains the corresponding Node to ensure all Pods running on the Node have
been gracefully terminated before deleting the corresponding infrastructure. The current drain implementation has
hard-coded rules to decide which Pods should be evicted. This implementation is aligned with `kubectl drain`
(see the Machine deletion process documentation for more details).
With recent changes in Cluster API, we now have finer control over the drain process, and thus we propose a new
`MachineDrainRule` CRD to make drain rules configurable per Pod. Additionally, we propose labels that
workload cluster admins can add to individual Pods to control their drain behavior.
This would be a huge improvement over the “standard” `kubectl drain`-aligned implementation we have today and would help
solve a family of issues identified when running Cluster API in production.
While the “standard” `kubectl drain` rules have served us well, new user stories have emerged where more sophisticated
behavior is needed.
Story 1: Define drain order
As a cluster admin/user, I would like to define in which order Pods are evicted. For example, I want to configure that Portworx Pods are evicted after all other Pods (CAPI-11024).
Note: As of today the only way to address this use case with Cluster API is to implement a pre-drain hook with a custom drain process. Due to the complexity of this solution it is not viable for most users.
Story 2: Exclude “undrainable” Pods from drain
As a cluster admin/user, I would like the node drain process to ignore pods that will obviously not drain properly (e.g., because they're tolerating the unschedulable or all taints).
Note: As of today, Pods that tolerate the `node.kubernetes.io/unschedulable:NoSchedule` taint will just get rescheduled
to the Node after they have been evicted. This can lead to race conditions and prevent the Machine from being deleted.
Story 3: Exclude Pods that should not be drained from drain
As a cluster admin/user, I would like to instruct the Node drain process to exclude some Pods, e.g. because I don't care if they're gracefully terminated and they should run until the Node goes offline (e.g. monitoring agents).
Story 4: Define drain order & exclude Pods from drain without management cluster access
As a workload cluster admin/user, I want to be able to define drain order of Pods and also exclude Pods from drain (similar to Story 1-3, just without management cluster access).
Story 5: Apply drain configuration to a set of Machines / Clusters
As a cluster admin/user, I would like to be able to configure the drain behavior for all Machines of a management cluster or for a subset of these Machines without having to configure each Machine individually. I would also like to be able to change the drain configuration without having to modify all Machine objects.
- Allow users to customize how/if Pods are evicted:
  - Allow defining the order in which Pods are evicted
  - Allow defining whether Pods should be excluded from drain
- Use the current default drain rules as a fallback if neither `MachineDrainRule` CRs nor drain labels apply to a Pod (this also ensures compatibility with the current behavior)
- Stop waiting for volume detachment for Pods that are excluded from drain
- Change the drain behavior for DaemonSet and static Pods (we’ll continue to skip draining for both). While the drain behavior itself won't be changed, we will stop waiting for detachment of volumes of DaemonSet Pods.
- Align with the evolving landscape for Node drain in Kubernetes as soon as the new features are available, stable, and usable with the Kubernetes versions Cluster API supports (e.g. KEP-4212, KEP-4563).
Because of the wide range of workload cluster Kubernetes versions supported in Cluster API (n-5), it will take significant time until we can use new Kubernetes features in all supported versions. So an implementation in Cluster API is the only option to improve drain anytime soon (even for the oldest supported Kubernetes versions).
Also, please note that while we would like to align with Kubernetes and leverage its new features as soon as possible, there will always be a delta to be addressed in Cluster API due to the fact that Cluster API is operated from the management cluster, while drain is done in workload clusters.
Additionally, Cluster API has to manage many clusters, while Kubernetes features are by design scoped to a single cluster.
We propose to introduce a new namespace-scoped `MachineDrainRule` CRD. The goal of this CRD is to allow management
cluster admins/users to define a drain rule (a specific drain behavior) for a set of Pods.
Additionally, drain rules can be restricted to a specific set of Machines, e.g. targeting only control plane Machines, Machines of specific Clusters, Linux Machines, etc.
The new CRD will look like this:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDrainRule
metadata:
  name: example-rule
  namespace: default
spec:
  drain:
    # behavior can be Drain or Skip
    behavior: Drain
    # order can only be set if behavior == Drain
    # Pods with higher order are drained later
    order: 100
  machines:
  - selector:
      matchLabels:
      matchExpressions:
    clusterSelector:
      matchLabels:
      matchExpressions:
  pods:
  - selector:
      matchLabels:
      matchExpressions:
    namespaceSelector:
      matchLabels:
      matchExpressions:
```
Notes:
- The Machine controller will evaluate the `machines` and `pods` selectors to determine to which Machines and Pods the rule should be applied.
- All selectors are of type `metav1.LabelSelector`.
- In order to allow expressing selectors with AND and OR predicates, both `machines` and `pods` have a list of selectors instead of a single selector. Within a single element of a selector list AND is used, while between multiple entries of a selector list OR is used.
Example (exclude Pods from drain): the following rule skips draining specific application Pods and monitoring Pods on all Machines.
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDrainRule
metadata:
  name: skip-pods
  namespace: default
spec:
  drain:
    behavior: Skip
  machines:
  - selector: {}
  pods:
  # This selects all Pods with the app label in (example-app1,example-app2) AND in the example-namespace
  - selector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - example-app1
        - example-app2
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: example-namespace
  # This additionally selects Pods with the app label == monitoring AND in the monitoring namespace
  - selector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - monitoring
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: monitoring
```
Example (related to CAPI-11024): ensure Portworx Pods are evicted after all other Pods.
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDrainRule
metadata:
  name: portworx
  namespace: default
spec:
  drain:
    behavior: Drain
    order: 100
  machines:
  - selector: {}
  pods:
  - selector:
      matchLabels:
        app: portworx
```
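As mentioned above, rules can also be restricted to a subset of Machines. The following is a minimal sketch of such a rule; the rule name, the `stage` Cluster label, the `order` value, and the `app` Pod label are illustrative, while `cluster.x-k8s.io/control-plane` is the label Cluster API sets on control plane Machines:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDrainRule
metadata:
  name: control-plane-example   # illustrative name
  namespace: default
spec:
  drain:
    behavior: Drain
    order: 50                   # illustrative order
  machines:
  # Only applies to control plane Machines of Clusters labeled stage: production
  # (the stage label on Clusters is illustrative).
  - selector:
      matchLabels:
        cluster.x-k8s.io/control-plane: ""
    clusterSelector:
      matchLabels:
        stage: production
  pods:
  - selector:
      matchLabels:
        app: example-app        # illustrative Pod label
```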
We propose to introduce new labels to allow workload cluster admins/users to define the drain behavior of their Pods.
These labels can be added either directly to Pods or indirectly via Deployments, StatefulSets, etc.
The labels will take precedence over `MachineDrainRules` specified in the management cluster.
The following label is proposed:
`cluster.x-k8s.io/drain: skip`
Initially, we also considered adding a `cluster.x-k8s.io/drain-order` label. But we're not entirely sure about it yet,
as someone with access to the workload cluster (or maybe only to specific Pods) would be able to influence the
drain order of the entire cluster, which might lead to problems. The skip label is safe in comparison, because it
only influences the drain behavior of the Pod that carries the label.
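For illustration, a minimal sketch of how a workload cluster user might set the label via a Deployment's Pod template; the Deployment name, namespace, `app` label, and image are hypothetical:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-monitoring-agent        # hypothetical example workload
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-monitoring-agent
  template:
    metadata:
      labels:
        app: node-monitoring-agent
        # Instructs the Machine controller to skip draining these Pods.
        cluster.x-k8s.io/drain: skip
    spec:
      containers:
      - name: agent
        image: example.com/monitoring-agent:v1.0.0   # hypothetical image
```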
The following describes the new algorithm that decides which Pods should be drained and in which order.
- List all Pods on the Node
- For each Pod (first match applies):
  - If the Pod is a DaemonSet Pod or a static Pod => use `behavior: Skip`
  - If the Pod has the `cluster.x-k8s.io/drain: skip` label => use `behavior: Skip`
  - If there is a matching `MachineDrainRule` => use `behavior` and `order` from the first matching `MachineDrainRule` (based on alphabetical order)
  - Otherwise => use `behavior: Drain` and `order: 0`
- If there are no more Pods to be drained => drain is completed
- Else => evict all Pods with the lowest `order`, update the condition and requeue
Notes:
- It does not make sense to evict DaemonSet Pods, because the DaemonSet controller always adds tolerations for the unschedulable taint to all its Pods so they would be just immediately rescheduled.
- It does not make sense to evict static Pods, because the kubelet won’t actually terminate them and the Pod object will get recreated.
- If multiple `MachineDrainRules` match the same Pod, the “first” rule (based on alphabetical order) will be applied.
  - Alphabetical order is used because:
    - There is no way to ensure only a single `MachineDrainRule` applies to a specific Pod, because not all Pods and their labels are known when `MachineDrainRules` are created/updated, especially as the Pods will also change over time.
    - So a criterion is needed to decide which `MachineDrainRule` is used when multiple `MachineDrainRules` apply.
    - For now, we decided to use simple alphabetical order. An alternative would have been to add a `priority` field to `MachineDrainRules` and enforce that each priority is used only once, but this is not entirely race condition free (when two `MachineDrainRules` are created "at the same time") and adding "another" priority field on top of `order` can be confusing to users.
- Pods with drain behavior `Drain` and a given `order` will only be evicted after all other Pods with a lower `order` have already been deleted.
- When waiting for volume detach, volumes belonging to Pods with drain behavior `Skip` are going to be ignored. In other words, Machine deletion can continue if only volumes belonging to Pods with drain behavior `Skip` are still attached to the Node.
Today, after Node drain we wait for all volumes to be detached. We are going to change that behavior to ignore all attached volumes that belong to Pods for which we skipped the drain.
Please note that today the only Pods with volumes for which we skip drain are DaemonSet Pods. If a DaemonSet Pod
has a volume, the wait for volume detachment currently blocks indefinitely. The only way around this today is to set either
the `Machine.spec.nodeVolumeDetachTimeout` field or the `machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach`
annotation. With this change, we will stop waiting for volumes of DaemonSet Pods to be detached.
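For reference, a minimal sketch of the workaround available today; the Machine name, Cluster name, and timeout value are illustrative, and required fields such as `bootstrap` and `infrastructureRef` are omitted for brevity:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: example-machine                  # illustrative name
  namespace: default
  annotations:
    # Alternative workaround: skip waiting for volume detachment entirely.
    machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach: ""
spec:
  clusterName: example-cluster           # illustrative Cluster name
  # Stop waiting for volume detachment after 5 minutes instead of blocking indefinitely.
  nodeVolumeDetachTimeout: 5m
  # bootstrap and infrastructureRef omitted for brevity
```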
Note: We can't exclude the possibility that with some CSI implementations deleting Machines while volumes are still attached
will lead to problems. So please use this carefully. However, we verified that the `kube-controller-manager` will handle these
situations correctly by ensuring the volume gets freed up after the Machine & Node have been deleted, so that it can be used again.
This proposal will add a new `MachineDrainRule` CRD. The Cluster API core controller needs permissions to read this CRD.
While permissions to write the CRD would typically be given to the cluster admin (the persona who has permissions to manage
Cluster CRs), it is also possible to further restrict access.
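For example, access could be restricted with standard RBAC. A minimal sketch; the Role name and namespace are illustrative, and the resource name assumes the usual lowercase plural form:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: machinedrainrule-editor   # illustrative name
  namespace: default
rules:
# Grants a dedicated persona full write access to MachineDrainRules in this namespace.
- apiGroups: ["cluster.x-k8s.io"]
  resources: ["machinedrainrules"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```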
- There is ongoing work in Kubernetes in this area that we should eventually align with (KEP-4212, KEP-4563).
- We are going to provide feedback to the KEPs to try to make sure the same use cases as in the current proposal are covered.
Add Node drain configuration to the Machine object
We considered adding the drain rules directly to the Machine objects (and thus also to the MachineDeployment & MachineSet objects) instead. We discarded this option because it would have made it necessary to configure Node drain on every single Machine. By having a separate CRD it is now possible to configure the Node drain for all Clusters / Machines or a specific subset of Machines at once. This also means that the Node drain configuration can be immediately changed without having to propagate configuration changes to all Machines.
MachineDrainRequestTemplates
We considered defining separate MachineDrainRequest and MachineDrainRequestTemplate CRDs. The MachineDrainRequestTemplates would be used to define the drain behavior. When the Machine controller starts a drain, a MachineDrainRequest would be created based on the applicable MachineDrainRequestTemplates. The MachineDrainRequest would then be used to reconcile and track the drain. The downside would be that with many templates the resulting custom resource becomes very large and unwieldy to manage. Additionally, it is not clear whether a separate MachineDrainRequest CRD is necessary; it is probably easier to track the drain directly on the Machine object.
`MachineDrainRules` are orthogonal to the state of the Cluster / Machines as they only configure how the Machine controller
should drain Nodes. Accordingly, they are not part of the Machine specs. Thus, as soon as the new Cluster API version that
supports this feature is deployed, `MachineDrainRules` can be used immediately without rolling out / re-configuring any
Clusters / Machines.
Please note that while the drain behavior of DaemonSet Pods won't change (we’ll continue to skip draining), we will stop waiting for detachment of volumes of DaemonSet Pods.
- Extensive unit test coverage for the Node drain code for all supported cases
- Extend the Node drain e2e test to cover draining Pods using various `MachineDrainRules` and the labels (including validation of condition messages).
The `MachineDrainRule` CRD will be added as a `v1beta1` CRD to the `cluster.x-k8s.io` API group.
An additional feature gate is not required, as the behavior of the Machine controller will stay the same if neither
`MachineDrainRule` CRs nor the labels are used.
No limitations apply; the new feature will not depend on any external APIs.
- 2024-09-30: Open proposal PR
- 2024-10-02: Present proposal at a community meeting