Use `ToBeDeletedByClusterAutoscaler` Taint to improve load balancing during machine terminations #1054

dergeberl · 2025-11-25T15:36:22Z

What this PR does / why we need it:

If a machine is deleted the it is possible that the load balancer do not know yet that the node is not available anymore. This results eventually in unsuccessful new connections as the node where the load balancer is forwarding the traffic is not existing anymore.

There are 2 ways a load balancer gets informed: 1. Via a reconcile of the load balancer in the cloud-provider package or the kube-proxy health check. Unfortunately the "drain" or conditions set by the MCM dose not trigger a reconcile or the health check to fail.

The ToBeDeletedByClusterAutoscaler Taint is used in both cloud-provider and kube-proxy, therefore we want to add this Taint to improve the load balancing during machine terminations.

This is new step is added between Initiate node drain/Delete Volume Attachments and Initiate VM deletion with one additional ShortRetry (5s) to give the cloud-provider and kube-proxy time to react before removing the server.

CC: @kon-angelo

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Use `ToBeDeletedByClusterAutoscaler` Taint to improve load balancing during machine terminations.

aaronfern · 2025-12-15T11:47:28Z

Thanks for the PR @dergeberl!

I agree with the need for the PR, however I have a reservation with the taint that is being used.
Using a ClusterAutoscaler taint is a little misleading as we can have cases where CA is not in the picture, or even not deployed. In these cases, the presence of this taint on a node can be a bit misleading.
Was there a reason this taint was chosen instead of the exclude-from-external-load-balancers label?

dergeberl · 2025-12-17T12:06:48Z

Was there a reason this taint was chosen instead of the exclude-from-external-load-balancers label?

The label dose not trigger the kube-proxy healthz check to fail. This would only result in a reconcile of the service (cloud-controller-manager) but there was a change in the cloud-provider package to reduce the number of requests against the infrastructure (load balancer) API. But this would also solve our issue with that.

the presence of this taint on a node can be a bit misleading.

I agree. I also dont like that the kube-proxy and cloud-provider uses this taint. In my opinion there should be a generic DoBeDeleted Taint which can be used by cluster autoscaler but also for other solutions (like the MCM in our usecase). Unfortunately it seems the discussions around this are stale and no progress in the last month.

aaronfern · 2025-12-22T07:06:45Z

pkg/util/provider/machinecontroller/machine_util.go

+	node, err := c.nodeLister.Get(getNodeName(machine))
+	if err != nil {
+		if !apierrors.IsNotFound(err) {
+			klog.Errorf("Error occurred while trying to fetch node object - err: %s", err)


nit, since go convention is that errors should not be capitalised, also %v is a more appropriate formatter for errors.
I understand that the rest of the file contains many examples of exactly this, but we will correct this soon for the other places

Suggested change

klog.Errorf("Error occurred while trying to fetch node object - err: %s", err)

klog.Errorf("error occurred while trying to fetch node object - err: %v", err)

aaronfern · 2025-12-22T07:23:05Z

pkg/util/provider/machinecontroller/machine_util.go

+			},
+			// Let the clone.Status.CurrentStatus (LastUpdateTime) be as it was before.
+			// This helps while computing when the drain timeout to determine if force deletion is to be triggered.
+			// Ref - https://github.com/gardener/machine-controller-manager/blob/rel-v0.34.0/pkg/util/provider/machinecontroller/machine_util.go#L872


This ref link does not exist. Could you please correct it?

aaronfern · 2025-12-22T08:27:17Z

pkg/util/provider/machinecontroller/machine_util.go

+		)
+	}
+
+	node.Spec.Taints = append(node.Spec.Taints, toBeDeletedTaint)


This could potentially cause duplicate taints on the node. Do you think this could cause an issue?
Maybe consider using AddOrUpdateTaint() in that case

Add ToBeDeletedByClusterAutoscaler taint after node drain

38fd490

dergeberl requested a review from a team as a code owner November 25, 2025 15:36

gardener-robot added needs/review Needs review size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs/second-opinion Needs second review by someone else labels Nov 25, 2025

aaronfern self-assigned this Dec 9, 2025

aaronfern requested changes Dec 22, 2025

View reviewed changes

gardener-robot added the needs/changes Needs (more) changes label Dec 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use `ToBeDeletedByClusterAutoscaler` Taint to improve load balancing during machine terminations #1054

Use `ToBeDeletedByClusterAutoscaler` Taint to improve load balancing during machine terminations #1054

Uh oh!

dergeberl commented Nov 25, 2025

Uh oh!

aaronfern commented Dec 15, 2025

Uh oh!

dergeberl commented Dec 17, 2025

Uh oh!

aaronfern Dec 22, 2025

Uh oh!

aaronfern Dec 22, 2025

Uh oh!

aaronfern Dec 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	klog.Errorf("Error occurred while trying to fetch node object - err: %s", err)
	klog.Errorf("error occurred while trying to fetch node object - err: %v", err)

Use ToBeDeletedByClusterAutoscaler Taint to improve load balancing during machine terminations #1054

Are you sure you want to change the base?

Use ToBeDeletedByClusterAutoscaler Taint to improve load balancing during machine terminations #1054

Uh oh!

Conversation

dergeberl commented Nov 25, 2025

Uh oh!

aaronfern commented Dec 15, 2025

Uh oh!

dergeberl commented Dec 17, 2025

Uh oh!

aaronfern Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

aaronfern Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

aaronfern Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Use `ToBeDeletedByClusterAutoscaler` Taint to improve load balancing during machine terminations #1054

Use `ToBeDeletedByClusterAutoscaler` Taint to improve load balancing during machine terminations #1054