Skip to content

Conversation

@dergeberl
Copy link

What this PR does / why we need it:

If a machine is deleted the it is possible that the load balancer do not know yet that the node is not available anymore. This results eventually in unsuccessful new connections as the node where the load balancer is forwarding the traffic is not existing anymore.

There are 2 ways a load balancer gets informed: 1. Via a reconcile of the load balancer in the cloud-provider package or the kube-proxy health check. Unfortunately the "drain" or conditions set by the MCM dose not trigger a reconcile or the health check to fail.

The ToBeDeletedByClusterAutoscaler Taint is used in both cloud-provider and kube-proxy, therefore we want to add this Taint to improve the load balancing during machine terminations.

This is new step is added between Initiate node drain/Delete Volume Attachments and Initiate VM deletion with one additional ShortRetry (5s) to give the cloud-provider and kube-proxy time to react before removing the server.

CC: @kon-angelo

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Use `ToBeDeletedByClusterAutoscaler` Taint to improve load balancing during machine terminations.

@dergeberl dergeberl requested a review from a team as a code owner November 25, 2025 15:36
@gardener-robot gardener-robot added needs/review Needs review size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs/second-opinion Needs second review by someone else labels Nov 25, 2025
@aaronfern aaronfern self-assigned this Dec 9, 2025
@aaronfern
Copy link
Member

Thanks for the PR @dergeberl!

I agree with the need for the PR, however I have a reservation with the taint that is being used.
Using a ClusterAutoscaler taint is a little misleading as we can have cases where CA is not in the picture, or even not deployed. In these cases, the presence of this taint on a node can be a bit misleading.
Was there a reason this taint was chosen instead of the exclude-from-external-load-balancers label?

@dergeberl
Copy link
Author

Was there a reason this taint was chosen instead of the exclude-from-external-load-balancers label?

The label dose not trigger the kube-proxy healthz check to fail. This would only result in a reconcile of the service (cloud-controller-manager) but there was a change in the cloud-provider package to reduce the number of requests against the infrastructure (load balancer) API. But this would also solve our issue with that.

the presence of this taint on a node can be a bit misleading.

I agree. I also dont like that the kube-proxy and cloud-provider uses this taint. In my opinion there should be a generic DoBeDeleted Taint which can be used by cluster autoscaler but also for other solutions (like the MCM in our usecase). Unfortunately it seems the discussions around this are stale and no progress in the last month.

node, err := c.nodeLister.Get(getNodeName(machine))
if err != nil {
if !apierrors.IsNotFound(err) {
klog.Errorf("Error occurred while trying to fetch node object - err: %s", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, since go convention is that errors should not be capitalised, also %v is a more appropriate formatter for errors.
I understand that the rest of the file contains many examples of exactly this, but we will correct this soon for the other places

Suggested change
klog.Errorf("Error occurred while trying to fetch node object - err: %s", err)
klog.Errorf("error occurred while trying to fetch node object - err: %v", err)

},
// Let the clone.Status.CurrentStatus (LastUpdateTime) be as it was before.
// This helps while computing when the drain timeout to determine if force deletion is to be triggered.
// Ref - https://github.com/gardener/machine-controller-manager/blob/rel-v0.34.0/pkg/util/provider/machinecontroller/machine_util.go#L872
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ref link does not exist. Could you please correct it?

)
}

node.Spec.Taints = append(node.Spec.Taints, toBeDeletedTaint)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially cause duplicate taints on the node. Do you think this could cause an issue?
Maybe consider using AddOrUpdateTaint() in that case

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Dec 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs/changes Needs (more) changes needs/review Needs review needs/second-opinion Needs second review by someone else size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants