-
Notifications
You must be signed in to change notification settings - Fork 132
Use ToBeDeletedByClusterAutoscaler Taint to improve load balancing during machine terminations
#1054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thanks for the PR @dergeberl! I agree with the need for the PR, however I have a reservation with the taint that is being used. |
The label dose not trigger the kube-proxy healthz check to fail. This would only result in a reconcile of the service (cloud-controller-manager) but there was a change in the cloud-provider package to reduce the number of requests against the infrastructure (load balancer) API. But this would also solve our issue with that.
I agree. I also dont like that the |
| node, err := c.nodeLister.Get(getNodeName(machine)) | ||
| if err != nil { | ||
| if !apierrors.IsNotFound(err) { | ||
| klog.Errorf("Error occurred while trying to fetch node object - err: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, since go convention is that errors should not be capitalised, also %v is a more appropriate formatter for errors.
I understand that the rest of the file contains many examples of exactly this, but we will correct this soon for the other places
| klog.Errorf("Error occurred while trying to fetch node object - err: %s", err) | |
| klog.Errorf("error occurred while trying to fetch node object - err: %v", err) |
| }, | ||
| // Let the clone.Status.CurrentStatus (LastUpdateTime) be as it was before. | ||
| // This helps while computing when the drain timeout to determine if force deletion is to be triggered. | ||
| // Ref - https://github.com/gardener/machine-controller-manager/blob/rel-v0.34.0/pkg/util/provider/machinecontroller/machine_util.go#L872 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ref link does not exist. Could you please correct it?
| ) | ||
| } | ||
|
|
||
| node.Spec.Taints = append(node.Spec.Taints, toBeDeletedTaint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could potentially cause duplicate taints on the node. Do you think this could cause an issue?
Maybe consider using AddOrUpdateTaint() in that case
What this PR does / why we need it:
If a machine is deleted the it is possible that the load balancer do not know yet that the node is not available anymore. This results eventually in unsuccessful new connections as the node where the load balancer is forwarding the traffic is not existing anymore.
There are 2 ways a load balancer gets informed: 1. Via a reconcile of the load balancer in the
cloud-providerpackage or thekube-proxyhealth check. Unfortunately the "drain" or conditions set by the MCM dose not trigger a reconcile or the health check to fail.The
ToBeDeletedByClusterAutoscalerTaint is used in bothcloud-providerandkube-proxy, therefore we want to add this Taint to improve the load balancing during machine terminations.This is new step is added between
Initiate node drain/Delete Volume AttachmentsandInitiate VM deletionwith one additionalShortRetry(5s) to give thecloud-providerandkube-proxytime to react before removing the server.CC: @kon-angelo
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Release note: