AWSMachinePool does not drain nodes during scale-in #2023

dthorsen · 2020-10-13T18:39:00Z

/kind bug

What steps did you take and what happened:

Create a workload cluster with the experimental EKS control plane.
Create a MachinePool with replicas: 5 and the associated AWSMachinePool resources. (Note: this AWSMachinePool is not managed by cluster-autoscaler.)
Create a Deployment and scale it so that every machine runs at least one of its pods.
Create a PDB protecting the Deployment with maxUnavailable: 1.
Scale the MachinePool down to replicas: 3.

This caused the AWSMachinePool controller to set the desired capacity of the ASG to 3 without draining any nodes. The PDB was not honored, and the ASG terminated the EC2 instances immediately.

What did you expect to happen:
The nodes should have been drained gracefully before the EC2 instances were terminated.

Anything else you would like to add:
In the current AWSMachinePool implementation, instance selection for scale-in is performed by the Auto Scaling group. For the non-cluster-autoscaler case, this could be fixed by modifying the AWSMachinePool controller to select the nodes for scale-in itself, drain the selected nodes, and finally call the AWS TerminateInstanceInAutoScalingGroup action with the request value ShouldDecrementDesiredCapacity: true.
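
A minimal sketch of that flow, in Python for brevity rather than the controller's Go. It assumes `autoscaling` behaves like `boto3.client("autoscaling")`; `scale_in` and the `drain_node` callback are hypothetical names, and how the candidates are selected is left open.

```python
from typing import Callable, Iterable, List


def scale_in(
    autoscaling,
    candidates: Iterable[str],
    count: int,
    drain_node: Callable[[str], bool],
) -> List[str]:
    """Drain up to `count` instances, then terminate each one through the ASG.

    `autoscaling` is assumed to behave like boto3.client("autoscaling").
    `drain_node` is a hypothetical callback that cordons and drains the node
    backing an instance and returns True only once the drain has succeeded.
    """
    terminated: List[str] = []
    for instance_id in candidates:
        if len(terminated) >= count:
            break
        if not drain_node(instance_id):
            # Drain did not complete (for example, blocked by a PDB):
            # leave this instance running and try the next candidate.
            continue
        # Terminate the drained instance and shrink the group in one call,
        # so the ASG does not pick a replacement victim on its own.
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=True,
        )
        terminated.append(instance_id)
    return terminated
```

Because the desired capacity is decremented by the same call that terminates the instance, the ASG never has to choose which instance to remove.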

We may also want to consider a lifecycle hook on the Auto Scaling group that prevents EC2 instance termination until the drain completes. This would help prevent cases where instances are forcibly terminated without draining when the desired capacity is changed via the EC2 console, CLI, or APIs.
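
A rough sketch of the lifecycle-hook idea, again assuming `autoscaling` behaves like `boto3.client("autoscaling")`. The hook name, the timeout, and both function names are illustrative assumptions, not anything the provider defines today.

```python
def register_drain_hook(
    autoscaling,
    asg_name: str,
    hook_name: str = "drain-before-terminate",  # hypothetical hook name
    timeout_seconds: int = 900,  # assumed upper bound for a drain
) -> None:
    """Hold terminating instances in Terminating:Wait until released."""
    autoscaling.put_lifecycle_hook(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
        HeartbeatTimeout=timeout_seconds,
        # If nobody completes the action in time, termination proceeds anyway.
        DefaultResult="CONTINUE",
    )


def release_instance(
    autoscaling,
    asg_name: str,
    instance_id: str,
    hook_name: str = "drain-before-terminate",
) -> None:
    """Tell the ASG that the drain finished and termination may proceed."""
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
```

Something still has to notice the pending termination, drain the node, and call `release_instance`; the hook only buys the time to do so, bounded by the heartbeat timeout.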

Environment:

Cluster-api-provider-aws version: Commit: 3338cd4
Kubernetes version: (use kubectl version): v1.17.9
OS (e.g. from /etc/os-release): Amazon Linux 2


fejta-bot · 2021-02-07T14:47:30Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

fejta-bot · 2021-03-09T15:33:18Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

randomvariable · 2021-03-11T19:10:09Z

Chatting with @sedefsavas
AWS Node Termination Handler ( https://github.com/aws/aws-node-termination-handler ) can help, but doesn't fully eliminate the problem: it gives only a 2-minute warning.

Sync with CAPZ on MachinePool v.Next

@kschumy, any ideas on what we should do here?

sedefsavas · 2021-03-23T03:50:14Z

We can follow an approach similar to OpenShift's POC, which polls the termination endpoint:
https://github.com/openshift/cluster-api-provider-aws/blob/b4a3478db44ddb554883cf77a9e5f49ffd54fdf4/pkg/termination/handler.go

More on this is discussed in the cluster-api proposal: kubernetes-sigs/cluster-api#3528
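
For illustration, a polling loop along these lines might look like the sketch below. It is not the linked handler's code: the metadata URL is an assumption (the spot termination notice path on the EC2 instance metadata service), and `termination_pending`, `wait_for_termination`, and `on_notice` are hypothetical names. Instances that require IMDSv2 would also need a session token, which is omitted here.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

# Assumption: spot termination notice path on the instance metadata service.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def termination_pending(url: str = TERMINATION_URL, timeout: float = 1.0) -> bool:
    """Return True when the endpoint answers 200, i.e. a notice was posted."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # A 404 (no notice yet) and network errors both mean "not pending".
        return False


def wait_for_termination(
    on_notice: Callable[[], None],
    check: Callable[[], bool] = termination_pending,
    interval: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check`; on the first positive answer run `on_notice` once.

    Returns True if a notice was seen, False if `max_polls` ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if check():
            on_notice()  # e.g. start cordoning and draining this node
            return True
        polls += 1
        sleep(interval)
    return False
```

The drain still has to finish inside whatever warning window AWS gives, so polling shortens the reaction time but does not remove the limitation mentioned above.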

fejta-bot · 2021-04-22T04:17:15Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot · 2021-04-22T04:17:22Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

richardcase · 2023-03-22T15:08:47Z

/reopen
/remove-lifecycle rotten

richardcase · 2023-04-03T16:30:32Z

From office hours 2023-04-03:

This will potentially be handled by "Implement MachinePool Machines clusterAPI proposal" #4184.
The "providers refresh" has a weakness: AWS gives only a short time before termination (the same issue affects AWSManagedMachinePools).
Users expect nodes to be drained.

/triage accepted
/priority important-soon

dlipovetsky · 2023-04-03T16:31:41Z

Also from office hours discussion:

Users define Pod Disruption Budgets to ensure that their Pods are not voluntarily deleted.

A scale-in of a MachinePool, if it uses the "providers refresh", will always proceed, even if it violates a budget.

For comparison, a scale-in of a MachineDeployment will never proceed if it violates a budget.
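
To make the budget arithmetic concrete, here is a simplified model of a maxUnavailable budget. It is an illustration only, not how either controller is implemented, and both function names are hypothetical.

```python
def disruptions_allowed(expected_pods: int, healthy_pods: int, max_unavailable: int) -> int:
    """How many more pods may be voluntarily disrupted right now."""
    unavailable = expected_pods - healthy_pods
    return max(0, max_unavailable - unavailable)


def scale_in_respects_budget(
    expected_pods: int,
    healthy_pods: int,
    max_unavailable: int,
    pods_on_removed_nodes: int,
) -> bool:
    """True if removing the nodes at once stays within the budget."""
    allowed = disruptions_allowed(expected_pods, healthy_pods, max_unavailable)
    return pods_on_removed_nodes <= allowed


# Scenario in the spirit of the original report: 5 healthy pods, one per node,
# maxUnavailable: 1, and two nodes removed at the same time.
print(scale_in_respects_budget(5, 5, 1, pods_on_removed_nodes=2))
```

Under this model, taking two pods down at once exceeds a budget of one. A drain that goes through the eviction API would be held back after the first pod until a replacement becomes healthy.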

k8s-triage-robot · 2023-07-02T17:06:31Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot · 2024-01-23T16:50:05Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-02-22T16:56:04Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the kind/bug label Oct 13, 2020

dthorsen mentioned this issue Oct 14, 2020: REQUEST: New membership for dthorsen kubernetes/org#2266 (closed)

randomvariable added this to the v0.6.x milestone Nov 9, 2020

k8s-ci-robot added the lifecycle/stale label Feb 7, 2021

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 9, 2021

randomvariable modified the milestones: v0.6.x, v0.7.0 Mar 11, 2021

k8s-ci-robot closed this as completed Apr 22, 2021

k8s-ci-robot removed the lifecycle/rotten label Mar 22, 2023

richardcase reopened this Mar 22, 2023

k8s-ci-robot added the needs-priority and needs-triage labels Mar 22, 2023

fiunchinho mentioned this issue Mar 22, 2023: Machine pool nodes are not drained during upgrade giantswarm/roadmap#2170 (closed)

k8s-ci-robot added the needs-triage label and removed the triage/accepted label Jul 2, 2023

k8s-ci-robot added the lifecycle/stale label Jan 23, 2024

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 22, 2024
