Ensure that there is a single actor which reduces the machine deployment replicas #181

unmarshall · 2023-02-28T11:06:45Z

What would you like to be added:

Context:
Issue #118 highlights that even a small time difference between CA and MCM can lead to CA's MCM provider reducing the replicas of a MachineDeployment to 0 and, in the process, deleting newly launched VMs. In that issue, the Machine controller transitions a Machine to the Failed state because it could not start successfully (20 min timeout) and then launches a new VM. In the meantime, CA also sees this (1-2 seconds earlier than MCM) and marks the machine as a candidate for deletion. The deletion is handled by the MCM provider (https://github.com/gardener/autoscaler/blob/machine-controller-manager-provider/cluster-autoscaler/cloudprovider/mcm/mcm_manager.go#L435), which adds a priority annotation and reduces the replicas of the MachineDeployment. In the issue, the original number of replicas is 1, and CA reduces it to 0. MCM, which is in the middle of launching another VM, then sees that the replicas are set to 0 and stops all machines.

This happens because the single responsibility principle is broken with respect to managing the replicas of a MachineDeployment.
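The sequence described above can be replayed with a small toy model. This is only an illustrative sketch: `MachineDeployment`, `mcm_replace_failed`, `ca_scale_down`, `mcm_reconcile` and `run_scenario` are hypothetical names invented here, not the real CA or MCM code, and the deletion order (Failed machines first) is an assumption of the model.

```python
from dataclasses import dataclass, field


@dataclass
class MachineDeployment:
    """Toy stand-in for a MachineDeployment (not the real MCM type)."""
    replicas: int
    machines: dict = field(default_factory=dict)  # machine name -> phase


def mcm_replace_failed(md, failed, replacement):
    """MCM side: mark the timed-out machine Failed and launch a replacement."""
    md.machines[failed] = "Failed"
    md.machines[replacement] = "Pending"


def ca_scale_down(md):
    """CA side (hypothetical): lowers replicas on its own for its candidate."""
    md.replicas -= 1


def mcm_reconcile(md):
    """MCM side: delete surplus machines so the count matches replicas."""
    surplus = max(len(md.machines) - md.replicas, 0)
    # Assumption of this model: Failed machines are deleted before others.
    ordered = sorted(md.machines, key=lambda name: md.machines[name] != "Failed")
    for name in ordered[:surplus]:
        del md.machines[name]


def run_scenario(ca_interferes):
    """Replay the issue: replicas = 1, one machine fails to start in time."""
    md = MachineDeployment(replicas=1, machines={"m-old": "Pending"})
    mcm_replace_failed(md, "m-old", "m-new")
    if ca_interferes:
        ca_scale_down(md)  # second writer: replicas goes from 1 to 0
    mcm_reconcile(md)
    return md.replicas, sorted(md.machines)
```

With `ca_interferes=True` the model ends with zero replicas and no machines, because the replacement is deleted together with the Failed machine. With `ca_interferes=False` only the Failed machine is removed and the replacement survives.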

Why is this needed:

Clear boundaries of responsibility need to be defined between CA and MCM so that CA does not step over MCM.

CA's responsibility:

Scale out (within [min, max]) in case there are unscheduled pods.
Scale in (within [min, max]) in case there are underutilised nodes. In this process it should not drain the node, as that is solely the responsibility of MCM. We have seen that CA's implementation of node draining does not properly evict pods with PVs.
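The "within [min, max]" constraint on both scale out and scale in amounts to clamping the requested size. A minimal sketch, assuming a hypothetical helper `clamp_replicas` (not a function from CA or MCM):

```python
def clamp_replicas(current, delta, min_size, max_size):
    """Return the replica count after applying delta, kept within [min_size, max_size].

    delta > 0 models a scale out (unscheduled pods),
    delta < 0 models a scale in (underutilised nodes).
    """
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(max_size, current + delta))
```

For example, `clamp_replicas(3, 2, 1, 4)` yields 4 rather than 5, and `clamp_replicas(1, -1, 1, 4)` stays at 1.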

MCM's responsibility:

Continuously attempt to reconcile MachineDeployment, MachineSet and Machine objects as per the desired state. If a machine does not become healthy within 20 mins (current timeout), it should be solely MCM's job to launch another machine and stop/delete the older Failed machine.
React to requests from CA to scale MachineDeployments up and down.

There are other responsibilities for each of the above actors; however, we have only listed the ones where there is an overlap.
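One possible shape for this split is that CA only submits scale-down requests and MCM is the single writer of the replica count. The sketch below is an assumption for illustration, not a design stated in this issue: `ScaleDownRequest`, `SingleWriterMCM` and the policy of rejecting requests for Failed machines are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScaleDownRequest:
    """Hypothetical message from CA: 'please remove this machine'."""
    machine: str


@dataclass
class SingleWriterMCM:
    """Toy MCM that is the only actor allowed to lower replicas."""
    replicas: int
    machines: dict = field(default_factory=dict)  # machine name -> phase

    def handle(self, request):
        """Apply a CA request; return True if replicas was reduced."""
        phase = self.machines.get(request.machine)
        if phase is None or phase == "Failed":
            # Assumed policy: an unknown machine, or a Failed one that MCM is
            # already replacing, must not shrink the deployment.
            return False
        del self.machines[request.machine]
        self.replicas -= 1
        return True
```

In the scenario from the Context section, a request for the Failed machine would be rejected and the replica count would stay at 1, so the replacement VM is not scaled away. A request for a healthy, underutilised machine would still go through.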

unmarshall added the kind/enhancement Enhancement, improvement, extension label Feb 28, 2023

elankath mentioned this issue Feb 28, 2023

Consider failed machine as terminating #118

Closed

himanshu-kun added priority/3 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers labels Mar 1, 2023

mattburgess mentioned this issue Jun 8, 2023

Scale down of machineset during rolling update gardener/machine-controller-manager#826

Closed

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Nov 8, 2023

unmarshall mentioned this issue Jan 19, 2024

☂️ CA-MCM Overhaul gardener/machine-controller-manager#895

Open

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 17, 2024
