Ensure that there is a single actor which reduces the machine deployment replicas #181
Labels
kind/enhancement
Enhancement, improvement, extension
lifecycle/rotten
Nobody worked on this for 12 months (final aging stage)
needs/planning
Needs (more) planning with other MCM maintainers
priority/3
Priority (lower number equals higher priority)
What would you like to be added:
Context:
Issue #118 highlights that even a small time difference between CA and MCM can lead to CA's MCM provider reducing the replicas of a MachineDeployment to 0 and, in the process, deleting newly launched VMs. In that issue, a Machine was transitioned to the Failed state by the Machine controller because it could not start successfully (20-minute timeout). The Machine controller then launches a new VM. In the meantime, CA also sees the failed Machine (1-2 seconds earlier than MCM) and marks it as a candidate for deletion. This is handled by the MCM provider (https://github.com/gardener/autoscaler/blob/machine-controller-manager-provider/cluster-autoscaler/cloudprovider/mcm/mcm_manager.go#L435), which adds a priority annotation and reduces the replicas of the MachineDeployment. In the issue, the original number of replicas was 1, and CA reduces it to 0. MCM, which was in the middle of launching another VM, now sees that the replicas are set to 0 and stops all machines. This happens because the single responsibility principle is broken with respect to managing the replicas of a machine deployment.
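The sequence described above can be replayed with a small, deterministic Go sketch. It is a toy model only: `Deployment`, `mcmReplaceFailed`, `caScaleDown`, and `mcmReconcile` are hypothetical names invented for illustration and do not correspond to the real MCM or CA code. The steps run one after another here, whereas in the real system they interleave within a second or two.

```go
package main

import "fmt"

// Phase is a simplified machine phase (assumption: the real MCM has more states).
type Phase string

const (
	Failed  Phase = "Failed"
	Pending Phase = "Pending"
)

// Deployment is a toy stand-in for a MachineDeployment (hypothetical type).
type Deployment struct {
	Replicas int
	Machines map[string]Phase
}

// mcmReplaceFailed mimics MCM: every Failed machine is removed and a
// replacement is launched in Pending phase.
func mcmReplaceFailed(d *Deployment) {
	var failed []string
	for name, phase := range d.Machines {
		if phase == Failed {
			failed = append(failed, name)
		}
	}
	for _, name := range failed {
		delete(d.Machines, name)
		d.Machines[name+"-replacement"] = Pending
	}
}

// caScaleDown mimics CA's provider acting on its own view of the failed
// machine: it lowers Replicas without knowing a replacement is in flight.
func caScaleDown(d *Deployment) {
	if d.Replicas > 0 {
		d.Replicas--
	}
}

// mcmReconcile mimics MCM converging the machine count to Replicas.
func mcmReconcile(d *Deployment) {
	for name := range d.Machines {
		if len(d.Machines) <= d.Replicas {
			break
		}
		delete(d.Machines, name)
	}
}

// simulateRace replays the scenario with an initial replica count of 1.
func simulateRace() (replicas, machines int) {
	d := &Deployment{Replicas: 1, Machines: map[string]Phase{"m1": Failed}}
	mcmReplaceFailed(d) // MCM starts launching a new VM
	caScaleDown(d)      // CA reduces replicas from 1 to 0
	mcmReconcile(d)     // MCM sees replicas = 0 and removes the new machine
	return d.Replicas, len(d.Machines)
}

func main() {
	r, m := simulateRace()
	fmt.Printf("replicas=%d machines=%d\n", r, m) // prints "replicas=0 machines=0"
}
```

Both actors write to the same replica count, so the deployment ends with no machines even though one healthy replacement was wanted.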
Why is this needed:
There is a need to define clear boundaries in the responsibility set between CA and MCM so as to prevent CA stepping over MCM.
CA's responsibility:
MCM's responsibility:
There are other responsibilities of each of the above actors, however we have only listed the ones where there is an overlap.
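One possible shape for a single-actor boundary is sketched below in Go. This is an assumption made for illustration, not the design proposed in this issue: CA hands over a request naming the machine it wants removed, and only one function, owned by MCM, ever lowers the replica count. `ScaleDownRequest` and `applyScaleDown` are hypothetical names, and phases are plain strings to keep the example short.

```go
package main

import "fmt"

// Deployment is a toy stand-in for a MachineDeployment (hypothetical type).
type Deployment struct {
	Replicas int
	Machines map[string]string // machine name -> phase
}

// ScaleDownRequest is what CA would submit instead of editing Replicas
// directly (hypothetical type).
type ScaleDownRequest struct {
	Machine string
}

// applyScaleDown is the only place where Replicas is reduced. It rejects
// requests that refer to a machine that no longer exists or is not Running,
// for example one that MCM has already replaced after a failure.
func applyScaleDown(d *Deployment, req ScaleDownRequest) bool {
	phase, ok := d.Machines[req.Machine]
	if !ok || phase != "Running" {
		return false // stale request: leave Replicas untouched
	}
	delete(d.Machines, req.Machine)
	d.Replicas--
	return true
}

func main() {
	// m1 failed and was already replaced; CA's request for m1 is stale.
	d := &Deployment{
		Replicas: 1,
		Machines: map[string]string{"m1-replacement": "Pending"},
	}
	applied := applyScaleDown(d, ScaleDownRequest{Machine: "m1"})
	fmt.Println(applied, d.Replicas) // prints "false 1"
}
```

With the decrement behind one owner, a request based on an outdated view is dropped instead of silently scaling the deployment to 0.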