
Conversation


@thiyyakat thiyyakat commented Dec 10, 2025

What this PR does / why we need it:

This PR introduces a feature that allows operators and end users to preserve a machine/node and the backing VM for diagnostic purposes.

The expected behaviour, use cases, and usage are detailed in the proposal, which can be found here

Which issue(s) this PR fixes:
Fixes #1008

Special notes for your reviewer:

Note: This section will be changed once I raise the PR

  1. The code changes have been manually tested using the mcm-provider-virtual.
  2. Some lines have been introduced in machine.go for simulating failure and recovery of nodes with the help of label "test-failed". These will be removed when I raise the PR.
  3. All the unit tests pass.
  4. The following cases have been tested manually (a condensed sketch of this behaviour follows the list):
  • When a random value is assigned to preserve on the machine:
    • if a node exists with the annotation, the node value is synced
    • a log is printed
  • When preserve=now is set on the node and the machine has no annotations:
    • the CA scale-down disable annotation must be set on the node
    • the node condition must be set for Preserved
    • the annotation should be synced to MC
    • machine status.currentStatus.preserveExpiryTime should be set
    • when changed to preserve=false:
      • the annotation is synced to MC
      • the node condition should be set to false
      • the CA annotation should be removed
      • machine status.currentStatus.preserveExpiryTime should be set to nil (removed)
  • When preserve=now is set on the machine and the node has no annotations:
    • the CA scale-down disable annotation must be set on the node
    • the node condition must be set for Preserved
    • annotations must not be synced to the node
    • machine status.currentStatus.preserveExpiryTime should be set
    • when changed to preserve=false:
      • the annotation should not be synced to the node
      • the node condition should be set to false
      • the CA annotation should be removed
      • machine status.currentStatus.preserveExpiryTime should be set to nil (removed)
  • When preserve=when-failed is set on the machine:
    • the annotation should not be set on the node
    • check if status.currentStatus.phase == Failed; if yes:
      • machine status.currentStatus.preserveExpiryTime should be set on the machine
      • the CA annotation should be added on the node
      • the node condition should be added on the node
    • when changed to preserve=false:
      • machine status.currentStatus.preserveExpiryTime should be set to nil, if preserved
      • the CA annotations should be removed from the node and the machine
      • if the machine is still in the Failed phase, terminate the machine
  • When the annotation value for preserve changes from now to when-failed:
    • the annotation should be changed on the node too
    • check if the phase is Failed:
      • if yes, only change preserveExpiryTime
      • if no, set preserveExpiryTime to nil
  • When the annotation value changes from when-failed to now:
    • the annotation should be changed on the node too
    • check if the machine is already preserved:
      • if yes, update preserveExpiryTime
      • if no:
        • set preserveExpiryTime
        • the CA annotations should be added to the node
  • Deletion should be de-prioritised on a scale-down of the MachineDeployment (kubectl scale machinedeployment) if the machine is preserved
    • if the scale-down requires it anyway, deletion must go through to maintain the replica count
  • If a preserved Failed machine moves to Running, pods must get scheduled onto it
  5. Additionally, I have manually tested whether the code interferes with rolling updates - it doesn't seem to. I have also checked that drain works, though only with the virtual provider.
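In short, the intended behaviour boils down to something like the following sketch. This is illustrative only; the names are placeholders, not the actual implementation in this PR.

// Illustrative only: placeholder name, not the PR's actual function.
func sketchReconcilePreserveAnnotation(preserveValue string, machineFailed bool) {
	switch preserveValue {
	case "now":
		// set status.currentStatus.preserveExpiryTime, add the CA scale-down-disabled
		// annotation and the Preserved node condition on the backing node
	case "when-failed":
		// do nothing until machineFailed (phase == Failed), then preserve as in the "now" case
	case "false":
		// clear preserveExpiryTime, remove the CA annotation, set the Preserved
		// node condition to false; terminate the machine if it is still Failed
	default:
		// unknown value: only keep the node/machine annotation values in sync and log it
	}
}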

TODO:

  1. If the autoscaler annotation is modified by the user, it should be reset to true
  2. Add unit tests
  3. Test behaviour if isMachineCandidateForPreservation() in manageReplicas() returns an error.

Release note:


@gardener-robot gardener-robot added kind/api-change API change with impact on API users needs/second-opinion Needs second review by someone else needs/rebase Needs git rebase labels Dec 10, 2025
@gardener-robot

@thiyyakat You need to rebase this pull request with the latest master branch. Please check.

@gardener-robot gardener-robot added needs/review Needs review size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 10, 2025
# Conflicts:
#	pkg/util/provider/machinecontroller/machine.go
#	pkg/util/provider/machinecontroller/machine_util.go
* remove use of machineStatusUpdate in machine preservation code since it uses a similarity check
* introduce check of phase change in updateMachine() to initiate drain of preserved machine on failure. This check is only for preserved machines
* Introduce new annotation value for preservation `PreserveMachineAnnotationValuePreservedByMCM`
* Update Condition.Reason and Condition.Message to reflect preservation by user and auto-preservation
* Update Machine Deployment Spec to include AutoPreservedFailedMachineMax
* Modify MachineSet controller to update status with count of auto-preserved machines
* Add updated CRDs and generated code
* Split larger functions into smaller ones
* Remove debug comments
* Add comments where required
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from dece79a to 06ecf58 on December 10, 2025 at 11:47
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from 06ecf58 to 89f2900 on December 10, 2025 at 12:06

thiyyakat commented Dec 11, 2025

Questions that remain unanswered:

  1. On recovery of a preserved machine, it transitions from Failed to Running. However, if the preserve annotation was when-failed, then the node continues to be preserved in Running even though the annotation says when-failed - is that okay? The node needs to be preserved so that pods can get scheduled onto it without CA scaling it down.
  2. The drain timeout is currently checked by calculating the time from LastUpdateTime (from when the machine moved to Failed) to now. Is there a better way to do it?
    timeOutOccurred = utiltime.HasTimeOutOccurred(machine.Status.CurrentStatus.LastUpdateTime, timeOutDuration)
    In the normal drain, it is checked with respect to the DeletionTimestamp.
  3. In some parts of the code, checks are performed to see whether the returned error is due to a Conflict, in which case ConflictRetry rather than ShortRetry is returned. When should these checks be performed? The preservation flow has a lot of update calls (a rough sketch of the check follows this list).
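For reference, the kind of check meant in question 3 looks roughly like the fragment below, assuming the usual apierrors helper from apimachinery and the existing machineutils retry constants; where exactly to apply it in the preservation flow is the open question.

if err != nil {
	if apierrors.IsConflict(err) {
		// another actor updated the object concurrently; retry with the conflict period
		return machineutils.ConflictRetry, err
	}
	return machineutils.ShortRetry, err
}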


@thiyyakat thiyyakat left a comment


Note: A review meeting was held today for this PR. The comments below were given during the meeting.

During the meeting, we revisited the decision to move the drain to the Failed state for preserved machines. The reason discussed previously was that it did not make sense semantically to move the machine to Terminating and then perform the drain, because the machine may still recover. Since Terminating is a final state, the drain (separate from the drain in triggerDeletionFlow) will be performed in the Failed phase. No change was proposed during the meeting; this design decision was only reconfirmed.


@takoverflow takoverflow left a comment


I have only gone through half of the PR and have some suggestions, PTAL.

// or if it is a candidate for auto-preservation
// TODO@thiyyakat: find more suitable name for function
func (c *controller) isMachineCandidateForPreservation(ctx context.Context, machineSet *v1alpha1.MachineSet, machine *v1alpha1.Machine) (bool, error) {
	if machineutils.IsPreserveExpiryTimeSet(machine) && !machineutils.HasPreservationTimedOut(machine) {
Member


IsPreserveExpiryTimeSet already checks that the time is non-zero, and only then is HasPreservationTimedOut called.
Is there any reason to perform the redundant IsZero check for PreserveExpiryTime again in HasPreservationTimedOut?
I don't see the function being called anywhere else either.

If the zero check is removed, it could just be simplified to

func HasPreservationTimedOut(m *v1alpha1.Machine) bool {
	return !m.Status.CurrentStatus.PreserveExpiryTime.After(time.Now())
}
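For context, IsPreserveExpiryTimeSet is assumed to look roughly like the sketch below (the actual helper in machineutils may differ), which is why the extra IsZero check inside HasPreservationTimedOut would be redundant.

// Assumed shape of the existing helper, for illustration only.
func IsPreserveExpiryTimeSet(m *v1alpha1.Machine) bool {
	return !m.Status.CurrentStatus.PreserveExpiryTime.IsZero()
}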

	}
	nodeName := machine.Labels[v1alpha1.NodeLabelKey]
	if nodeName != "" {
		preservedCondition := v1.NodeCondition{
Member


Consider renaming this to preservedConditionFalse?

Comment on lines 2475 to 2493
	err := nodeops.AddOrUpdateConditionsOnNode(ctx, c.targetCoreClient, nodeName, preservedCondition)
	if err != nil {
		return err
	}
	// Step 2: remove CA's scale-down disabled annotations to allow CA to scale down node if needed
	CAAnnotations := make(map[string]string)
	CAAnnotations[autoscaler.ClusterAutoscalerScaleDownDisabledAnnotationKey] = ""
	latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		klog.Errorf("error trying to get backing node %q for machine %s. Retrying, error: %v", nodeName, machine.Name, err)
		return err
	}
	latestNodeCopy := latestNode.DeepCopy()
	latestNodeCopy, _, _ = annotations.RemoveAnnotation(latestNodeCopy, CAAnnotations) // error can be ignored, always returns nil
	_, err = c.targetCoreClient.CoreV1().Nodes().Update(ctx, latestNodeCopy, metav1.UpdateOptions{})
	if err != nil {
		klog.Errorf("Node UPDATE failed for node %q of machine %q. Retrying, error: %s", nodeName, machine.Name, err)
		return err
	}

@takoverflow takoverflow Dec 18, 2025


Is there a reason why there are two Get and Update calls made for a node? Can these not be combined into a single atomic node object update?

And I know this is not part of your PR, but can we update this RemoveAnnotation function? It's needlessly complicated.
All you have to do after fetching the object and checking that its annotations are non-nil is

delete(obj.Annotations, annotationKey)

Creating a dummy annotation map, passing it in, and then creating a new map which doesn't have the key - all of this complication can be avoided.
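Concretely, the suggested simplification would look roughly like this, reusing the names from the snippet quoted above (error handling trimmed for brevity):

latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
if err != nil {
	return err
}
nodeCopy := latestNode.DeepCopy()
if nodeCopy.Annotations != nil {
	// drop the CA scale-down-disabled annotation directly, no helper needed
	delete(nodeCopy.Annotations, autoscaler.ClusterAutoscalerScaleDownDisabledAnnotationKey)
}
_, err = c.targetCoreClient.CoreV1().Nodes().Update(ctx, nodeCopy, metav1.UpdateOptions{})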


@thiyyakat thiyyakat Dec 23, 2025


By two Get() calls, are you referring to the call within AddOrUpdateConditionsOnNode and the following Get() here:
latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})?

The first one could be avoided if we didn't use the function. The second one is required because step 1 adds conditions to the node object and the function does not return the updated node object. Fetching from the cache doesn't guarantee an up-to-date node object (I tested this out empirically). I could potentially avoid fetching the objects if I didn't use the function; I will test it out.

The two update calls cannot be combined since step 1 requires an UpdateStatus() call, while step 2 changes the node's annotations and requires an Update() call.

I will update the RemoveAnnotation function as recommended by you.

Edit: The RemoveAnnotation function returns a boolean indicating whether or not an update is needed, and this value is used at other call sites of the function, so the function itself cannot be changed. I will use your suggestion instead of calling the function here, since the boolean value is not required in this case.
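Roughly, the two writes look like this (the node variable names below are placeholders, not actual variables from the PR): node conditions live in the status subresource, while annotations live on the object itself, so they need separate API calls.

// Placeholders: nodeWithCondition / nodeWithoutCAAnnotation are illustrative names only.
_, err = c.targetCoreClient.CoreV1().Nodes().UpdateStatus(ctx, nodeWithCondition, metav1.UpdateOptions{}) // step 1: conditions (status subresource)
_, err = c.targetCoreClient.CoreV1().Nodes().Update(ctx, nodeWithoutCAAnnotation, metav1.UpdateOptions{}) // step 2: annotations (object metadata)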

// stopMachinePreservation stops the preservation of the machine and node
func (c *controller) stopMachinePreservation(ctx context.Context, machine *v1alpha1.Machine) error {
	// check if preserveExpiryTime is set, if not, no need to do anything
	if !machineutils.IsPreserveExpiryTimeSet(machine) {
Member


Can there be scenarios where the preserveExpiryTime hasn't been set but the node has the preserve condition
and the scale-down disabled annotation added to it? If so, then the removal will never proceed, right?

Please let me know if it's not a possible scenario.

Member Author


Setting the preserveExpiryTime is the first step in machine preservation. The node condition and the CA annotation are added only if step 1 completes successfully. However, if a user manually adds the CA annotation and the node condition, but not the preserveExpiryTime, then the case you described may occur. I'm not sure we should handle that case, though.

	if nodeName == "" && isExpirySet {
		return true, nil
	}
	node, err := c.nodeLister.Get(nodeName)
Member


What happens when a machine doesn't have the nodeName set, i.e. nodeName is "" and isExpirySet is false?
Wouldn't this always fail? Why even try to get the node in that case?

Why not move the check above outside of this function, i.e. inside preserveMachine, fetch the nodeName there and
use isExpirySet. WDYT?

	if nodeName == "" && isExpirySet {
		return true, nil
	}
	if isExpirySet {
		isComplete, err := c.isMachinePreservationComplete(machine, nodeName)
		if err != nil {
			return err
		}
	}

Comment on lines 996 to 1009
// if machine is preserved, stop preservation. Else, do nothing.
// this check is done in case the annotation value has changed from preserve=now to preserve=when-failed, in which case preservation needs to be stopped
preserveExpirySet := machineutils.IsPreserveExpiryTimeSet(clone)
machineFailed := machineutils.IsMachineFailed(clone)
if !preserveExpirySet && !machineFailed {
	return
} else if !preserveExpirySet {
	err = c.preserveMachine(ctx, clone, preserveValue)
	return
}
// Here, we do not stop preservation even when preserve expiry time is set but the machine is in Running.
// This is to accommodate the case where the annotation is when-failed and the machine has recovered from Failed to Running.
// In this case, we want the preservation to continue so that CA does not scale down the node before pods are assigned to it
return

@takoverflow takoverflow Dec 18, 2025


Please revisit this case; the comments and the code seem to contradict each other. If you wish to compare the oldMachine annotation value with the newMachine's
to make decisions about stopping preservation etc., consider utilising updateMachine, which would have both objects available.
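A hypothetical sketch of that comparison inside updateMachine, assuming both the old and the new machine object are in scope (the annotation key name below is a placeholder, not necessarily the one introduced in this PR):

oldVal := oldMachine.Annotations[preserveAnnotationKey] // placeholder key name
newVal := newMachine.Annotations[preserveAnnotationKey]
if oldVal == "now" && newVal == "when-failed" && !machineutils.IsMachineFailed(newMachine) {
	// the machine is no longer eligible for immediate preservation, so stop it
	err = c.stopMachinePreservation(ctx, newMachine)
}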

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Dec 18, 2025
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from 22c646e to 7c062b5 on December 19, 2025 at 08:30
* fix edge case of handling switch from preserve=now to when-failed
* Create map in package with valid preserve annotation values
* Fix bug where node condition's reason wouldn't get updated after toggling of preservation
* remove duplicate function to check preservation timeout
* rename variables
* reduce get calls
* remove usage of RemoveAnnotations()
