Skip to content

Conversation

@kshalot
Copy link
Contributor

@kshalot kshalot commented Oct 17, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements the PodTermination action controller described in #7311.

Which issue(s) this PR fixes:

Fixes #6757

Special notes for your reviewer:

This is a proof-of-concept, very rough sketch of the implementation that I used for testing. Leaving it here, as I'm going on vacation, in case this work needs to be urgently picked up for some reason. If not, it will be continued in 2 weeks.

Does this PR introduce a user-facing change?

Introduce a mechanism to terminate Pods "stuck" in a terminating state due to node failures.
The feature is activated by enabling the alpha FailureRecoveryPolicy feature gate (disabled by default).
Only Pods with the kueue.x-k8s.io/safe-to-forcefully-terminate annotation are handled by the mechanism.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 17, 2025
@netlify
Copy link

netlify bot commented Oct 17, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit d15bb80
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/691f446ac33d9d000890ee8e

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 17, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @kshalot. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 17, 2025
@mimowo
Copy link
Contributor

mimowo commented Oct 17, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 17, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 31, 2025
@kshalot kshalot force-pushed the 6757-gracefully-handle-zombie-pods branch from 9aa3994 to 5c43140 Compare November 4, 2025 08:46
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2025
@kshalot kshalot force-pushed the 6757-gracefully-handle-zombie-pods branch from 5c43140 to 7a60a5d Compare November 4, 2025 09:46
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 4, 2025
@kshalot kshalot force-pushed the 6757-gracefully-handle-zombie-pods branch 3 times, most recently from 1c11692 to a008618 Compare November 4, 2025 11:51
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 5, 2025
@kshalot kshalot force-pushed the 6757-gracefully-handle-zombie-pods branch from a008618 to da09e19 Compare November 7, 2025 15:01
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 7, 2025
@kshalot kshalot force-pushed the 6757-gracefully-handle-zombie-pods branch 2 times, most recently from 6ce4de0 to 544079a Compare November 12, 2025 13:39
g.Expect(k8sClient.Get(ctx, types.NamespacedName{Name: matchingPod.Name, Namespace: matchingPod.Namespace}, matchingPod)).
To(gomega.Succeed())
g.Expect(matchingPod.Status.Phase).Should(gomega.Equal(corev1.PodFailed))
}, forcefulTerminationTimeout, util.Interval).Should(gomega.Succeed())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this flake? because we expect the Pod to transition to failed after forcefulTerminationTimeout, at the same time this is only how long we are waiting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your question raises a valid point about naming, because forcefulTerminationTimeout is simply the timeout for eventually/consistently, not the "grace period" used in the feature.

Since the tests don't seem to use the value of the forceful termination grace period, I just set it to time.Millisecond here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I interpreted the name as the constant as also used by the feature. Let me check again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I renamed the test timeout to forcefulTerminationCheckTimeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm, just check locally the tests don't flake by running them in a loop something like 50 times. We don't want introducing flakes before release

KueueFinalizer()

cases := map[string]struct {
testPod *corev1.Pod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's split this into pod and wantPod. Always supplied to avoid tricks in https://github.com/kubernetes-sigs/kueue/pull/7312/files#diff-547b122e7c9dbc55a906f70d85ca16010acd456bb20e02158debbd9cd1b23a46R168-R171

Yes, this is more lines of test code, but very declarative in nature

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wantPod was already in the struct, it was just optional because most cases expected wantPod == testPod. I made it explicit in 606dcc6.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

}

func (r *TerminatingPodReconciler) Update(u event.UpdateEvent) bool {
oldPod := u.ObjectOld.(*corev1.Pod)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC these castings are required for the oldschool registration of filtering, check how this is done in cohort controller to use strong typing which then does not require explicit casting in code: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/core/cohort_controller.go#L109-L130

Reason: KueueForcefulTerminationReason,
Message: eventMessage,
})
if err := r.client.Status().Patch(ctx, podPatch, client.MergeFrom(pod)); err != nil {
Copy link
Contributor

@mimowo mimowo Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch may potentially override some conditions or status changes done concurrently by another controller. To avoid that use our helper in clientutil which supports "strict" mode which compares the ResrouceVersion.

}

// Pod was already terminated
if utilpod.IsTerminated(pod) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the check to the event handlers too. And wrap all the Pod related checks to a helper
func preconditionsMet(p *corev1.Pod) bool, feel free to change the naming.
Then from Update call it for the newPod

@mimowo
Copy link
Contributor

mimowo commented Nov 21, 2025

/lgtm
/approve
Thank you 👍 I think the remaining comments aren't really blocking for Alpha so merging to avoid conflicts with other PRs. I will still approve the follow ups before the release.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 67abe4973c968439bea409980a53bde254d290fe

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kshalot, mimowo, olekzabl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2025
@mimowo
Copy link
Contributor

mimowo commented Nov 21, 2025

@mimowo
Copy link
Contributor

mimowo commented Nov 21, 2025

/test pull-kueue-test-unit-main

@k8s-ci-robot k8s-ci-robot merged commit 2e3bd8a into kubernetes-sigs:main Nov 21, 2025
22 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.15 milestone Nov 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gracefully handle Pods "stuck" terminating due to kubelet being down

5 participants