Implement the PodTermination controller to gracefully handle "stuck" pods
#7312
Hi @kshalot. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
        g.Expect(k8sClient.Get(ctx, types.NamespacedName{Name: matchingPod.Name, Namespace: matchingPod.Namespace}, matchingPod)).
            To(gomega.Succeed())
        g.Expect(matchingPod.Status.Phase).Should(gomega.Equal(corev1.PodFailed))
    }, forcefulTerminationTimeout, util.Interval).Should(gomega.Succeed())
Wouldn't this flake? We expect the Pod to transition to Failed after forcefulTerminationTimeout, but that is also exactly how long we wait for it.
I think your question raises a valid point about naming, because forcefulTerminationTimeout is simply the timeout for eventually/consistently, not the "grace period" used in the feature.
Since the tests don't seem to use the value of the forceful termination grace period, I just set it to time.Millisecond here.
Ah, yes, I read the name as the constant that the feature itself uses. Let me check again.
FYI I renamed the test timeout to forcefulTerminationCheckTimeout.
sgtm, just check locally that the tests don't flake by running them in a loop, something like 50 times. We don't want to introduce flakes before the release.
        KueueFinalizer()

    cases := map[string]struct {
        testPod *corev1.Pod
Let's split this into pod and wantPod, both always supplied, to avoid tricks like the one in https://github.com/kubernetes-sigs/kueue/pull/7312/files#diff-547b122e7c9dbc55a906f70d85ca16010acd456bb20e02158debbd9cd1b23a46R168-R171
Yes, this is more lines of test code, but it is very declarative in nature.
wantPod was already in the struct, it was just optional because most cases expected wantPod == testPod. I made it explicit in 606dcc6.
lgtm
    }

    func (r *TerminatingPodReconciler) Update(u event.UpdateEvent) bool {
        oldPod := u.ObjectOld.(*corev1.Pod)
IIUC these casts are only required by the old-style registration of event filters. Check how the cohort controller does it: it uses strongly typed handlers, which removes the need for explicit casts in the code: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/core/cohort_controller.go#L109-L130
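The idea behind the strongly typed handlers can be sketched with plain Go generics. The types below are local stand-ins written for this example, loosely modelled on the generic event and predicate types in controller-runtime; they are not the library's definitions, and `phaseChanged` is a hypothetical filter.

```go
package main

import "fmt"

// pod is a hypothetical, minimal stand-in for corev1.Pod.
type pod struct {
	Name  string
	Phase string
}

// typedUpdateEvent is a local stand-in for a generic update event: both
// objects already have the concrete type T, so no type assertion is needed.
type typedUpdateEvent[T any] struct {
	ObjectOld, ObjectNew T
}

// typedFuncs is a local stand-in for a typed predicate holder.
type typedFuncs[T any] struct {
	UpdateFunc func(typedUpdateEvent[T]) bool
}

// Update lets the event through when no filter is configured.
func (f typedFuncs[T]) Update(e typedUpdateEvent[T]) bool {
	if f.UpdateFunc == nil {
		return true
	}
	return f.UpdateFunc(e)
}

// phaseChanged is an illustrative filter; note the absence of casts such as
// u.ObjectOld.(*corev1.Pod).
func phaseChanged(e typedUpdateEvent[*pod]) bool {
	return e.ObjectOld.Phase != e.ObjectNew.Phase
}

func main() {
	filter := typedFuncs[*pod]{UpdateFunc: phaseChanged}
	e := typedUpdateEvent[*pod]{
		ObjectOld: &pod{Name: "a", Phase: "Running"},
		ObjectNew: &pod{Name: "a", Phase: "Failed"},
	}
	fmt.Println(filter.Update(e))
}
```

With this shape a wrong object type becomes a compile error instead of a runtime panic from a failed type assertion.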
        Reason:  KueueForcefulTerminationReason,
        Message: eventMessage,
    })
    if err := r.client.Status().Patch(ctx, podPatch, client.MergeFrom(pod)); err != nil {
This patch may override conditions or other status changes made concurrently by another controller. To avoid that, use our helper in clientutil, which supports a "strict" mode that compares the ResourceVersion.
    }

    // Pod was already terminated
    if utilpod.IsTerminated(pod) {
Add the check to the event handlers too, and wrap all the Pod-related checks in a helper such as
func preconditionsMet(p *corev1.Pod) bool (feel free to change the naming).
Then call it from Update for the newPod.
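A minimal sketch of that helper and its two call sites follows. The `pod` struct, the `DeletionRequested` field, and the local `isTerminated` are hypothetical stand-ins (the latter assumes that "terminated" means the Succeeded or Failed phase); the actual set of preconditions is up to the PR.

```go
package main

import "fmt"

// pod is a hypothetical, minimal stand-in for corev1.Pod.
type pod struct {
	Name              string
	Phase             string
	DeletionRequested bool // stands in for a set deletionTimestamp (assumption)
}

// isTerminated is a local stand-in for utilpod.IsTerminated, assuming it
// means the pod reached the Succeeded or Failed phase.
func isTerminated(p *pod) bool {
	return p.Phase == "Succeeded" || p.Phase == "Failed"
}

// preconditionsMet gathers every pod-level check in one place so the event
// handlers and the reconciler cannot drift apart.
func preconditionsMet(p *pod) bool {
	return p != nil && p.DeletionRequested && !isTerminated(p)
}

// onUpdate is the event filter: only the new object matters.
func onUpdate(oldPod, newPod *pod) bool {
	return preconditionsMet(newPod)
}

// reconcile repeats the check because the pod may have changed between the
// event and the reconciliation.
func reconcile(p *pod) string {
	if !preconditionsMet(p) {
		return "skip"
	}
	return "terminate"
}

func main() {
	stuck := &pod{Name: "a", Phase: "Running", DeletionRequested: true}
	done := &pod{Name: "b", Phase: "Failed", DeletionRequested: true}
	fmt.Println(onUpdate(nil, stuck), reconcile(stuck))
	fmt.Println(onUpdate(nil, done), reconcile(done))
}
```

Filtering in the handler keeps already terminated pods out of the work queue, while the repeated check in the reconciler covers pods that terminate after the event was queued.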
/lgtm
LGTM label has been added. Git tree hash: 67abe4973c968439bea409980a53bde254d290fe
[APPROVALNOTIFIER] This PR is APPROVED.
This pull-request has been approved by: kshalot, mimowo, olekzabl.
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/test pull-kueue-test-unit-main
What type of PR is this?
/kind feature
What this PR does / why we need it:
Implements the PodTermination action controller described in #7311.
Which issue(s) this PR fixes:
Fixes #6757
Special notes for your reviewer:
This is a proof-of-concept, very rough sketch of the implementation that I used for testing. Leaving it here, as I'm going on vacation, in case this work needs to be urgently picked up for some reason. If not, it will be continued in 2 weeks.
Does this PR introduce a user-facing change?