
[e2e] Ignore cluster delete failures #2745

Open

trasc wants to merge 1 commit into main from flaky-e2e-delete-cluster
Conversation

trasc
Contributor

@trasc trasc commented Aug 1, 2024

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Ignore the failure of cluster deletion in the e2e test's cleanup.

Which issue(s) this PR fixes:

Fixes #2738

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. labels Aug 1, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: trasc
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 1, 2024
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 1, 2024

netlify bot commented Aug 1, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

🔨 Latest commit: 8fcd637
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66b340b7b47864000890d532
😎 Deploy Preview: https://deploy-preview-2745--kubernetes-sigs-kueue.netlify.app

@@ -34,7 +34,7 @@ function cluster_cleanup {
   $KIND export logs $ARTIFACTS --name $1 || true
   kubectl describe pods -n kueue-system > $ARTIFACTS/$1-kueue-system-pods.log || true
   kubectl describe pods > $ARTIFACTS/$1-default-pods.log || true
-  $KIND delete cluster --name $1
+  $KIND delete cluster -v=3 --name $1 || echo "Ignoring cluster delete failure."
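The new line follows the same pattern as the `|| true` guards on the preceding lines, but logs a message instead of failing silently. A minimal standalone illustration of the difference (a sketch, not part of the script):

```shell
# A failing command normally aborts a `set -e` script; both forms below
# mask the non-zero exit status, but only the second leaves a trace in the log.
set -e
false || true                                     # ignored silently
false || echo "Ignoring cluster delete failure."  # ignored, but logged
echo "script continues"
```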
Contributor


Can you point out another example where we ignore such a failure?

Contributor Author


the three lines above

Contributor


No strong view, but maybe it makes sense to merge only the -v=3 change, so that we get more information if this repeats?
Otherwise we will not know when it happens.

Contributor Author


Unfortunately I doubt that -v=3 will make much of a difference (based on how I see the implementation in kind).

Contributor


So I see two paths forward:

  1. only increase log verbosity now (the level TBD, maybe v5 is better?), and wait for it to recur to see if we can understand it
  2. ignore the errors as in the current version of the PR

I would be slightly leaning towards 1, as we are not under any pressure (the flake is rare), and it would be nice to understand what is going on.

OTOH I'm ok to accept it as is, if this is @tenzen-y's preference.

Member


The linked kind issue is really old, do we have current examples?

Contributor


Cluster delete failure should not be happening.

Yeah, but they do, as shown in the example. I'm not sure about the frequency.

Cluster deletion is reentrant and will succeed if there is nothing to do.

So, we may expect there is a good chance a retry would help? This might be safer, but it is a code complication, so I'm not convinced either.

Maybe we just park it and only proceed with ignoring if this repeats, so that we see the frequency? Did you see this before @trasc or @tenzen-y ?
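The retry option under discussion could be sketched as a small wrapper in the cleanup script; the `retry` helper below is hypothetical and not part of this PR:

```shell
# Hypothetical retry helper (not part of this PR): run a command up to
# $1 times, pausing briefly between attempts, and fail only if all attempts fail.
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "Attempt $i/$attempts failed: $*" >&2
    sleep 1
  done
  return 1
}

# Possible use inside cluster_cleanup, with $KIND and $1 as in the existing script:
#   retry 3 $KIND delete cluster --name "$1"
```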

Member


Kind is just dropping any output of the docker delete command and only prints the exit code.

That's not accurate: in kind, exec errors contain the command output (which is always captured), and it is printed in the error message ...

Contributor Author


Kind is just dropping any output of the docker delete command and only prints the exit code.

That's not accurate: in kind, exec errors contain the command output (which is always captured), and it is printed in the error message ...

Indeed, I missed that.

Cluster delete failure should not be happening. Cluster deletion is reentrant and will succeed if there is nothing to do.

It is, more or less, happening in our case, as cleanup_dind() still finds and deletes all the containers afterwards:

Waiting 30 seconds for pods stopped with terminationGracePeriod:30
Cleaning up after docker
ed68da3fb667
6beb571d417e
bf442dfcffc1
...

Did you see this before @trasc or @tenzen-y ?

No

Contributor Author

@trasc trasc Aug 5, 2024


@BenTheElder I think we could make DeleteNodes a two-step operation: docker stop (which can send two signals with a timeout between them) and docker rm after that.

I can open a PR in kind for this, just let me know.
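On the docker CLI level, the proposed two-step deletion would look roughly like this (a sketch of the idea only; kind's actual DeleteNodes implementation differs, and the helper name is hypothetical):

```shell
# Sketch of a two-step node deletion (hypothetical, not kind's code).
# docker stop sends SIGTERM, waits up to --time seconds, then SIGKILL;
# docker rm then removes the already-stopped container.
delete_node() {
  local node=$1
  docker stop --time 30 "$node"   # graceful stop with a 30s grace period
  docker rm --volumes "$node"     # remove the stopped container and its volumes
}
```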

@trasc
Contributor Author

trasc commented Aug 1, 2024

/cc @tenzen-y

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 5, 2024
@mimowo
Contributor

mimowo commented Aug 6, 2024

/close
I don't think we have enough evidence to justify ignoring cluster deletion errors in Kueue based on a single observed failure (see comment). I'm concerned that doing so could sweep issues under the carpet, leading to unnoticed cluster leaks by the team. A similar sentiment was expressed in this comment.

For now, I'm open to exploring solutions on another layer, possibly as suggested in this comment, but I'll leave the decision to @BenTheElder.

I'm also open to revisiting the idea of ignoring errors in Kueue if this issue becomes more frequent, we rule out that the errors are related to Kueue, and we've exhausted options for resolving it at other layers.

We can keep the issue open for further discussion.

@k8s-ci-robot
Contributor

@mimowo: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 6, 2024
@mimowo
Contributor

mimowo commented Aug 7, 2024

/reopen
I synced with @trasc, who suggests we should keep the PR open for discussion and possibly merging.
I would still lean towards a fix in kind, but let's keep both options open.

@k8s-ci-robot
Contributor

@mimowo: Failed to re-open PR: state cannot be changed. The flaky-e2e-delete-cluster branch was force-pushed or recreated.

In response to this:

/reopen


@mimowo
Contributor

mimowo commented Aug 7, 2024

/reopen
retrying

@k8s-ci-robot
Contributor

@mimowo: Failed to re-open PR: state cannot be changed. The flaky-e2e-delete-cluster branch was force-pushed or recreated.

In response to this:

/reopen
retrying


@trasc
Contributor Author

trasc commented Aug 7, 2024

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 7, 2024
@k8s-ci-robot
Contributor

@trasc: Reopened this PR.

In response to this:

/reopen


@k8s-ci-robot k8s-ci-robot removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 7, 2024
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 7, 2024
@trasc trasc closed this Oct 9, 2024
@trasc trasc deleted the flaky-e2e-delete-cluster branch October 9, 2024 06:47
@trasc trasc restored the flaky-e2e-delete-cluster branch October 9, 2024 06:48
@trasc trasc reopened this Oct 9, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flaky test] e2e tests occasionally fail when deleting kind cluster
5 participants