
[e2e] Ignore cluster delete failures #2745

Open

trasc wants to merge 1 commit into main from flaky-e2e-delete-cluster
Conversation

trasc
Contributor

@trasc trasc commented Aug 1, 2024

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Ignore the failure of cluster deletion in the e2e test's cleanup.

Which issue(s) this PR fixes:

Fixes #2738

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. labels Aug 1, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: trasc
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 1, 2024
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 1, 2024

netlify bot commented Aug 1, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

🔨 Latest commit: 8fcd637
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66b340b7b47864000890d532
😎 Deploy Preview: https://deploy-preview-2745--kubernetes-sigs-kueue.netlify.app

@@ -34,7 +34,7 @@ function cluster_cleanup {
   $KIND export logs $ARTIFACTS --name $1 || true
   kubectl describe pods -n kueue-system > $ARTIFACTS/$1-kueue-system-pods.log || true
   kubectl describe pods > $ARTIFACTS/$1-default-pods.log || true
-  $KIND delete cluster --name $1
+  $KIND delete cluster -v=3 --name $1 || echo "Ignoring cluster delete failure."
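The new line follows the same pattern as the `|| true` guards on the preceding lines, but logs a message instead of failing silently. A minimal standalone illustration of the difference (a sketch, not part of the script):

```shell
# A failing command normally aborts a `set -e` script; both forms below
# mask the non-zero exit status, but only the second leaves a trace in the log.
set -e
false || true                                     # ignored silently
false || echo "Ignoring cluster delete failure."  # ignored, but logged
echo "script continues"
```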
Contributor


Can you point out another example where we ignore such a failure?

Contributor Author


the three lines above

Contributor


No strong view, but maybe it makes sense to merge only the -v=3 change, so that we get more information if this repeats?
Otherwise we will not know when it happens.

Contributor Author


Unfortunately I doubt that -v=3 will make much of a difference (based on how I see the implementation in kind).

Contributor


So I see two paths forward:

  1. only increase log verbosity now (the level TBD, maybe v5 is better?), and wait for it to recur to see if we can understand it
  2. ignore the errors as in the current version of the PR

I would be slightly leaning towards 1, as we are not under any pressure (the flake is rare), and it would be nice to understand what is going on.

OTOH I'm ok to accept it as is, if this is @tenzen-y's preference.

Member


The linked kind issue is really old, do we have current examples?

Contributor


Cluster delete failure should not be happening.

Yeah, but they do, as shown in the example. I'm not sure about the frequency.

Cluster deletion is reentrant and will succeed if there is nothing to do.

So, we may expect there is a good chance a retry would help? This might be safer, but it is a code complication, so I'm not convinced either.

Maybe we just park it and only proceed with ignoring if this repeats, so that we see the frequency? Did you see this before @trasc or @tenzen-y ?
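The retry option under discussion could be sketched as a small wrapper in the cleanup script; the `retry` helper below is hypothetical and not part of this PR:

```shell
# Hypothetical retry helper (not part of this PR): run a command up to
# $1 times, pausing briefly between attempts, and fail only if all attempts fail.
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "Attempt $i/$attempts failed: $*" >&2
    sleep 1
  done
  return 1
}

# Possible use inside cluster_cleanup, with $KIND and $1 as in the existing script:
#   retry 3 $KIND delete cluster --name "$1"
```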

Member


Kind is just dropping any output of the docker delete command and only prints the exit code.

That's not accurate: in kind, exec errors contain the command output (which is always captured), and it is printed in the error message ...

Contributor Author


Kind is just dropping any output of the docker delete command and only prints the exit code.

That's not accurate: in kind, exec errors contain the command output (which is always captured), and it is printed in the error message ...

Indeed, I missed that.

Cluster delete failure should not be happening. Cluster deletion is reentrant and will succeed if there is nothing to do.

It is, more or less, happening in our case, as cleanup_dind() still finds and deletes all the containers afterwards:

Waiting 30 seconds for pods stopped with terminationGracePeriod:30
Cleaning up after docker
ed68da3fb667
6beb571d417e
bf442dfcffc1
...

Did you see this before @trasc or @tenzen-y ?

No

Contributor Author

@trasc trasc Aug 5, 2024


@BenTheElder I think we could make DeleteNodes a two-step operation: docker stop (which can send two signals with a timeout between them) and docker rm after that.

I can open a PR in kind for this, just let me know.
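On the docker CLI level, the proposed two-step deletion would look roughly like this (a sketch of the idea only; kind's actual DeleteNodes implementation differs, and the helper name is hypothetical):

```shell
# Sketch of a two-step node deletion (hypothetical, not kind's code).
# docker stop sends SIGTERM, waits up to --time seconds, then SIGKILL;
# docker rm then removes the already-stopped container.
delete_node() {
  local node=$1
  docker stop --time 30 "$node"   # graceful stop with a 30s grace period
  docker rm --volumes "$node"     # remove the stopped container and its volumes
}
```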

@trasc
Contributor Author

trasc commented Aug 1, 2024

/cc @tenzen-y

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 5, 2024
@mimowo
Contributor

mimowo commented Aug 6, 2024

/close
I don't think we have enough evidence to justify ignoring cluster deletion errors in Kueue based on a single observed failure (see comment). I'm concerned that doing so could sweep issues under the carpet, leading to unnoticed cluster leaks by the team. A similar sentiment was expressed in this comment.

For now, I'm open to exploring solutions on another layer, possibly as suggested in this comment, but I'll leave the decision to @BenTheElder.

I'm also open to revisiting the idea of ignoring errors in Kueue if this issue becomes more frequent, we rule out that the errors are related to Kueue, and we've exhausted options for resolving it at other layers.

We can keep the issue open for further discussion.

@k8s-ci-robot
Contributor

@mimowo: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 6, 2024
@mimowo
Contributor

mimowo commented Aug 7, 2024

/reopen
I synced with @trasc, who suggests we should keep the PR open for discussion and possibly merging.
I would still lean towards a fix in kind, but let's keep both options open.

@k8s-ci-robot
Contributor

@mimowo: Failed to re-open PR: state cannot be changed. The flaky-e2e-delete-cluster branch was force-pushed or recreated.

In response to this:

/reopen


@mimowo
Contributor

mimowo commented Aug 7, 2024

/reopen
retrying

@k8s-ci-robot
Contributor

@mimowo: Failed to re-open PR: state cannot be changed. The flaky-e2e-delete-cluster branch was force-pushed or recreated.

In response to this:

/reopen
retrying


@trasc
Contributor Author

trasc commented Aug 7, 2024

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 7, 2024
@k8s-ci-robot
Contributor

@trasc: Reopened this PR.

In response to this:

/reopen


@k8s-ci-robot k8s-ci-robot removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 7, 2024
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 7, 2024
@trasc trasc closed this Oct 9, 2024
@trasc trasc deleted the flaky-e2e-delete-cluster branch October 9, 2024 06:47
@trasc trasc restored the flaky-e2e-delete-cluster branch October 9, 2024 06:48
@trasc trasc reopened this Oct 9, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flaky test] e2e tests occasionally fail when deleting kind cluster
5 participants