[Flaky test] e2e tests occasionally fail when deleting kind cluster #2738
Comments
/cc @mbobrovskyi @trasc
/kind flake
/assign
That timeout is part of the test image and the failure at that point is already ignored. The issue is indeed related to
This shouldn't be happening, and as far as I know it isn't happening in Kubernetes's e2e tests. In the future, when you see issues like this, please go ahead and reach out to the kind project. cc @aojea I'm fairly occupied today but can probably dig into this by sometime Monday.
Right, I've never seen this in core k8s, and this is the first time I see it in Kueue too, so maybe this is some very rare one-off.
😮💨
What are these e2e tests doing with the network, @mimowo?
This is something that we observe on every build, including successful ones; I also see it on JobSet e2e tests and core k8s e2e tests (example link).
These are tests for MultiKueue. We run 3 kind clusters (one manager and 2 workers). We disconnect the network between the manager and a worker with a dedicated command; this is done to simulate transient connectivity issues between the clusters.
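For illustration only, such a partition can be approximated by detaching a worker node container from kind's shared docker network. This is a minimal sketch assuming kind's default network name `kind` and a hypothetical node container name; it is not necessarily the exact command the suite runs:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	const network = "kind"                          // docker network created by kind by default
	const workerNode = "kind-worker1-control-plane" // hypothetical node container name

	// Detach the worker's node container from the shared network, cutting the
	// link between the manager cluster and this worker.
	if out, err := exec.Command("docker", "network", "disconnect", network, workerNode).CombinedOutput(); err != nil {
		log.Fatalf("disconnect failed: %v: %s", err, out)
	}

	// Hold the partition long enough for MultiKueue to observe the outage.
	time.Sleep(30 * time.Second)

	// Reattach the container so the rest of the suite can proceed.
	if out, err := exec.Command("docker", "network", "connect", network, workerNode).CombinedOutput(); err != nil {
		log.Fatalf("reconnect failed: %v: %s", err, out)
	}
}
```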
It seems the other bug with the same symptom, reported in kubernetes/kubernetes#123313, was fixed by kubernetes/test-infra#32245.
Yeah, that's different. The process cleanup of the docker daemon is less concerning when we're successfully deleting the node containers (which we did in that link, before the docker daemon was shut down). I don't think that's related, but it should also be tracked (kubernetes/test-infra#33227). In that link we can see
I think they may also run on different clusters (k8s-infra-prow-build vs the k8s infra EKS cluster) with a different OS, machine type, etc. There may be some quirk in the environment that differs between them.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
What happened:
The periodic e2e test failed on deleting the kind cluster: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-multikueue-e2e-main/1818706058688860160.
This can happen even when the entire test suite is green.
It looks like a rare flake (https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-multikueue-e2e-main):
What you expected to happen:
No random failures.
How to reproduce it (as minimally and precisely as possible):
Repeat the build; it happened on a periodic build.
Anything else we need to know?:
The logs from the failure:
It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.
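As a rough sketch of that idea (the real timeout lives in the test scripts/image, so the function, cluster name, and the 45s value below are illustrative assumptions, not the project's actual code), the deletion could be bounded like this:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os/exec"
	"time"
)

// deleteKindCluster runs `kind delete cluster --name <name>` and gives up
// after the supplied timeout, killing the command if it runs too long.
func deleteKindCluster(name string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, "kind", "delete", "cluster", "--name", name).CombinedOutput()
	if err != nil {
		return fmt.Errorf("deleting kind cluster %q: %w: %s", name, err, out)
	}
	return nil
}

func main() {
	// Hypothetical cluster name; the suite runs one manager and two worker clusters.
	if err := deleteKindCluster("kind-manager", 45*time.Second); err != nil {
		log.Fatal(err)
	}
}
```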