[Flaky test] e2e tests occasionally fail when deleting kind cluster #2738

Open
mimowo opened this issue Aug 1, 2024 · 11 comments · May be fixed by #2745
Labels: kind/bug, kind/flake, lifecycle/stale

Comments

mimowo (Contributor) commented Aug 1, 2024

What happened:

The periodic e2e test failed when deleting the kind cluster: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-multikueue-e2e-main/1818706058688860160.

This can happen even when the entire test suite is green.

It looks like a rare flake (https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-multikueue-e2e-main).

What you expected to happen:

No random failures.

How to reproduce it (as minimally and precisely as possible):

Re-run the build; it happened on a periodic build.

Anything else we need to know?:

The logs from the failure:

Ginkgo ran 1 suite in 3m16.816345076s
Test Suite Passed
Switched to context "kind-kind-manager".
Exporting logs for cluster "kind-manager" to:
/logs/artifacts/run-test-multikueue-e2e-1.30.0
No resources found in default namespace.
Deleting cluster "kind-manager" ...
ERROR: failed to delete cluster "kind-manager": failed to delete nodes: command "docker rm -f -v kind-manager-control-plane" failed with error: exit status 1

Command Output: Error response from daemon: cannot remove container "/kind-manager-control-plane": could not kill: tried to kill container, but did not receive an exit event
make: *** [Makefile-test.mk:100: run-test-multikueue-e2e-1.30.0] Error 1
+ EXIT_VALUE=2
+ set +o xtrace
Cleaning up after docker in docker.
================================================================================
Waiting 30 seconds for pods stopped with terminationGracePeriod:30
Cleaning up after docker
ed68da3fb667
6beb571d417e
bf442dfcffc1
Waiting for docker to stop for 30 seconds
Stopping Docker: dockerProgram process in pidfile '/var/run/docker-ssd.pid', 1 process(es), refused to die.

It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.
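
For illustration only, a sketch of what spending the extra headroom as a retry loop (rather than a single fixed wait) could look like; this is not the actual test-image cleanup code, and the retry count and sleep are made up:

    # Hypothetical sketch: retry the kind teardown with a short backoff
    # instead of relying on one attempt within a fixed grace period.
    # The cluster name matches the logs above.
    for attempt in 1 2 3; do
      kind delete cluster --name kind-manager && break
      echo "kind delete failed (attempt ${attempt}), retrying in 15s..." >&2
      sleep 15
    done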

mimowo added the kind/bug label on Aug 1, 2024
mimowo (Contributor, Author) commented Aug 1, 2024

/cc @mbobrovskyi @trasc

mimowo (Contributor, Author) commented Aug 1, 2024

/kind flake

k8s-ci-robot added the kind/flake label on Aug 1, 2024
trasc (Contributor) commented Aug 1, 2024

/assign

trasc linked a pull request (#2745) on Aug 1, 2024 that will close this issue
trasc (Contributor) commented Aug 1, 2024

It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.

That timeout is part of the test image and the failure at that point is already ignored.

The issue is indeed related to kind delete. However, since there is very little we can do about it, and it has nothing to do with the e2e suites, we should just ignore it as we do with other cleanup steps.
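
A minimal sketch of what ignoring the teardown failure could look like in the cleanup script, assuming the cluster names from the logs above (the worker names are inferred); the actual change is in the linked PR:

    # Hypothetical sketch: treat kind cluster teardown as best-effort so a
    # flaky delete cannot fail the e2e run; `|| echo` swallows the non-zero
    # exit status, mirroring how other cleanup steps are ignored.
    for cluster in kind-manager kind-worker1 kind-worker2; do
      kind delete cluster --name "${cluster}" || \
        echo "WARNING: failed to delete cluster ${cluster}, ignoring" >&2
    done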

BenTheElder (Member) commented:
This shouldn't be happening, and as far as I know it isn't happening in Kubernetes's own e2e tests.

In the future when you see issues like this please go ahead and reach out to the kind project.

cc @aojea

I'm fairly occupied today but can probably dig into this by sometime Monday.

mimowo (Contributor, Author) commented Aug 2, 2024

This shouldn't be happening, and as far as I know it isn't happening in Kubernetes's own e2e tests.

Right, I've never seen this in core k8s, and this is the first time I've seen it in Kueue too, so maybe this is some very rare one-off.

aojea commented Aug 2, 2024

refused to die.

😮‍💨

https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kueue-test-multikueue-e2e-main/1818706058688860160/build-log.txt

ERROR: failed to delete cluster "kind-manager": failed to delete nodes: command "docker rm -f -v kind-manager-control-plane" failed with error: exit status 1
Command Output: Error response from daemon: cannot remove container "/kind-manager-control-plane": could not kill: tried to kill container, but did not receive an exit event

What are these e2e tests doing with the network, @mimowo?

/home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:463
  STEP: wait for check active @ 07/31/24 18:01:33.731
  STEP: Disconnecting worker1 container from the kind network @ 07/31/24 18:01:34.06
  STEP: Waiting for the cluster to become inactive @ 07/31/24 18:01:34.54
  STEP: Reconnecting worker1 container to the kind network @ 07/31/24 18:02:19.212
  STEP: Waiting for the cluster do become active @ 07/31/24 18:02:49.147
• [77.390 seconds]

mimowo (Contributor, Author) commented Aug 5, 2024

refused to die.

😮‍💨

This is something that we observe on every build, including successful ones, but I see it also on JobSet e2e tests and core k8s e2e tests, example link.

What are these e2e tests doing with the network, @mimowo?

These are tests for MultiKueue. We run 3 kind clusters (one manager and 2 workers). We disconnect the network between the manager and a worker using the command docker network disconnect kind kind-worker1-control-plane, and later we reconnect the clusters (see the sketch below).

It is done to simulate transient connectivity issues between the clusters.
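
For reference, the partition/heal sequence, using the exact disconnect command quoted above; docker network connect is the standard counterpart that re-attaches the container to the kind bridge network:

    # Simulate a transient network partition between the manager and worker1.
    docker network disconnect kind kind-worker1-control-plane
    # ... the test waits for MultiKueue to mark the worker cluster inactive ...
    docker network connect kind kind-worker1-control-plane
    # ... then waits for the cluster to become active again.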

aojea commented Aug 5, 2024

It seems the other bug reported in kubernetes/kubernetes#123313 with the same symptom was fixed by kubernetes/test-infra#32245.

BenTheElder (Member) commented:

This is something that we observe on every build, including successful ones, but I see it also on JobSet e2e tests and core k8s e2e tests, example link.

Yeah, that's different: the process cleanup of the docker daemon is less concerning when we're successfully deleting the node containers (which we did in that link, prior to the issue tearing down the docker daemon). I don't think that's related, but it should also be tracked (kubernetes/test-infra#33227).

In that link we can see kind delete cluster successfully deleting the nodes without timeout issues.

so maybe this is some very rare one-off.

I think they may also run on different clusters (k8s-infra-prow-build vs the k8s-infra EKS cluster) with different OS, machine type, etc. There may be some quirk of the environment that differs between them.

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Nov 3, 2024