[Flaky test] e2e tests occasionally fail when deleting kind cluster #2738

Open
mimowo opened this issue Aug 1, 2024 · 11 comments · May be fixed by #2745
Labels: kind/bug, kind/flake, lifecycle/stale

Comments

mimowo (Contributor) commented Aug 1, 2024

What happened:

The periodic e2e test failed when deleting the kind cluster: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-multikueue-e2e-main/1818706058688860160.

This can happen even when the entire test suite is green.

It looks like a rare flake (https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-multikueue-e2e-main).

What you expected to happen:

No random failures.

How to reproduce it (as minimally and precisely as possible):

Re-run the build; it happened on a periodic build.

Anything else we need to know?:

The logs from the failure:

Ginkgo ran 1 suite in 3m16.816345076s
Test Suite Passed
Switched to context "kind-kind-manager".
Exporting logs for cluster "kind-manager" to:
/logs/artifacts/run-test-multikueue-e2e-1.30.0
No resources found in default namespace.
Deleting cluster "kind-manager" ...
ERROR: failed to delete cluster "kind-manager": failed to delete nodes: command "docker rm -f -v kind-manager-control-plane" failed with error: exit status 1

Command Output: Error response from daemon: cannot remove container "/kind-manager-control-plane": could not kill: tried to kill container, but did not receive an exit event
make: *** [Makefile-test.mk:100: run-test-multikueue-e2e-1.30.0] Error 1
+ EXIT_VALUE=2
+ set +o xtrace
Cleaning up after docker in docker.
================================================================================
Waiting 30 seconds for pods stopped with terminationGracePeriod:30
Cleaning up after docker
ed68da3fb667
6beb571d417e
bf442dfcffc1
Waiting for docker to stop for 30 seconds
Stopping Docker: dockerProgram process in pidfile '/var/run/docker-ssd.pid', 1 process(es), refused to die.

It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.
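
For illustration only, a sketch of what spending the extra headroom as a retry loop (rather than a single fixed wait) could look like; this is not the actual test-image cleanup code, and the retry count and sleep are made up:

    # Hypothetical sketch: retry the kind teardown with a short backoff
    # instead of relying on one attempt within a fixed grace period.
    # The cluster name matches the logs above.
    for attempt in 1 2 3; do
      kind delete cluster --name kind-manager && break
      echo "kind delete failed (attempt ${attempt}), retrying in 15s..." >&2
      sleep 15
    done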

mimowo added the kind/bug label on Aug 1, 2024
mimowo (Contributor, Author) commented Aug 1, 2024

/cc @mbobrovskyi @trasc

mimowo (Contributor, Author) commented Aug 1, 2024

/kind flake

k8s-ci-robot added the kind/flake label on Aug 1, 2024
trasc (Contributor) commented Aug 1, 2024

/assign

trasc linked a pull request (#2745) on Aug 1, 2024 that will close this issue
trasc (Contributor) commented Aug 1, 2024

It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.

That timeout is part of the test image and the failure at that point is already ignored.

The issue is indeed related to kind delete. However, since there is very little we can do about it, and it has nothing to do with the e2e suites, we should just ignore it as we do with other cleanup steps.
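
A minimal sketch of what ignoring the teardown failure could look like in the cleanup script, assuming the cluster names from the logs above (the worker names are inferred); the actual change is in the linked PR:

    # Hypothetical sketch: treat kind cluster teardown as best-effort so a
    # flaky delete cannot fail the e2e run; `|| echo` swallows the non-zero
    # exit status, mirroring how other cleanup steps are ignored.
    for cluster in kind-manager kind-worker1 kind-worker2; do
      kind delete cluster --name "${cluster}" || \
        echo "WARNING: failed to delete cluster ${cluster}, ignoring" >&2
    done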

BenTheElder (Member) commented:
This shouldn't be happening, and as far as I know it isn't happening in Kubernetes's own e2e tests.

In the future when you see issues like this please go ahead and reach out to the kind project.

cc @aojea

I'm fairly occupied today but can probably dig into this by sometime Monday.

mimowo (Contributor, Author) commented Aug 2, 2024

This shouldn't be happening, and as far as I know it isn't happening in Kubernetes's own e2e tests.

Right, I've never seen this in core k8s, and this is the first time I've seen it in Kueue too, so maybe this is some very rare one-off.

aojea commented Aug 2, 2024

refused to die.

😮‍💨

https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kueue-test-multikueue-e2e-main/1818706058688860160/build-log.txt

ERROR: failed to delete cluster "kind-manager": failed to delete nodes: command "docker rm -f -v kind-manager-control-plane" failed with error: exit status 1
Command Output: Error response from daemon: cannot remove container "/kind-manager-control-plane": could not kill: tried to kill container, but did not receive an exit event

What are these e2e tests doing with the network, @mimowo?

/home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:463
  STEP: wait for check active @ 07/31/24 18:01:33.731
  STEP: Disconnecting worker1 container from the kind network @ 07/31/24 18:01:34.06
  STEP: Waiting for the cluster to become inactive @ 07/31/24 18:01:34.54
  STEP: Reconnecting worker1 container to the kind network @ 07/31/24 18:02:19.212
  STEP: Waiting for the cluster do become active @ 07/31/24 18:02:49.147
• [77.390 seconds]

mimowo (Contributor, Author) commented Aug 5, 2024

refused to die.

😮‍💨

This is something that we observe on every build, including successful ones, but I see it also on JobSet e2e tests and core k8s e2e tests, example link.

What are these e2e tests doing with the network, @mimowo?

These are tests for MultiKueue. We run 3 kind clusters (one manager and 2 workers). We disconnect the network between the manager and a worker using the command docker network disconnect kind kind-worker1-control-plane, and later we reconnect the clusters (see the sketch below).

It is done to simulate transient connectivity issues between the clusters.
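
For reference, the partition/heal sequence, using the exact disconnect command quoted above; docker network connect is the standard counterpart that re-attaches the container to the kind bridge network:

    # Simulate a transient network partition between the manager and worker1.
    docker network disconnect kind kind-worker1-control-plane
    # ... the test waits for MultiKueue to mark the worker cluster inactive ...
    docker network connect kind kind-worker1-control-plane
    # ... then waits for the cluster to become active again.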

aojea commented Aug 5, 2024

It seems the other bug reported in kubernetes/kubernetes#123313 with the same symptom was fixed by kubernetes/test-infra#32245.

BenTheElder (Member) commented:

This is something that we observe on every build, including successful ones, but I see it also on JobSet e2e tests and core k8s e2e tests, example link.

Yeah, that's different: the process cleanup of the docker daemon is less concerning when we're successfully deleting the node containers (which we did in that link, prior to the issue tearing down the docker daemon). I don't think that's related, but it should also be tracked (kubernetes/test-infra#33227).

In that link we can see kind delete cluster successfully deleting the nodes without timeout issues.

so maybe this is some very rare one-off.

I think they may also run on different clusters (k8s-infra-prow-build vs the k8s-infra EKS cluster) with different OS, machine type, etc. There may be some quirk of the environment that differs between them.

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Nov 3, 2024