flakes in clusterctl upgrade tests #11133

cahillsf · 2024-09-03T21:09:58Z

summarized by @chrischdi 🙇

According to aggregated failures of the last two weeks, we still have some flakyness on our clusterctl upgrade tests.

3 Failures: Internal error occurred: failed calling webhook [...] connect: connection refused
- Component: CAPD
- Branches:
  - main
  - release-1.8
2 Failures: x509: certificate signed by unknown authority
- Component: unknown
- Branches:
  - main
  - release-1.8
  - release-1.7
2 failures: failed to run clusterctl version:
- Component: clusterctl/test framework it seems
- Branches:
  - release-1.7
  - release-1.6
~~5 Failures~~: Timed out waiting for Machine Deployment clusterctl-upgrade/clusterctl-upgrade-workload-... to have 2 replicas
- [2024-10-21 update] no failures in recent history but did have some as recently as 2024-10-03. Unclear what introduced the improvement (this was relatively infrequent to begin with)
- Component: unknown
- Branches:
  - release-1.8
  - main
~~2 Failures: Timed out waiting for Cluster clusterctl-upgrade/clusterctl-upgrade-workload-... to provision~~ - resolved with: 🌱 test/e2e: decrease concurrency #11220
- Component: unknown
- Branches:
  - release-1.8
  - main
~~36 failures: Timed out waiting for all Machines to exist~~ split off into: clusterctl upgrade Timed out waiting for all Machines to exist #11209
~~16 Failures: Failed to create kind cluster~~ - resolved with: 🌱 test/e2e: decrease concurrency #11220
- Component: e2e setup
- Branches:
  - main
  - release-1.7

Link to check if messages changed or we have new flakes on clusterctl upgrade tests: here

/kind flake

The text was updated successfully, but these errors were encountered:

tormath1 · 2024-09-04T12:42:02Z

I'll have a look to the "Failed to create kind cluster" issue as I already noticed something similar on my own Kind setup and I think it's not isolated: kubernetes-sigs/kind#3554 - I guess it's something to fix upstream.

EDIT: It seems to be an issue with inodes:

$ kind create cluster --retain --name=cluster3
Creating cluster "cluster3" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
$ podman logs -f 7eb0838e6bb2
...
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 12 (bookworm)!

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

chrischdi · 2024-09-04T17:51:52Z

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files

That sounds very suspicious regarding to
https://main.cluster-api.sigs.k8s.io/user/troubleshooting.html?highlight=sysctl#cluster-api-with-docker----too-many-open-files

Maybe would be a good start here to collect data about the actual used values :-)

BenTheElder · 2024-09-04T18:53:50Z

"cluster": "eks-prow-build-cluster",

I don't know if we're running https://github.com/kubernetes/k8s.io/blob/3f2c06a3c547765e21dce65d0adcb1144a93b518/infra/aws/terraform/prow-build-cluster/resources/kube-system/tune-sysctls_daemonset.yaml#L4 there or not

Also perhaps something else on the cluster is using a lot of them.

ameukam · 2024-09-04T22:34:59Z

I confirm the daemonset runs on the EKS cluster.

tormath1 · 2024-09-05T12:31:53Z

Thanks folks for confirming that the daemon set is correctly setting the sysctl parameters - so the error might be elsewhere, I noticed something else while reading the logs¹ of a failing test:

$ cat journal.log | grep -i "Journal started"
Aug 30 06:35:27 clusterctl-upgrade-management-fba3o1-control-plane systemd-journald[95]: Journal started
$ cat journal.log | grep -i "multi-user"
Aug 30 06:35:51 clusterctl-upgrade-management-fba3o1-control-plane systemd[1]: Reached target multi-user.target - Multi-User System.

While on a non failing setup:

root@kind-control-plane:/# journalctl | grep -i "multi-user"
Sep 05 12:16:31 kind-control-plane systemd[1]: Reached target multi-user.target - Multi-User System.
root@kind-control-plane:/# journalctl | grep -i "Journal started"
Sep 05 12:16:31 kind-control-plane systemd-journald[98]: Journal started

We can see that the multi-user.target² is reached at the same time as the journal started to log. On a failing test, there is a already 24 seconds of difference. I'm wondering if randomly (under heavy load) we don't reach the 30 seconds of timeout³ for reaching the multi-user.target hence the failure.

BenTheElder · 2024-09-09T23:03:09Z

It's possible? This part shouldn't really take long though ..

I suspect that would be a noisy neighbor problem on the EKS cluster (I/O?)

Doesn't explain the inotify-exhaustion like failures.

sbueringer · 2024-09-10T06:16:14Z

We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h.

We thought it's a nice way to save us time and the community money.

Maybe we have to roll that back

tormath1 · 2024-09-10T09:16:02Z

We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h.

We thought it's a nice way to save us time and the community money.

Maybe we have to roll that back

Do you remember when this change has been applied ? Those Kind failures seem to start by the end of August.

sbueringer · 2024-09-10T10:20:59Z

main: #11067
release-1.8: #11144

cahillsf · 2024-09-24T20:47:16Z

related note ( i split off one of the messages into a separate issue #11209 just so this conversation didn't get too convoluted) -- we are going to try reverting the concurrency increase and see if the situation improves

#11220
#11222

BenTheElder · 2024-09-24T21:35:22Z

We can see that the multi-user.target2 is reached at the same time as the journal started to log. On a failing test, there is a already 24 seconds of difference. I'm wondering if randomly (under heavy load) we don't reach the 30 seconds of timeout3 for reaching the multi-user.target hence the failure.

That makes sense, ordinarily this part shouldn't take long, it doesn't need to fetch anything over the network and it should be pretty fast.

But in a resource starved environment it might take to long. In that environment I would also expect Kubernetes to be unstable though, api-server/etcd will be timing out if we make it that far.

cahillsf · 2024-10-01T15:41:14Z

carrying over some updates from the split off issue -- we have seen great improvements in the flakiness of the e2e tests after reverting the concurrency increase:

clusterctl upgrade Timed out waiting for all Machines to exist #11209 (comment)
clusterctl upgrade Timed out waiting for all Machines to exist #11209 (comment)

the updated plan/guidance for the rest of this release cycle re these e2e flakes is here: #11209 (comment)

sbueringer · 2024-10-16T12:36:57Z

@cahillsf What is the current state?

cahillsf · 2024-10-21T21:30:38Z

@cahillsf What is the current state?

sorry @sbueringer was out on PTO, i just updated the description with the current status (one net new flake pattern)

i did some investigation on the most frequent:

3 Failures: Internal error occurred: failed calling webhook [...] connect: connection refused

failing example where the ClusterClass is not getting reconciled:
- https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1839604474990039040
- https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1839604474990039040/artifacts/clusters/clusterctl-upgrade-management-yu9mot/resources/clusterctl-upgrade/ClusterClass/quick-start.yaml (variables reconciled - RefVersionsUpToDate not True)
- CAPI logs: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1839604474990039040/artifacts/clusters/clusterctl-upgrade-management-yu9mot/logs/capi-system/capi-controller-manager/capi-controller-manager-768dc55749-t8gwj/manager.log (clusterclass failing to reconcile - observed generation not incremented)

i was not able to find much indication in the logs about what might be going wrong here. could be some issue with the kube components but it seems we are not collecting these in the artifacts for this test. any thoughts as to how to better troubleshoot this? maybe ill open a PR to try to collect the kube-system logs for this test?

sbueringer · 2024-10-22T08:20:21Z

@cahillsf No worries, welcome back :)

failing example where the ClusterClass is not getting reconciled

Not sure if you meant it that way, but I think the problem is not that CC is not getting reconciled. The problem is that we cannot deploy the DockerClusterTemplate:

Internal error occurred: failed calling webhook "default.dockerclustertemplate.infrastructure.cluster.x-k8s.io":
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1839604474990039040

The consequence of that is then that the ClusterClass cannot be reconciled.

The question is, why is the CAPD webhook not up at this time.

Can we do a quick PR to drop this in docker.yaml?

    - old: "--leader-elect"
      new: "--leader-elect\n        - --logging-format=json"

Using JSON logging for e2e tests was a nice idea at the time, but overall it just makes troubleshooting without Loki/Grafana (which is what we do most of the time) pretty painful.

cahillsf · 2024-10-22T13:01:37Z

yes thanks for clarifying

Can we do a quick PR to drop this in docker.yaml?

Sure thing, ill add this replacement for all of the v1.8 providers -- sound good?

sbueringer · 2024-10-22T13:54:03Z

I would drop them entirely from docker.yaml

cahillsf · 2024-10-22T14:41:09Z

oh oh i misread on the first pass (see you've already approved but dropping the PR here so it gets linked: #11318)

sbueringer · 2024-10-22T17:11:04Z

Okay so what I was looking for was basically a way to correlate when the test failed and when CAPD was coming up. To then try to figure out what's going on. This should be easier with "regular" timestamps now.

In general I would have assumed:

that clusterctl waits until providers are up (maybe worth to check)
that the kubectl apply has some sort of retry so it doesn't fail immediately (also worth to check)

cahillsf · 2024-10-23T00:56:14Z

that clusterctl waits until providers are up (maybe worth to check)

it does appear to, the management cluster is created with InitManagementClusterAndWatchControllerLogs which waits for all deployments to be available here by calling WaitForDeploymentsAvailable

the logic seems pretty straightforward and sound here? not sure if anything sticks out to you

that the kubectl apply has some sort of retry so it doesn't fail immediately (also worth to check)

it does not, CreateOrUpdate is called on the managementClusterProxy which is defined here and has no retry logic

can open a PR to add support for some retry option to be passed into CreateOrUpdate and see if that helps

sbueringer · 2024-10-23T08:48:26Z

the logic seems pretty straightforward and sound here? not sure if anything sticks out to you

Probably fine. I assume waiting for Available to be true on a Deployment means that the capi-controller Pod is Ready? (which would include readinessProbes to work which includes webhook server to be up)

It does not, CreateOrUpdate is called on the managementClusterProxy which is defined here and has no retry logic
can open a PR to add support for some retry option to be passed into CreateOrUpdate and see if that helps

Ah okay, I thought we might have a retry. But it's probably a good thing. Adding a retry there might hide some problems

I see potentially a gap between Deployment being Available / Pod being Ready and kube-proxy updating iptables so the webhook is actually reachable.

I think it would be important to clarify the timeline:

based on Pod timestamps, when is the webhook server actually up (to confirm clusterctl is waiting correctly)
when is the request failing

Let's try to get those timestamps.

If there is only a gap of 1-2 seconds we might want to retry CreateOrUpdate with dry-run until all webhooks work and then run CreateOrUpdate once without dry-run.

cahillsf · 2024-10-23T13:02:18Z

I think it would be important to clarify the timeline:

based on Pod timestamps, when is the webhook server actually up (to confirm clusterctl is waiting correctly)

when is the request failing
Let's try to get those timestamps.

If there is only a gap of 1-2 seconds we might want to retry CreateOrUpdate with dry-run until all webhooks work and then run CreateOrUpdate once without dry-run.

taking this same failure example, CreateOrUpdate call fails at 09/27/24 11:10:05.875:

[FAILED] Expected success, but got an error:
...
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:451 @ 09/27/24 11:10:05.875

but the CAPD logs are reflecting the webhook server as up at I0927 11:09:59.835554:

I0927 11:09:59.835554       1 server.go:242] "Serving webhook server" logger="controller-runtime.webhook" host="" port=9443`

so 6 seconds between the server as up and the call failing, not sure if we want to give the dry run approach a shot anyway?
maybe we add collection of kube-system logs to further investigate this?

I see potentially a gap between Deployment being Available / Pod being Ready and kube-proxy updating iptables so the webhook is actually reachable.

sbueringer · 2024-10-23T13:45:54Z

I'm fine with giving the dry-run approach a shot. I think it's the easiest way for us to verify that the webhooks of all controllers are actually reachable.

I don't expect that we'll see much in kube-proxy logs (but could be). I'm fine with also collecting logs from kube-system. Could be useful in general (e.g. kube-apiserver logs could be quite useful)

cahillsf · 2024-11-18T23:15:44Z

circling back to this @sbueringer now the most frequent flake (#11133 (comment)) is only occurring on release-1.8 branches: https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-18&text=connect%3A%20connection%20refused&job=.*periodic-cluster-api-e2e.*&test=.*clusterctl%20upgrades

so it seems this dry-run approach worked

(last flake on main is visible on this timeline and occurred 10/28 prior to the dry-run update merge on 10/29). do you think that is enough data to backport the changes to release-1.8?

BenTheElder · 2024-11-19T00:44:21Z

[I'm still lurking here but there's been a lot going on and it seems like you all are making progress 😅]

cahillsf · 2024-11-19T02:05:07Z

no conclusions, but adding some findings re:

2 Failures: x509: certificate signed by unknown authority

taking one specific example (have observed similar patterns in other failures) the test fails while migrating the CRs in the UpgradeManagementClusterAndWait process initiated here in the e2e test from v0.4.8 -> v1.6.8.

Note: the CRs are migrated before any changes are rolled out to the providers

looking at the CAPD controller logs, we can see they start filling up with these errors shortly after the upgrade command is run:

2024/11/02 01:44:37 http: TLS handshake error from 10.244.0.1:28755: remote error: tls: bad certificate
2024/11/02 01:44:38 http: TLS handshake error from 10.244.0.1:20689: remote error: tls: bad certificate
2024/11/02 01:44:39 http: TLS handshake error from 10.244.0.1:29264: remote error: tls: bad certificate
2024/11/02 01:44:40 http: TLS handshake error from 10.244.0.1:10669: remote error: tls: bad certificate

the migration eventually times with this line @ 11/02/24 01:45:01.663:

Error: failed to migrate clusterctl-upgrade/clusterctl-upgrade-workload-zszpga-control-plane: action failed after 10 attempts: Internal error occurred: failed calling webhook "validation.dockermachinetemplate.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capd-webhook-service.capd-system.svc:443/validate-infrastructure-cluster-x-k8s-io-v1alpha4-dockermachinetemplate?timeout=10s": x509: certificate signed by unknown authority

before the log stream cuts off, we can see these logs in the CAPD manager:

I1102 01:45:18.422998       1 deleg.go:130] controller-runtime/certwatcher "msg"="Updated current TLS certificate"  
I1102 01:45:18.423382       1 deleg.go:130] controller-runtime/certwatcher "msg"="Updated current TLS certificate"

not exactly sure where to go next with this -- this v0.4.8 version is using sigs.k8s.io/controller-runtime v0.9.7 maybe there is some issue with the certwatcher in this version?

sbueringer · 2024-11-20T05:16:38Z

(last flake on main is visible on this timeline and occurred 10/28 prior to the dry-run update merge on 10/29). do you think that is enough data to backport the changes to release-1.8?

Absolutely

sbueringer · 2024-11-20T05:26:22Z

@cahillsf probably related: #10522

The best idea I have is rerunning clusterctl upgrade if we hit this error (but of course only for clusterctl v0.4.x because we can't fix it anymore)

@chrischdi wdyt?

chrischdi · 2024-11-20T09:22:37Z

@cahillsf probably related: #10522

The best idea I have is rerunning clusterctl upgrade if we hit this error (but of course only for clusterctl v0.4.x because we can't fix it anymore)

@chrischdi wdyt?

This is testing the upgrade from v0.4.8 to v1.6.8 right? So the upgrade is done using clusterctl v1.6.8. So its about rerunning on v1.6.8.

Retrying clusterctl upgrade may help and maybe is the easiest 👍 .

Just to take a step back:

We start clusterctl upgrade which
- Upgrades cert-manager (to an old version)
- tries to migrate CRDs and then fails here

Another approach when retrying clusterctl upgrade is not good enough may be to run out-of-band the cert-manager upgrade (and upgrade to the same or a more recent version?!) + directly the CR migration including retries.

Afterwards run the normal clusterctl upgrade as before.

Note: and only for upgrades to v1.6.x :-)

Offtopic: Shoutout to @cahillsf and others involved, awesome progress here! ❤️

sbueringer · 2024-11-20T09:36:40Z

Yeah, just had this thought. The clusterctl version that we can't fix shouldn't be v0.4 but a newer one

sbueringer · 2024-11-20T09:41:37Z

@chrischdi Can we simply backport the longer timeout that you introduced in #10513 into release-1.7 and release-1.6?

Do we know if that issue also occurs in tests that use clusterctl v1.5?

If not I would consider doing one additional v1.6.x release and then we're good (we still have ProwJobs for release-1.6)

chrischdi · 2024-11-20T17:50:38Z

+1 on backporting that and also cut a v1.6.9 for this.

But v1.5 seems also to be affected:

link

chrischdi · 2024-11-20T17:53:30Z

so maybe:

backport the timeout for 1.7
retry upgrade for 1.5.x and v1.6.x? (-> for clusterctl binary <= v1.6)
- the second retry should succeed as cert-manager should get stable

k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 3, 2024

cahillsf mentioned this issue Sep 3, 2024

clusterctl upgrade tests are flaky #9688

Closed

cahillsf added this to CAPI v1.9 release improvement tasks Sep 3, 2024

fabriziopandini added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 5, 2024

k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Sep 5, 2024

fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Sep 5, 2024

k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 5, 2024

fabriziopandini added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 5, 2024

sbueringer mentioned this issue Sep 10, 2024

🐛 Make KCP pre-terminate hook more robust #11161

Merged

sbueringer mentioned this issue Sep 20, 2024

clusterctl upgrade Timed out waiting for all Machines to exist #11209

Closed

cahillsf mentioned this issue Oct 22, 2024

🌱 Remove json logging replacement in e2e test config #11318

Merged

cahillsf mentioned this issue Oct 29, 2024

🌱 Add dry-run CreateOrUpdate call in clusterctl upgrade e2e tests #11349

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flakes in clusterctl upgrade tests #11133

flakes in clusterctl upgrade tests #11133

cahillsf commented Sep 3, 2024 •

edited

Loading

tormath1 commented Sep 4, 2024 •

edited

Loading

chrischdi commented Sep 4, 2024

BenTheElder commented Sep 4, 2024

ameukam commented Sep 4, 2024

tormath1 commented Sep 5, 2024

BenTheElder commented Sep 9, 2024

sbueringer commented Sep 10, 2024

tormath1 commented Sep 10, 2024

sbueringer commented Sep 10, 2024

cahillsf commented Sep 24, 2024

BenTheElder commented Sep 24, 2024 •

edited

Loading

cahillsf commented Oct 1, 2024

sbueringer commented Oct 16, 2024

cahillsf commented Oct 21, 2024

sbueringer commented Oct 22, 2024 •

edited

Loading

cahillsf commented Oct 22, 2024 •

edited

Loading

sbueringer commented Oct 22, 2024

cahillsf commented Oct 22, 2024

sbueringer commented Oct 22, 2024 •

edited

Loading

cahillsf commented Oct 23, 2024

sbueringer commented Oct 23, 2024

cahillsf commented Oct 23, 2024

sbueringer commented Oct 23, 2024 •

edited

Loading

cahillsf commented Nov 18, 2024 •

edited

Loading

BenTheElder commented Nov 19, 2024

cahillsf commented Nov 19, 2024

sbueringer commented Nov 20, 2024

sbueringer commented Nov 20, 2024

chrischdi commented Nov 20, 2024 •

edited

Loading

sbueringer commented Nov 20, 2024

sbueringer commented Nov 20, 2024 •

edited

Loading

chrischdi commented Nov 20, 2024 •

edited

Loading

chrischdi commented Nov 20, 2024 •

edited

Loading

flakes in clusterctl upgrade tests #11133

flakes in clusterctl upgrade tests #11133

Comments

cahillsf commented Sep 3, 2024 • edited Loading

tormath1 commented Sep 4, 2024 • edited Loading

chrischdi commented Sep 4, 2024

BenTheElder commented Sep 4, 2024

ameukam commented Sep 4, 2024

tormath1 commented Sep 5, 2024

Footnotes

BenTheElder commented Sep 9, 2024

sbueringer commented Sep 10, 2024

tormath1 commented Sep 10, 2024

sbueringer commented Sep 10, 2024

cahillsf commented Sep 24, 2024

BenTheElder commented Sep 24, 2024 • edited Loading

cahillsf commented Oct 1, 2024

sbueringer commented Oct 16, 2024

cahillsf commented Oct 21, 2024

sbueringer commented Oct 22, 2024 • edited Loading

cahillsf commented Oct 22, 2024 • edited Loading

sbueringer commented Oct 22, 2024

cahillsf commented Oct 22, 2024

sbueringer commented Oct 22, 2024 • edited Loading

cahillsf commented Oct 23, 2024

sbueringer commented Oct 23, 2024

cahillsf commented Oct 23, 2024

sbueringer commented Oct 23, 2024 • edited Loading

cahillsf commented Nov 18, 2024 • edited Loading

BenTheElder commented Nov 19, 2024

cahillsf commented Nov 19, 2024

sbueringer commented Nov 20, 2024

sbueringer commented Nov 20, 2024

chrischdi commented Nov 20, 2024 • edited Loading

sbueringer commented Nov 20, 2024

sbueringer commented Nov 20, 2024 • edited Loading

chrischdi commented Nov 20, 2024 • edited Loading

chrischdi commented Nov 20, 2024 • edited Loading

cahillsf commented Sep 3, 2024 •

edited

Loading

tormath1 commented Sep 4, 2024 •

edited

Loading

BenTheElder commented Sep 24, 2024 •

edited

Loading

sbueringer commented Oct 22, 2024 •

edited

Loading

cahillsf commented Oct 22, 2024 •

edited

Loading

sbueringer commented Oct 22, 2024 •

edited

Loading

sbueringer commented Oct 23, 2024 •

edited

Loading

cahillsf commented Nov 18, 2024 •

edited

Loading

chrischdi commented Nov 20, 2024 •

edited

Loading

sbueringer commented Nov 20, 2024 •

edited

Loading

chrischdi commented Nov 20, 2024 •

edited

Loading

chrischdi commented Nov 20, 2024 •

edited

Loading