flakes in clusterctl upgrade tests #11133
I'll have a look at the "Failed to create kind cluster" issue as I already noticed something similar on my own Kind setup and I think it's not isolated: kubernetes-sigs/kind#3554 - I guess it's something to fix upstream. EDIT: It seems to be an issue with inodes:
|
That sounds very suspicious regarding that. Maybe it would be a good start here to collect data about the actual values used :-) |
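As an aside, here is a minimal sketch (not part of the test suite, all names illustrative) of how such data could be collected on a node: it reads the inotify limits from /proc/sys and counts the inotify instances currently held open across processes.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	// Print the configured inotify limits (the values the tune-sysctls
	// DaemonSet is supposed to raise).
	for _, name := range []string{"max_user_watches", "max_user_instances"} {
		b, err := os.ReadFile("/proc/sys/fs/inotify/" + name)
		if err != nil {
			fmt.Printf("fs.inotify.%s: %v\n", name, err)
			continue
		}
		fmt.Printf("fs.inotify.%s = %s", name, b) // file content already ends with '\n'
	}

	// Count inotify instances currently open across all processes; this needs
	// permission to read other processes' /proc/<pid>/fd directories.
	instances := 0
	entries, _ := os.ReadDir("/proc")
	for _, e := range entries {
		if _, err := strconv.Atoi(e.Name()); err != nil {
			continue // not a PID directory
		}
		fdDir := filepath.Join("/proc", e.Name(), "fd")
		fds, err := os.ReadDir(fdDir)
		if err != nil {
			continue
		}
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join(fdDir, fd.Name()))
			if err == nil && strings.Contains(target, "anon_inode:inotify") {
				instances++
			}
		}
	}
	fmt.Println("open inotify instances:", instances)
}
```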
I don't know if we're running https://github.com/kubernetes/k8s.io/blob/3f2c06a3c547765e21dce65d0adcb1144a93b518/infra/aws/terraform/prow-build-cluster/resources/kube-system/tune-sysctls_daemonset.yaml#L4 there or not. Also, perhaps something else on the cluster is using a lot of them. |
I confirm the daemonset runs on the EKS cluster. |
Thanks folks for confirming that the DaemonSet is correctly setting the sysctl parameters - so the error might be elsewhere. I noticed something else while reading the logs of a failing test:
While on a non-failing setup:
We can see that the
|
It's possible? This part shouldn't really take long though... I suspect that would be a noisy-neighbor problem on the EKS cluster (I/O?). It doesn't explain the inotify-exhaustion-like failures, though. |
We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h. We thought it was a nice way to save us time and the community money. Maybe we have to roll that back. |
Do you remember when this change was applied? Those Kind failures seem to have started by the end of August. |
That makes sense. Ordinarily this part shouldn't take long; it doesn't need to fetch anything over the network and it should be pretty fast. But in a resource-starved environment it might take too long. In that environment I would also expect Kubernetes to be unstable though; api-server/etcd will be timing out if we make it that far. |
Carrying over some updates from the split-off issue -- we have seen great improvements in the flakiness of the e2e tests after reverting the concurrency increase:
the updated plan/guidance for the rest of this release cycle re these e2e flakes is here: #11209 (comment) |
@cahillsf What is the current state? |
Sorry @sbueringer, I was out on PTO. I just updated the description with the current status (one net new flake pattern). I did some investigation on the most frequent:
I was not able to find much indication in the logs about what might be going wrong here. Could be some issue with the kube components, but it seems we are not collecting these in the artifacts for this test. Any thoughts as to how to better troubleshoot this? Maybe I'll open a PR to try to collect the |
@cahillsf No worries, welcome back :)
Not sure if you meant it that way, but I think the problem is not that CC is not getting reconciled. The problem is that we cannot deploy the DockerClusterTemplate:
The consequence of that is then that the ClusterClass cannot be reconciled. The question is why the CAPD webhook is not up at this time. Can we do a quick PR to drop this in docker.yaml?
- old: "--leader-elect"
  new: "--leader-elect\n - --logging-format=json"
Using JSON logging for e2e tests was a nice idea at the time, but overall it just makes troubleshooting without Loki/Grafana (which is what we do most of the time) pretty painful. |
Yes, thanks for clarifying.
Sure thing, I'll add this replacement for all of the v1.8 providers -- sound good? |
I would drop them entirely from docker.yaml |
Oh, I misread on the first pass (I see you've already approved, but dropping the PR here so it gets linked: #11318) |
Okay, so what I was looking for was basically a way to correlate when the test failed and when CAPD was coming up, to then try to figure out what's going on. This should be easier with "regular" timestamps now. In general I would have assumed:
|
It does appear to: the management cluster is created with InitManagementClusterAndWatchControllerLogs, which waits for all deployments to be available here by calling WaitForDeploymentsAvailable. The logic seems pretty straightforward and sound here? Not sure if anything sticks out to you.
It does not; CreateOrUpdate is called on the
Can open a PR to add support for some retry option to be passed into CreateOrUpdate and see if that helps |
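For reference, a simplified sketch of the kind of check WaitForDeploymentsAvailable boils down to, assuming a controller-runtime client; this is not the actual framework code, just the Available-condition polling it performs.

```go
package e2e

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDeploymentAvailable polls until the Deployment reports the
// Available condition as True. Simplified stand-in for the framework helper.
func waitForDeploymentAvailable(ctx context.Context, c client.Client, key client.ObjectKey) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 3*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			d := &appsv1.Deployment{}
			if err := c.Get(ctx, key, d); err != nil {
				return false, nil // not created yet or transient error: keep polling
			}
			for _, cond := range d.Status.Conditions {
				if cond.Type == appsv1.DeploymentAvailable && cond.Status == corev1.ConditionTrue {
					return true, nil
				}
			}
			return false, nil
		})
}
```

The caveat raised below is that Available/Ready only says the pod is running; it does not guarantee the Service endpoint and iptables rules for the webhook are already in place.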
Probably fine. I assume waiting for
Ah okay, I thought we might have a retry. But it's probably a good thing; adding a retry there might hide some problems. I see potentially a gap between the Deployment being Available / Pod being Ready and kube-proxy updating iptables so the webhook is actually reachable. I think it would be important to clarify the timeline:
Let's try to get those timestamps. If there is only a gap of 1-2 seconds we might want to retry CreateOrUpdate with dry-run until all webhooks work and then run CreateOrUpdate once without dry-run. |
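A minimal sketch of that dry-run idea, assuming a controller-runtime client; probeWebhooks and the 2-minute timeout are illustrative, not the actual test framework API. Server-side dry-run goes through the full admission chain, so it keeps failing with "connection refused" until the CAPD webhook is actually reachable.

```go
package e2e

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// probeWebhooks retries a server-side dry-run Create until admission
// (including the conversion/validation webhooks) succeeds, so the subsequent
// real CreateOrUpdate is unlikely to hit a cold webhook.
func probeWebhooks(ctx context.Context, c client.Client, obj client.Object) error {
	return wait.PollUntilContextTimeout(ctx, 1*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			// Work on a copy so the original object is not mutated by defaulting.
			probe := obj.DeepCopyObject().(client.Object)
			err := c.Create(ctx, probe, client.DryRunAll)
			if err == nil || apierrors.IsAlreadyExists(err) {
				return true, nil
			}
			// Webhook not reachable yet (connection refused, EOF, ...): keep polling.
			return false, nil
		})
}
```

Once this returns, the real CreateOrUpdate could run once without dry-run, as suggested above.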
Taking this same failure example:
[FAILED] Expected success, but got an error:
...
In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:451 @ 09/27/24 11:10:05.875
but the CAPD logs are reflecting the webhook server as up at:
I0927 11:09:59.835554 1 server.go:242] "Serving webhook server" logger="controller-runtime.webhook" host="" port=9443
So there are 6 seconds between the server coming up and the call failing. Not sure if we want to give the dry-run approach a shot anyway?
|
I'm fine with giving the dry-run approach a shot. I think it's the easiest way for us to verify that the webhooks of all controllers are actually reachable. I don't expect that we'll see much in kube-proxy logs (but could be). I'm fine with also collecting logs from kube-system. Could be useful in general (e.g. kube-apiserver logs could be quite useful) |
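For the "also collect logs from kube-system" part, a rough sketch with client-go (not the framework's artifact-collection code; dumpKubeSystemLogs and outDir are illustrative):

```go
package e2e

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpKubeSystemLogs writes one log file per container in kube-system
// (kube-apiserver, kube-proxy, etc.) into outDir.
func dumpKubeSystemLogs(ctx context.Context, cs kubernetes.Interface, outDir string) error {
	pods, err := cs.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			raw, err := cs.CoreV1().Pods("kube-system").
				GetLogs(pod.Name, &corev1.PodLogOptions{Container: c.Name}).
				DoRaw(ctx)
			if err != nil {
				continue // e.g. container not started yet
			}
			file := filepath.Join(outDir, fmt.Sprintf("%s-%s.log", pod.Name, c.Name))
			if err := os.WriteFile(file, raw, 0o644); err != nil {
				return err
			}
		}
	}
	return nil
}
```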
Circling back to this @sbueringer: now the most frequent flake (#11133 (comment)) is only occurring on
so it seems this dry-run approach worked (last flake on |
[I'm still lurking here but there's been a lot going on and it seems like you all are making progress 😅] |
No conclusions, but adding some findings re:
Taking one specific example (have observed similar patterns in other failures): the test fails while migrating the CRs in the
Note: the CRs are migrated before any changes are rolled out to the providers.
Looking at the CAPD controller logs, we can see they start filling up with these errors shortly after the upgrade command is run:
The migration eventually times out with this line:
Before the log stream cuts off, we can see these logs in the CAPD manager:
Not exactly sure where to go next with this -- this |
Absolutely |
@cahillsf probably related: #10522. The best idea I have is rerunning clusterctl upgrade if we hit this error (but of course only for clusterctl v0.4.x, because we can't fix it anymore). @chrischdi wdyt? |
This is testing the upgrade from v0.4.8 to v1.6.8, right? So the upgrade is done using clusterctl v1.6.8, so it's about rerunning on v1.6.8. Retrying clusterctl upgrade may help and is maybe the easiest 👍. Just to take a step back:
Another approach, if retrying clusterctl upgrade is not good enough, may be to run the cert-manager upgrade out-of-band (and upgrade to the same or a more recent version?!) plus the CR migration directly, including retries. Afterwards, run the normal clusterctl upgrade as before. Note: and only for upgrades to v1.6.x :-) Off-topic: shoutout to @cahillsf and others involved, awesome progress here! ❤️ |
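A hedged sketch of the "just rerun clusterctl upgrade when we hit this" idea; upgradeFn stands in for whatever the e2e framework calls to perform the upgrade, and the error-message match is an assumption about what the migration timeout looks like.

```go
package e2e

import (
	"context"
	"strings"
	"time"
)

// upgradeWithRetry reruns the upgrade a few times, but only when the failure
// looks like the known cert-manager / CR-migration timeout; anything else
// fails fast so real regressions are not hidden.
func upgradeWithRetry(ctx context.Context, attempts int, upgradeFn func(context.Context) error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = upgradeFn(ctx); lastErr == nil {
			return nil
		}
		if !strings.Contains(lastErr.Error(), "timed out waiting for the condition") {
			return lastErr
		}
		time.Sleep(10 * time.Second)
	}
	return lastErr
}
```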
Yeah, just had this thought. The clusterctl version that we can't fix shouldn't be v0.4 but a newer one |
@chrischdi Can we simply backport the longer timeout that you introduced in #10513 into release-1.7 and release-1.6? Do we know if that issue also occurs in tests that use clusterctl v1.5? If not I would consider doing one additional v1.6.x release and then we're good (we still have ProwJobs for release-1.6) |
+1 on backporting that and also cutting a v1.6.9 for this. But v1.5 also seems to be affected: |
so maybe:
|
summarized by @chrischdi 🙇
According to the aggregated failures of the last two weeks, we still have some flakiness in our clusterctl upgrade tests.
- 3 Failures: Internal error occurred: failed calling webhook [...] connect: connection refused
- 2 Failures: x509: certificate signed by unknown authority
- 2 Failures: failed to run clusterctl version:
- 5 Failures: Timed out waiting for Machine Deployment clusterctl-upgrade/clusterctl-upgrade-workload-... to have 2 replicas
- 2 Failures: Timed out waiting for Cluster clusterctl-upgrade/clusterctl-upgrade-workload-... to provision
  - resolved with: 🌱 test/e2e: decrease concurrency #11220
- 36 Failures: Timed out waiting for all Machines to exist
  - split off into: clusterctl upgrade "Timed out waiting for all Machines to exist" #11209
- 16 Failures: Failed to create kind cluster
  - resolved with: 🌱 test/e2e: decrease concurrency #11220
Link to check if messages changed or we have new flakes on clusterctl upgrade tests: here
/kind flake