Description
What steps did you take and what happened?
CAPZ's e2e tests call clusterctl.Init immediately followed by ApplyClusterTemplateAndWait. The webhooks for the added providers often aren't ready by the time the first object is created, so the first attempt fails with an error:
Internal error occurred: failed calling webhook "mrke2controlplanetemplate.kb.io": failed to call webhook: Post "https://rke2-control-plane-webhook-service.rke2-control-plane-system.svc:443/mutate-controlplane-cluster-x-k8s-io-v1beta1-rke2controlplanetemplate?timeout=10s": dial tcp 10.96.167.34:443: connect: connection refused
I'd expect the next try after the webhooks become ready to succeed. Instead, I see the same error for every subsequent attempt:
admission webhook "validation.cluster.cluster.x-k8s.io" denied the request: Cluster.cluster.x-k8s.io "capz-e2e-fxe4ui-cc" is invalid: [spec.infrastructureRef: Forbidden: cannot be removed, spec.controlPlaneRef: Forbidden: cannot be removed
This is the full sequence of events:
- The first attempt in ApplyClusterTemplateAndWait to CreateOrUpdate fails to create an RKE2ControlPlaneTemplate because the webhook isn't ready.
- CreateOrUpdate presses on, creating the Cluster with a spec.topology, still in that first attempt.
- In parallel, the topology controller adds a controlPlaneRef and infrastructureRef to the Cluster, and ApplyClusterTemplateAndWait tries CreateOrUpdate again.
- When the topology controller makes its changes before the second CreateOrUpdate attempt, that second attempt tries to HTTP PUT the entire Cluster again, undoing the topology controller's changes. The Cluster validating webhook correctly identifies this as an invalid update and rejects the change.
- Repeat for all future retries of CreateOrUpdate.
This is only one example. The issue affects the general case where CreateOrUpdate creates an object, that object is then updated monotonically (by a controller, webhook, etc.), and CreateOrUpdate retries in response to a failure unrelated to that object. The retry fails because reversing the monotonic update is invalid.
What did you expect to happen?
Retries succeed when the initial error condition is resolved.
Cluster API version
v1.11.5
Kubernetes version
No response
Anything else you would like to add?
I've experimented locally with using server-side apply within CreateOrUpdate and that fixes this specific case. I didn't audit all other usage of that method though.
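As a rough illustration of why an apply-style update avoids the problem, here is a toy model of per-field ownership. It is not my local change and not the real server-side apply algorithm: `Object`, `Apply`, and the manager names are hypothetical, and unlike the real thing this sketch does not detect conflicts between managers. With controller-runtime the real call would roughly be a `Patch` using `client.Apply` and a `client.FieldOwner`, but the sketch avoids that dependency so it can run on its own.

```go
package main

import "fmt"

// Object is a toy resource: flat field paths mapped to values, plus a record
// of which manager owns each field.
type Object struct {
	Fields map[string]string
	Owners map[string]string
}

// Apply merges the applied fields on behalf of one manager. Fields this
// manager owned before but no longer sends are removed; fields owned by
// other managers are left untouched. Simplification: a manager that sends a
// field owned by someone else silently takes it over (no conflict check).
func (o *Object) Apply(manager string, applied map[string]string) {
	for path, owner := range o.Owners {
		if owner != manager {
			continue
		}
		if _, stillApplied := applied[path]; !stillApplied {
			delete(o.Fields, path)
			delete(o.Owners, path)
		}
	}
	for path, value := range applied {
		o.Fields[path] = value
		o.Owners[path] = manager
	}
}

func main() {
	obj := &Object{Fields: map[string]string{}, Owners: map[string]string{}}

	// What the test template contains (hypothetical path and value).
	template := map[string]string{"spec.topology.class": "example-class"}

	obj.Apply("e2e-test", template) // first attempt creates the object

	// A second manager adds the refs, as the topology controller does.
	obj.Apply("topology-controller", map[string]string{
		"spec.infrastructureRef": "example-infra",
		"spec.controlPlaneRef":   "example-control-plane",
	})

	obj.Apply("e2e-test", template) // retry with the same template

	// The refs survive the retry because the test never owned them.
	fmt.Println("controlPlaneRef:", obj.Fields["spec.controlPlaneRef"])
	fmt.Println("infrastructureRef:", obj.Fields["spec.infrastructureRef"])
}
```

The point of the sketch is only that the retry no longer expresses "remove the refs", so there is nothing for the validating webhook to reject.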
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.
/area e2e-testing