Retries in e2e ApplyClusterTemplateAndWait race with the Cluster topology controller #13264

@nojnhuh

What steps did you take and what happened?

CAPZ's e2e tests call clusterctl.Init and then immediately call ApplyClusterTemplateAndWait. The webhooks for the newly added providers often aren't ready by the time the first object is created, so the first attempt fails with an error like this:

Internal error occurred: failed calling webhook "mrke2controlplanetemplate.kb.io": failed to call webhook: Post "https://rke2-control-plane-webhook-service.rke2-control-plane-system.svc:443/mutate-controlplane-cluster-x-k8s-io-v1beta1-rke2controlplanetemplate?timeout=10s": dial tcp 10.96.167.34:443: connect: connection refused

I'd expect the next attempt after the webhooks become ready to succeed. Instead, every subsequent attempt fails with the following error:

admission webhook "validation.cluster.cluster.x-k8s.io" denied the request: Cluster.cluster.x-k8s.io "capz-e2e-fxe4ui-cc" is invalid: [spec.infrastructureRef: Forbidden: cannot be removed, spec.controlPlaneRef: Forbidden: cannot be removed

This is the full sequence of events:

  1. The first CreateOrUpdate attempt in ApplyClusterTemplateAndWait fails to create an RKE2ControlPlaneTemplate because the webhook isn't ready.
  2. CreateOrUpdate presses on and, still in that first attempt, creates the Cluster with a spec.topology.
  3. In parallel, the topology controller adds a controlPlaneRef and an infrastructureRef to the Cluster while ApplyClusterTemplateAndWait retries CreateOrUpdate.
  4. When the topology controller makes its changes before the second CreateOrUpdate attempt, that attempt tries to HTTP PUT the entire Cluster again, which would undo the topology controller's changes. The Cluster validating webhook correctly identifies this as an invalid update and rejects it.
  5. Repeat for all future retries of CreateOrUpdate.

This is only one example. The issue affects the general case where CreateOrUpdate creates an object, that object is then updated monotonically (by a controller, webhook, etc.), CreateOrUpdate retries in response to a failure unrelated to that object, and the retry fails because reversing the monotonic update is invalid.

What did you expect to happen?

Retries succeed when the initial error condition is resolved.

Cluster API version

v1.11.5

Kubernetes version

No response

Anything else you would like to add?

I've experimented locally with using server-side apply within CreateOrUpdate, and that fixes this specific case. I haven't audited all other usages of that method, though.
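For readers unfamiliar with why server-side apply helps here, below is a minimal sketch of the idea, assuming a toy single-owner model of field ownership. It is not the local change mentioned above and not the real server-side apply implementation: `object`, `apply`, `demo`, the manager names, and the flat field paths are hypothetical. The point it illustrates is that an apply only writes the fields the applier declares, so fields set by another manager survive a retry.

```go
package main

import "fmt"

// object is a toy resource: flat field paths mapped to values, plus a record
// of which manager owns each field. Real server-side apply also supports
// shared ownership and nested structures, which this sketch omits.
type object struct {
	fields map[string]string
	owners map[string]string
}

func newObject() *object {
	return &object{fields: map[string]string{}, owners: map[string]string{}}
}

// apply merges the declared intent into the object on behalf of manager.
// Fields owned by other managers and absent from intent are left untouched.
func (o *object) apply(manager string, intent map[string]string) error {
	// Reject attempts to change a value that another manager owns.
	for path, value := range intent {
		if owner, ok := o.owners[path]; ok && owner != manager && o.fields[path] != value {
			return fmt.Errorf("conflict on %s: owned by %s", path, owner)
		}
	}
	// Drop fields this manager owned before but no longer declares.
	for path, owner := range o.owners {
		if _, declared := intent[path]; owner == manager && !declared {
			delete(o.fields, path)
			delete(o.owners, path)
		}
	}
	// Write declared fields; take ownership only of fields nobody owns yet.
	for path, value := range intent {
		o.fields[path] = value
		if _, owned := o.owners[path]; !owned {
			o.owners[path] = manager
		}
	}
	return nil
}

// demo replays the scenario from this issue against the toy model.
func demo() (map[string]string, error) {
	cluster := newObject()
	template := map[string]string{"spec.topology.class": "example-class"}

	// First attempt creates the Cluster with only a topology.
	if err := cluster.apply("e2e-framework", template); err != nil {
		return nil, err
	}
	// The topology controller adds the refs under its own manager name.
	refs := map[string]string{
		"spec.controlPlaneRef":   "example-control-plane",
		"spec.infrastructureRef": "example-infra-cluster",
	}
	if err := cluster.apply("topology-controller", refs); err != nil {
		return nil, err
	}
	// The retry declares the same template and never mentions the refs.
	if err := cluster.apply("e2e-framework", template); err != nil {
		return nil, err
	}
	return cluster.fields, nil
}

func main() {
	fields, err := demo()
	fmt.Println(fields, err)
}
```

Because the retry never declares the refs and never owned them, they remain in place, and there is nothing for the validating webhook to reject.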

Label(s) to be applied

/kind bug
/area e2e-testing

    Labels

    area/e2e-testing: Issues or PRs related to e2e testing
    kind/bug: Categorizes issue or PR as related to a bug.
    needs-priority: Indicates an issue lacks a `priority/foo` label and requires one.
    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.