
🐛: elbv2: wait for LB active state instead of resolving DNS name #5093

Merged (1 commit) on Oct 26, 2024

Conversation

@r4f4 (Contributor) commented Aug 12, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Using DNS name resolution as a way to check that the load balancer is working can cause problems that depend on the host running CAPA. On some systems, DNS resolution can fail because of cached DNS responses with very large TTLs, causing very long provisioning times.

Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left for the clients to do.
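
To illustrate the approach (a minimal sketch, not the exact code in this PR; the helper name, polling interval, and error handling are my own), the readiness check can poll the ELBv2 DescribeLoadBalancers API for the "active" state instead of doing a DNS lookup:

package elb

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

// waitForLBActive is a hypothetical helper: it polls the ELBv2 API until the
// load balancer reports the "active" state instead of resolving its DNS name.
func waitForLBActive(ctx context.Context, client elbv2iface.ELBV2API, arn string) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		out, err := client.DescribeLoadBalancersWithContext(ctx, &elbv2.DescribeLoadBalancersInput{
			LoadBalancerArns: []*string{aws.String(arn)},
		})
		if err != nil {
			return fmt.Errorf("describing load balancer %q: %w", arn, err)
		}
		if len(out.LoadBalancers) > 0 && out.LoadBalancers[0].State != nil {
			switch aws.StringValue(out.LoadBalancers[0].State.Code) {
			case elbv2.LoadBalancerStateEnumActive:
				return nil
			case elbv2.LoadBalancerStateEnumFailed:
				return fmt.Errorf("load balancer %q entered the failed state", arn)
			}
		}
		// Not active yet: wait for the next tick, or give up when the context ends.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}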

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5032

Special notes for your reviewer:

This is an alternative approach to #4976 and #5033.

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

Check for the LB "active" status instead of trying to resolve the DNS name to validate the LB is ready.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 12, 2024
@k8s-ci-robot (Contributor)

Hi @r4f4. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@r4f4 (Contributor, Author) commented Aug 12, 2024

In my local tests I see:

DEBUG I0812 14:31:14.346251     109 loadbalancer.go:115] "Created new network load balancer for apiserver" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" namespace="openshift-cluster-api-guests" name="rdossant-installer-08-txnpz" reconcileID="81772cf4-6773-4126-8231-bdbc23ba84c8" cluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" api-server-lb-name="rdossant-installer-08-txnpz-int"
DEBUG I0812 14:31:14.346269     109 loadbalancer.go:124] "Waiting for LB to become active" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" namespace="openshift-cluster-api-guests" name="rdossant-installer-08-txnpz" reconcileID="81772cf4-6773-4126-8231-bdbc23ba84c8" cluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" api-server-lb-name="rdossant-installer-08-txnpz-int"
DEBUG I0812 14:33:16.918691     109 loadbalancer.go:130] "LB reports active state" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" namespace="openshift-cluster-api-guests" name="rdossant-installer-08-txnpz" reconcileID="81772cf4-6773-4126-8231-bdbc23ba84c8" cluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" api-server-lb-name="rdossant-installer-08-txnpz-int" time="2m2.57235285s"

I'm going to run Openshift e2e CI tests on this change to check for regressions.

/cc @patrickdillon

@k8s-ci-robot (Contributor)

@r4f4: GitHub didn't allow me to request PR reviews from the following users: patrickdillon.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@r4f4 (Contributor, Author) commented Aug 12, 2024

I'm going to run Openshift e2e CI tests on this change to check for regressions.

I don't see any install issues in the e2e tests of openshift/installer#8825

@r4f4 (Contributor, Author) commented Aug 12, 2024

@AndiDog I think this is a good compromise regarding your comment in #4976 (comment) . Doing the DNS name check after waiting for the LB to become "active" would address the issue in #5033 but we've got reports of cases where the DNS name resolution didn't happen until the host DNS server was changed to force a DNS cache refresh.

@nrb (Contributor) commented Aug 16, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 16, 2024
@r4f4 (Contributor, Author) commented Aug 16, 2024

/test ?

@k8s-ci-robot (Contributor)

@r4f4: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@r4f4 (Contributor, Author) commented Aug 16, 2024

/test pull-cluster-api-provider-aws-e2e pull-cluster-api-provider-aws-e2e-conformance

@AndiDog (Contributor) commented Aug 30, 2024

@AndiDog I think this is a good compromise regarding your comment in #4976 (comment) . Doing the DNS name check after waiting for the LB to become "active" would address the issue in #5033 but we've got reports of cases where the DNS name resolution didn't happen until the host DNS server was changed to force a DNS cache refresh.

Yes, this might work. I'm just wondering overall why the DNS check was introduced and what would happen if we remove it completely, i.e. go on even before DNS and LB are ready. Would it just delay cluster readiness because CAP* controllers error out and retry? Would the cluster fail to come up? Did you try such a scenario? Basically I'd like to have some certainty that this change doesn't cause problems in 1 out of 100 cases.

@r4f4 (Contributor, Author) commented Aug 30, 2024

Did you try such a scenario?

I was not expecting such a solution to be accepted by capa devs. Let me try to run the openshift e2e tests with the change. I'll report back.

@r4f4 (Contributor, Author) commented Aug 30, 2024

Would it just delay cluster readiness because CAP* controllers error out and retry? Would the cluster fail to come up? Did you try such a scenario?

For what it's worth, when we were using Terraform for provisioning there was no such LB DNS name check. But the Terraform AWS provider does wait for the LB to become active after creation: https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/elbv2/load_balancer.go#L406-L408
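
For reference, aws-sdk-go v1 also ships a generated waiter for this state; a minimal sketch of the equivalent wait (the function and parameter names here are placeholders, not the provider's or CAPA's code):

package elb

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

// waitUntilActive leans on the SDK's generated waiter, which polls
// DescribeLoadBalancers until the state is "active" (or the waiter times out),
// much like the Terraform provider's post-create wait.
func waitUntilActive(ctx context.Context, client elbv2iface.ELBV2API, arn string) error {
	input := &elbv2.DescribeLoadBalancersInput{
		LoadBalancerArns: []*string{aws.String(arn)},
	}
	if err := client.WaitUntilLoadBalancerAvailableWithContext(ctx, input); err != nil {
		return fmt.Errorf("waiting for load balancer %q to become active: %w", arn, err)
	}
	return nil
}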

@r4f4 (Contributor, Author) commented Aug 30, 2024

Did you try such a scenario?

I was not expecting such a solution to be accepted by capa devs. Let me try to run the openshift e2e tests with the change. I'll report back.

No install failures in 11 CI jobs: openshift/installer#8927.

@AndiDog (Contributor) commented Sep 5, 2024

/assign @kubernetes-sigs/cluster-api-provider-aws-maintainers

for approval

@AndiDog (Contributor) commented Sep 5, 2024

Sorry, posted that PR assignment to the wrong PR 🙄

@r4f4 (Contributor, Author) commented Sep 24, 2024

Update: rebased on top of master so I could rebase openshift/installer#8927 and re-run the Openshift e2e tests.

@r4f4 (Contributor, Author) commented Sep 25, 2024

@AndiDog I haven't seen a single regression in the OCP e2e tests so far. Anything else I can do to move this along?

@patrickdillon

Yes, this might work. I'm just wondering overall why the DNS check was introduced and what would happen if we remove it completely, i.e. go on even before DNS and LB are ready.

Right, Chesterton's Fence says we need to understand it before we tear it down. We discussed the motivation in this Slack thread. There is not complete consensus, but to me the key takeaways/concerns are:

  • The check was introduced to make sure the LB is ready, which this PR does directly through the SDK
  • There are some concerns that this may cause issues with kubeadm or the join that you mentioned in the thread

I want to reiterate that the current DNS check only ensures that the DNS is resolvable from the manager--not the nodes. In our tests the nodes are always able to resolve the DNS quickly, but in some edge cases (like the DNS resolver for a distinguished engineer's local ISP or AWS GovCloud simulators) DNS can be very slow to propagate.

Would it just delay cluster readiness because CAP* controllers error out and retry?

So in our (OpenShift) case removing this check speeds up, rather than delays, cluster readiness, because we no longer block on the manager being able to resolve the DNS name before it is actually needed. According to @vincepri in the thread, CAPI performs a similar check.

Would the cluster fail to come up?

We use a different bootstrapping process so I'm not as familiar with kubeadm as I would like, but the only potential for failure would be if the cluster coming up depends on the manager resolving the LB DNS name.

I would also point out that the current code arbitrarily assumes clusters use the LB DNS name, whereas it should technically be possible to create a DNS record pointing to the LB IPs (NLB IPs are static for their lifetime). That is an edge case, but we are indeed working on exactly that scenario.

Did you try such a scenario? Basically I'd like to have some certainty that this change doesn't cause problems in 1 out of 100 cases.

We are extensively testing this in openshift, but again our bootstrapping & install process has some differences with the standard CAPI flow. So far we have only found that the DNS check causes problems rather than preventing them.

@richardcase (Member)

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@r4f4 (Contributor, Author) commented Oct 1, 2024

STEP: Failed to create CloudFormation stack in attempt 8: failed to create CF stack: failed to wait for AWS CloudFormation stack to be CreateComplete: ResourceNotReady: failed waiting for successful resource state @ 09/30/24 15:30:39.323

Let's try again:
/test pull-cluster-api-provider-aws-e2e

@richardcase (Member)

STEP: Failed to create CloudFormation stack in attempt 8: failed to create CF stack: failed to wait for AWS CloudFormation stack to be CreateComplete: ResourceNotReady: failed waiting for successful resource state @ 09/30/24 15:30:39.323

Let's try again: /test pull-cluster-api-provider-aws-e2e

If it fails again we can see what AWS account it is using and then investigate.

@r4f4 (Contributor, Author) commented Oct 2, 2024

I suspect the timeout failures in the e2e are due to #5131

@jonatasbaldin

Just for more context: I'm trying to deploy an AWS cluster in sa-east-1 from a local kind cluster and I've been stuck on the LB DNS name resolution for 40 minutes already.

Using an online dig tool I can get the IP address, but it isn't reflected on my network yet.

Having another method to check the LB status would be much appreciated.
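
For context, the manager-side check that stalls in cases like this is essentially "block until the LB DNS name resolves from the host running the controller". A minimal sketch of that pattern (not CAPA's exact code; the interval and timeout are made up):

package elb

import (
	"context"
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForDNSResolvable illustrates the old style of check: block until the
// LB's DNS name resolves from the machine running the controller. With a
// stale or slow upstream resolver this can stall for a very long time.
func waitForDNSResolvable(ctx context.Context, dnsName string) error {
	return wait.PollUntilContextTimeout(ctx, 15*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			addrs, err := net.DefaultResolver.LookupHost(ctx, dnsName)
			if err != nil || len(addrs) == 0 {
				// Not resolvable from this host's resolver yet; keep polling.
				return false, nil
			}
			return true, nil
		})
}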

@damdo damdo added this to the v2.7.0 milestone Oct 19, 2024
@damdo (Member) commented Oct 23, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@damdo (Member) commented Oct 23, 2024

Kicked the tests again now that CI seems to be working.
If tests are green, are we happy for this to merge? @r4f4 @richardcase @AndiDog

@r4f4 (Contributor, Author) commented Oct 23, 2024

Update: rebased on top of current main.

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@damdo (Member) commented Oct 23, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo (Member) commented Oct 24, 2024

The EKS e2e failures are not related to this PR, as I can see them in a no-op PR as well here.

@r4f4 (Contributor, Author) commented Oct 25, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase (Member)

Both sets of e2e are passing and this only applies to ELBv2, so I think

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2024
@richardcase (Member)

For lgtm:

/assign damdo
/assign nrb
/assign AndiDog

@damdo (Member) commented Oct 26, 2024

Considering the extensive testing conducted by @r4f4 and team, our CI being happy with this, and the change in general introducing more robust and reliable checking than a DNS-based one
(also, it looks like @AndiDog was happy with it here: #5093 (comment)):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2024
@k8s-ci-robot k8s-ci-robot merged commit 5b4f7c1 into kubernetes-sigs:main Oct 26, 2024
23 checks passed