
🐛: elbv2: wait for LB active state instead of resolving DNS name #5093

Merged (1 commit) on Oct 26, 2024

Conversation

@r4f4 (Contributor) commented Aug 12, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Using DNS name resolution as a way to check that the load balancer is working can cause problems that depend on the host running CAPA. On some systems, DNS resolution can fail because of cached DNS responses with very large TTLs, causing very long provisioning times.

Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left for the clients to do.
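
To illustrate the approach (a minimal sketch, not the exact code in this PR; the helper name, polling interval, and error handling are my own), the readiness check can poll the ELBv2 DescribeLoadBalancers API for the "active" state instead of doing a DNS lookup:

package elb

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

// waitForLBActive is a hypothetical helper: it polls the ELBv2 API until the
// load balancer reports the "active" state instead of resolving its DNS name.
func waitForLBActive(ctx context.Context, client elbv2iface.ELBV2API, arn string) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		out, err := client.DescribeLoadBalancersWithContext(ctx, &elbv2.DescribeLoadBalancersInput{
			LoadBalancerArns: []*string{aws.String(arn)},
		})
		if err != nil {
			return fmt.Errorf("describing load balancer %q: %w", arn, err)
		}
		if len(out.LoadBalancers) > 0 && out.LoadBalancers[0].State != nil {
			switch aws.StringValue(out.LoadBalancers[0].State.Code) {
			case elbv2.LoadBalancerStateEnumActive:
				return nil
			case elbv2.LoadBalancerStateEnumFailed:
				return fmt.Errorf("load balancer %q entered the failed state", arn)
			}
		}
		// Not active yet: wait for the next tick, or give up when the context ends.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}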

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5032

Special notes for your reviewer:

This is an alternative approach to #4976 and #5033.

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

Check for the LB "active" status instead of trying to resolve the DNS name to validate the LB is ready.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 12, 2024
@k8s-ci-robot (Contributor)

Hi @r4f4. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@r4f4 (Contributor, Author) commented Aug 12, 2024

In my local tests I see:

DEBUG I0812 14:31:14.346251     109 loadbalancer.go:115] "Created new network load balancer for apiserver" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" namespace="openshift-cluster-api-guests" name="rdossant-installer-08-txnpz" reconcileID="81772cf4-6773-4126-8231-bdbc23ba84c8" cluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" api-server-lb-name="rdossant-installer-08-txnpz-int"
DEBUG I0812 14:31:14.346269     109 loadbalancer.go:124] "Waiting for LB to become active" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" namespace="openshift-cluster-api-guests" name="rdossant-installer-08-txnpz" reconcileID="81772cf4-6773-4126-8231-bdbc23ba84c8" cluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" api-server-lb-name="rdossant-installer-08-txnpz-int"
DEBUG I0812 14:33:16.918691     109 loadbalancer.go:130] "LB reports active state" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" namespace="openshift-cluster-api-guests" name="rdossant-installer-08-txnpz" reconcileID="81772cf4-6773-4126-8231-bdbc23ba84c8" cluster="openshift-cluster-api-guests/rdossant-installer-08-txnpz" api-server-lb-name="rdossant-installer-08-txnpz-int" time="2m2.57235285s"

I'm going to run Openshift e2e CI tests on this change to check for regressions.

/cc @patrickdillon

@k8s-ci-robot (Contributor)

@r4f4: GitHub didn't allow me to request PR reviews from the following users: patrickdillon.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@r4f4 (Contributor, Author) commented Aug 12, 2024

I'm going to run Openshift e2e CI tests on this change to check for regressions.

I don't see any install issues in the e2e tests of openshift/installer#8825

@r4f4 (Contributor, Author) commented Aug 12, 2024

@AndiDog I think this is a good compromise regarding your comment in #4976 (comment) . Doing the DNS name check after waiting for the LB to become "active" would address the issue in #5033 but we've got reports of cases where the DNS name resolution didn't happen until the host DNS server was changed to force a DNS cache refresh.

@nrb (Contributor) commented Aug 16, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 16, 2024
@r4f4 (Contributor, Author) commented Aug 16, 2024

/test ?

@k8s-ci-robot (Contributor)

@r4f4: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@r4f4 (Contributor, Author) commented Aug 16, 2024

/test pull-cluster-api-provider-aws-e2e pull-cluster-api-provider-aws-e2e-conformance

@AndiDog (Contributor) commented Aug 30, 2024

@AndiDog I think this is a good compromise regarding your comment in #4976 (comment) . Doing the DNS name check after waiting for the LB to become "active" would address the issue in #5033 but we've got reports of cases where the DNS name resolution didn't happen until the host DNS server was changed to force a DNS cache refresh.

Yes, this might work. I'm just wondering overall why the DNS check was introduced and what would happen if we remove it completely, i.e. go on even before DNS and LB are ready. Would it just delay cluster readiness because CAP* controllers error out and retry? Would the cluster fail to come up? Did you try such a scenario? Basically I'd like to have some certainty that this change doesn't cause problems in 1 out of 100 cases.

@r4f4 (Contributor, Author) commented Aug 30, 2024

Did you try such a scenario?

I was not expecting such a solution to be accepted by capa devs. Let me try to run the openshift e2e tests with the change. I'll report back.

@r4f4 (Contributor, Author) commented Aug 30, 2024

Would it just delay cluster readiness because CAP* controllers error out and retry? Would the cluster fail to come up? Did you try such a scenario?

For what it's worth, when we were using Terraform for provisioning there was no such LB DNS name check. But the Terraform AWS provider does wait for the LB to become active after creation: https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/elbv2/load_balancer.go#L406-L408
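
For reference, aws-sdk-go v1 also ships a generated waiter for this state; a minimal sketch of the equivalent wait (the function and parameter names here are placeholders, not the provider's or CAPA's code):

package elb

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

// waitUntilActive leans on the SDK's generated waiter, which polls
// DescribeLoadBalancers until the state is "active" (or the waiter times out),
// much like the Terraform provider's post-create wait.
func waitUntilActive(ctx context.Context, client elbv2iface.ELBV2API, arn string) error {
	input := &elbv2.DescribeLoadBalancersInput{
		LoadBalancerArns: []*string{aws.String(arn)},
	}
	if err := client.WaitUntilLoadBalancerAvailableWithContext(ctx, input); err != nil {
		return fmt.Errorf("waiting for load balancer %q to become active: %w", arn, err)
	}
	return nil
}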

@r4f4 (Contributor, Author) commented Aug 30, 2024

Did you try such a scenario?

I was not expecting such a solution to be accepted by capa devs. Let me try to run the openshift e2e tests with the change. I'll report back.

No install failures in 11 CI jobs: openshift/installer#8927.

@AndiDog (Contributor) commented Sep 5, 2024

/assign @kubernetes-sigs/cluster-api-provider-aws-maintainers

for approval

@AndiDog (Contributor) commented Sep 5, 2024

Sorry, posted that PR assignment to the wrong PR 🙄

@r4f4 (Contributor, Author) commented Sep 24, 2024

Update: rebased on top of master so I could rebase openshift/installer#8927 and re-run the Openshift e2e tests.

@r4f4 (Contributor, Author) commented Sep 25, 2024

@AndiDog I haven't seen a single regression in the OCP e2e tests so far. Anything else I can do to move this along?

@patrickdillon

Yes, this might work. I'm just wondering overall why the DNS check was introduced and what would happen if we remove it completely, i.e. go on even before DNS and LB are ready.

Right, Chesterton's Fence says we need to understand it before we tear it down. We discussed the motivation in this Slack thread. There is not complete consensus, but to me the key takeaways/concerns are:

  • The check was introduced to make sure the LB is ready, which this PR does directly through the SDK
  • There are some concerns that this may cause issues with kubeadm or the join that you mentioned in the thread

I want to reiterate that the current DNS check only ensures that the DNS is resolvable from the manager--not the nodes. In our tests the nodes are always able to resolve the DNS quickly, but in some edge cases (like the DNS resolver for a distinguished engineer's local ISP or AWS GovCloud simulators) DNS can be very slow to propagate.

Would it just delay cluster readiness because CAP* controllers error out and retry?

So in our (OpenShift) case removing this check speeds up, rather than delays, cluster readiness, because we no longer block on the manager being able to resolve the DNS name before it is actually needed. According to @vincepri in the thread, CAPI performs a similar check.

Would the cluster fail to come up?

We use a different bootstrapping process so I'm not as familiar with kubeadm as I would like, but the only potential for failure would be if the cluster coming up depends on the manager resolving the LB DNS name.

I would also point out that the current code arbitrarily assumes clusters use the LB DNS name, whereas it should technically be possible to create a DNS record pointing to the LB IPs (NLB IPs are static for their lifetime). That is an edge case, but we are indeed working on exactly that scenario.

Did you try such a scenario? Basically I'd like to have some certainty that this change doesn't cause problems in 1 out of 100 cases.

We are extensively testing this in openshift, but again our bootstrapping & install process has some differences with the standard CAPI flow. So far we have only found that the DNS check causes problems rather than preventing them.

@richardcase (Member)

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@r4f4 (Contributor, Author) commented Oct 1, 2024

STEP: Failed to create CloudFormation stack in attempt 8: failed to create CF stack: failed to wait for AWS CloudFormation stack to be CreateComplete: ResourceNotReady: failed waiting for successful resource state @ 09/30/24 15:30:39.323

Let's try again:
/test pull-cluster-api-provider-aws-e2e

@richardcase (Member)

STEP: Failed to create CloudFormation stack in attempt 8: failed to create CF stack: failed to wait for AWS CloudFormation stack to be CreateComplete: ResourceNotReady: failed waiting for successful resource state @ 09/30/24 15:30:39.323

Let's try again: /test pull-cluster-api-provider-aws-e2e

If it fails again we can see what AWS account it is using and then investigate.

@r4f4 (Contributor, Author) commented Oct 2, 2024

I suspect the timeout failures in the e2e are due to #5131

@jonatasbaldin

Just for more context: I'm trying to deploy an AWS cluster in sa-east-1 from a local kind cluster and I've been stuck on the LB DNS name resolution for 40 minutes already.

Using an online dig tool I can get the IP address, but it isn't reflected on my network yet.

Having another method to check the LB status would be much appreciated.
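
For context, the manager-side check that stalls in cases like this is essentially "block until the LB DNS name resolves from the host running the controller". A minimal sketch of that pattern (not CAPA's exact code; the interval and timeout are made up):

package elb

import (
	"context"
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForDNSResolvable illustrates the old style of check: block until the
// LB's DNS name resolves from the machine running the controller. With a
// stale or slow upstream resolver this can stall for a very long time.
func waitForDNSResolvable(ctx context.Context, dnsName string) error {
	return wait.PollUntilContextTimeout(ctx, 15*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			addrs, err := net.DefaultResolver.LookupHost(ctx, dnsName)
			if err != nil || len(addrs) == 0 {
				// Not resolvable from this host's resolver yet; keep polling.
				return false, nil
			}
			return true, nil
		})
}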

@damdo damdo added this to the v2.7.0 milestone Oct 19, 2024
@damdo (Member) commented Oct 23, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@damdo (Member) commented Oct 23, 2024

Kicked the tests again now that CI seems to be working.
If tests are green, are we happy for this to merge? @r4f4 @richardcase @AndiDog

@r4f4 (Contributor, Author) commented Oct 23, 2024

Update: rebased on top of current main.

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@damdo (Member) commented Oct 23, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo (Member) commented Oct 24, 2024

The EKS e2e failures are not related to this PR, as I can see them in a no-op PR as well here.

@r4f4 (Contributor, Author) commented Oct 25, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase (Member)

Both sets of e2e are passing and this only applies to ELBv2, so I think

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2024
@richardcase (Member)

For lgtm:

/assign damdo
/assign nrb
/assign AndiDog

@damdo (Member) commented Oct 26, 2024

Considering the extensive testing conducted by @r4f4 and team, our CI being happy with this, and the change in general introducing more robust and reliable checking than a DNS-based one
(also, it looks like @AndiDog was happy with it here: #5093 (comment)):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2024
@k8s-ci-robot k8s-ci-robot merged commit 5b4f7c1 into kubernetes-sigs:main Oct 26, 2024
23 checks passed