🐛: elbv2: wait for LB active state instead of resolving DNS name #5093
Conversation
Hi @r4f4. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
In my local tests I see:
I'm going to run Openshift e2e CI tests on this change to check for regressions. /cc @patrickdillon
@r4f4: GitHub didn't allow me to request PR reviews from the following users: patrickdillon. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
* Fix for DNS name resolution too early: kubernetes-sigs/cluster-api-provider-aws#5093
I don't see any install issues in the e2e tests of openshift/installer#8825
@AndiDog I think this is a good compromise regarding your comment in #4976 (comment). Doing the DNS name check after waiting for the LB to become "active" would address the issue in #5033, but we've got reports of cases where the DNS name resolution didn't happen until the host DNS server was changed to force a DNS cache refresh.
/ok-to-test
/test ? |
@r4f4: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-cluster-api-provider-aws-e2e pull-cluster-api-provider-aws-e2e-conformance
Yes, this might work. I'm just wondering overall why the DNS check was introduced and what would happen if we removed it completely, i.e. go on even before DNS and LB are ready. Would it just delay cluster readiness because CAP* controllers error out and retry? Would the cluster fail to come up? Did you try such a scenario? Basically I'd like to have some certainty that this change doesn't cause problems in 1 out of 100 cases.
I was not expecting such a solution to be accepted by CAPA devs. Let me try to run the OpenShift e2e tests with the change. I'll report back.
For what it's worth, when we were using Terraform for provisioning there was no such LB DNS name check. But the Terraform AWS provider does wait for the LB to become active after creation: https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/elbv2/load_balancer.go#L406-L408
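For comparison, a minimal sketch of that same wait using the AWS SDK for Go v1, whose ELBv2 waiter polls DescribeLoadBalancers until the state is "active". The ARN and error handling here are illustrative, not lifted from CAPA or the Terraform provider:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))

	// Illustrative ARN; in practice this would come from the load balancer that was just created.
	arn := "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/example/0123456789abcdef"

	// Block until ELBv2 reports the load balancer as available ("active"),
	// the same condition the Terraform provider waits on after creation.
	err := svc.WaitUntilLoadBalancerAvailable(&elbv2.DescribeLoadBalancersInput{
		LoadBalancerArns: []*string{aws.String(arn)},
	})
	if err != nil {
		log.Fatalf("load balancer never became active: %v", err)
	}
	fmt.Println("load balancer is active")
}
```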
No install failures in 11 CI jobs: openshift/installer#8927. |
/assign @kubernetes-sigs/cluster-api-provider-aws-maintainers for approval |
Sorry, posted that PR assignment to the wrong PR 🙄 |
Force-pushed from 665e6e5 to a880668
Update: rebased on top of master so I could rebase openshift/installer#8927 and re-run the Openshift e2e tests.
@AndiDog I haven't seen a single regression in the OCP e2e tests so far. Anything else I can do to move this along?
Right, Chesterton's Fence says we need to understand it before we tear it down. We discussed the motivation in this Slack thread. There is not complete consensus, but to me the key takeaways/concerns are:
I want to reiterate that the current DNS check only ensures that the DNS is resolvable from the manager--not the nodes. In our tests the nodes are always able to resolve the DNS quickly, but in some edge cases (like the DNS resolver for a distinguished engineer's local ISP or AWS GovCloud simulators) DNS can be very slow to propagate.
So in our (OpenShift) case removing this check speeds up, rather than delays, cluster readiness, because we do not block on the manager being able to resolve the DNS name until it is actually needed. According to @vincepri in the thread, CAPI performs a similar check.
We use a different bootstrapping process so I'm not as familiar with kubeadm as I would like, but the only potential for failure would be if the cluster coming up depends on the manager resolving the LB DNS name. I would also point out that the current code arbitrarily assumes clusters use the LB DNS name, whereas it should technically be possible to create a DNS record pointing to the LB IPs (NLB IPs are static for their lifetime). That is an edge case, but we are indeed working on exactly that scenario.
We are extensively testing this in OpenShift, but again our bootstrapping & install process has some differences from the standard CAPI flow. So far we have only found that the DNS check causes problems rather than preventing them.
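For readers without the code open, the check being debated boils down to a resolvability probe run from the manager's host. A rough illustration with made-up names and timeouts, not CAPA's actual implementation:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForDNS retries name resolution using the local resolver, i.e. the
// resolver of whatever host runs the manager, not the workload cluster's
// nodes. A cached negative response with a large TTL can keep this loop
// failing long after the load balancer itself is usable.
func waitForDNS(name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if addrs, err := net.LookupHost(name); err == nil && len(addrs) > 0 {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("%q did not resolve within %s", name, timeout)
}

func main() {
	// Illustrative NLB-style DNS name.
	if err := waitForDNS("example-0123456789abcdef.elb.us-east-1.amazonaws.com", 5*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```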
/test pull-cluster-api-provider-aws-e2e |
Let's try again: |
If it fails again we can see which AWS account it is using and then investigate.
I suspect the timeout failures in the e2e are due to #5131 |
Just for more context, I'm trying to deploy an AWS Cluster on
Trying an online
Having another method to check the LB status would be much appreciated.
/test pull-cluster-api-provider-aws-e2e |
Kicked the tests again now that CI seems to be working again. |
Using DNS name resolution as a way to check that the load balancer is working can cause problems that depend on the host running CAPA. In some systems, DNS resolution can fail due to cached DNS responses with very large TTLs, causing very long provisioning times. Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left for the clients to do.
Force-pushed from a880668 to 32ebcce
Update: rebased on top of current main. /test pull-cluster-api-provider-aws-e2e |
/test pull-cluster-api-provider-aws-e2e-eks |
The EKS e2e failures are not related to this PR, as I can see them in a no-op PR as well here.
/test pull-cluster-api-provider-aws-e2e-eks |
Both sets of e2e are passing and this only applies to ELBv2, so I think /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: richardcase
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
For lgtm: /assign damdo |
Considering the extensive testing conducted by @r4f4 and team, our CI being happy with this, and the change in general introducing more robust and reliable checking than a DNS-based one: /lgtm
What type of PR is this?
/kind bug
What this PR does / why we need it:
Using DNS name resolution as a way to check that the load balancer is working can cause problems that depend on the host running CAPA. In some systems, DNS resolution can fail due to cached DNS responses with very large TTLs, causing very long provisioning times.
Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left for the clients to do.
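A minimal sketch of that state check with the AWS SDK for Go v1; the function name, ARN, and polling interval are illustrative and the PR's actual diff may differ:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

// isLoadBalancerActive asks the ELBv2 API for the load balancer's state
// instead of resolving its DNS name from the manager's host.
func isLoadBalancerActive(svc *elbv2.ELBV2, arn string) (bool, error) {
	out, err := svc.DescribeLoadBalancers(&elbv2.DescribeLoadBalancersInput{
		LoadBalancerArns: []*string{aws.String(arn)},
	})
	if err != nil {
		return false, err
	}
	if len(out.LoadBalancers) == 0 || out.LoadBalancers[0].State == nil {
		return false, nil
	}
	return aws.StringValue(out.LoadBalancers[0].State.Code) == elbv2.LoadBalancerStateEnumActive, nil
}

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))
	arn := "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/example/0123456789abcdef" // illustrative

	// Poll until active; in a controller the reconcile requeue/backoff would take this role.
	for {
		active, err := isLoadBalancerActive(svc, arn)
		if err != nil {
			log.Fatal(err)
		}
		if active {
			fmt.Println("load balancer is active")
			return
		}
		time.Sleep(15 * time.Second)
	}
}
```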
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5032
Special notes for your reviewer:
This is an alternative approach to #4976 and #5033.
Checklist:
Release note: