[release-4.13] OCPBUGS-32208: Run resolv-prepender entirely async #4314
Currently the resolv-prepender dispatcher script starts the systemd service and then waits for it to complete. This can cause the dispatcher script to time out if the runtimecfg image pull is slow or if resolv.conf does not get populated in a timely fashion (it's not entirely clear to me why the latter happens, but it does). In turn, this can cause configure-ovs to time out if there are a large number of interfaces on the system triggering the dispatcher script, such as when there are many VLANs configured.

To avoid this, we can stop waiting for the systemd service in the dispatcher script. In fact, there's an argument that we shouldn't wait since we need to be able to handle asynchronous execution anyway for the slow image pull case (which was the entire reason the script was split into a service the way it is).

However, I have found a few possible issues with async execution:

* If we start the service with an empty $DHCP6_FQDN_FQDN value and then later get a new value for that, we may not correctly apply the new value if the service is still running because we only ever "systemctl start" the service, which is a noop if the service is already running.
* Similarly, if new IP4/6_DOMAINS values come in on a later connection, that may not be reflected in the service either.

Even though these may sound like the same problem, I mention them separately on purpose because the solutions are different:

* For the DHCP6 case, we can move that logic back into the dispatcher script so we will always set the hostname no matter what happens with the prepender code. One could argue that this should be in its own script anyway since it's largely unrelated to resolv.conf.
* For the domains case, we do need to restart the service since the domains are involved in resolv.conf generation. However, we do not want to restart the service every time since that may be unnecessary, and if we restart in the middle of the image pull it could result in a corrupt image (the whole thing we were trying to avoid by running this as a service in the first place).

To avoid problems with restarting the service when we don't want to, I've added logic that only restarts the service if there are changed env values AND the runtimecfg image has already been pulled. This should mean the worst case scenario is that we don't properly set the domains and resolv.conf is temporarily generated with an incorrect search line. This should be resolved the next time any event that triggers the dispatcher script happens.
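The restart condition described above can be sketched roughly as follows. This is only an illustration of the decision, not the code from this PR: the function name `choose_systemctl_verb` and its three arguments (previous env values, new env values, and a yes/no flag for whether the runtimecfg image is already present) are hypothetical, and how the real script stores the env values or checks for the image is not shown here.

```shell
# Hypothetical helper: print the systemctl verb the dispatcher should use.
#   $1  env values recorded on the previous run (assumed to be kept somewhere)
#   $2  env values for the current dispatcher event
#   $3  "yes" if the runtimecfg image has already been pulled, otherwise "no"
choose_systemctl_verb() {
    # Restart only when the values changed AND the image pull is finished,
    # so a pull that is still in progress is never interrupted.
    if [ "$1" != "$2" ] && [ "$3" = "yes" ]; then
        echo "restart"
    else
        # "start" is a no-op if the service is already running.
        echo "start"
    fi
}

# Example: the domains changed, but the image is still being pulled.
choose_systemctl_verb "IP4_DOMAINS=example.com" "IP4_DOMAINS=example.org" "no"
```

In the last call the helper falls back to "start", which matches the worst case described above: the new domains are not picked up until a later dispatcher event.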
@openshift-cherrypick-robot: An error was encountered cloning bug for cherrypick for bug OCPBUGS-29400 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details. Full error message.
Post "https://issues.redhat.com/rest/api/2/issue": POST https://issues.redhat.com/rest/api/2/issue giving up after 5 attempt(s)
Please contact an administrator to resolve this issue, then request a bug refresh with In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-metal-ipi
/jira refresh
@qiowang721: This pull request references Jira Issue OCPBUGS-29400, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/retitle [release-4.13] OCPBUGS-32208: Run resolv-prepender entirely async
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-32208, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/jira refresh
@qiowang721: This pull request references Jira Issue OCPBUGS-32208, which is invalid:
Comment In response to this:
/jira refresh
@qiowang721: This pull request references Jira Issue OCPBUGS-32208, which is invalid:
Comment In response to this:
build image via cluster-bot and pre-merge test, passed.
reboot time after applying 70 vlans based on bond:
sh-5.1# systemd-analyze
Startup finished in 1.506s (kernel) + 3.438s (initrd) + 1min 6.555s (userspace) = 1min 11.501s
graphical.target reached after 1min 6.530s in userspace.
and no error message for bring up connection br-ex:
sh-5.1# journalctl -b | grep "NM resolv.conf still empty of nameserver" | wc -l
5
sh-5.1#
sh-5.1# journalctl -b | grep "Cannot bring up connection br-ex after 10 attempts"
sh-5.1#
sh-5.1# journalctl -b | grep "configure-ovs exited with error"
sh-5.1#
/label cherry-pick-approved
/jira refresh
@zhaozhanqi: This pull request references Jira Issue OCPBUGS-32208, which is invalid:
Comment In response to this:
/jira refresh
@cybertron: This pull request references Jira Issue OCPBUGS-32208, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
/retest-required
/cherry-pick release-4.12
@cybertron: once the present PR merges, I will cherry-pick it on top of release-4.12 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required
/test verify
/retest-required
/assign @yuqi-zhang
@cybertron: GitHub didn't allow me to assign the following users: yuqi-zhang. Note that only openshift members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cybertron, openshift-cherrypick-robot, sinnykumari The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@openshift-cherrypick-robot: The following test failed, say
2d7a09c
into
openshift:release-4.13
@openshift-cherrypick-robot: Jira Issue OCPBUGS-32208: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-32208 has been moved to the MODIFIED state. In response to this:
@cybertron: new pull request created: #4364 In response to this:
[ART PR BUILD NOTIFIER] This PR has been included in build ose-machine-config-operator-container-v4.13.0-202405141537.p0.g2d7a09c.assembly.stream.el8 for distgit ose-machine-config-operator.
Fix included in accepted release 4.13.0-0.nightly-2024-05-15-165412
This is an automated cherry-pick of #4182
/assign cybertron