[release-4.14] OCPBUGS-29400: Run resolv-prepender entirely async #4182

openshift-cherrypick-robot

This is an automated cherry-pick of #4161

/assign mkowalski

Currently the resolv-prepender dispatcher script starts the systemd
service and then waits for it to complete. This can cause the
dispatcher script to time out if the runtimecfg image pull is slow
or if resolv.conf does not get populated in a timely fashion (it's
not entirely clear to me why the latter happens, but it does). That
in turn can cause configure-ovs to time out if a large number of
interfaces on the system trigger the dispatcher script, such as
when there are many VLANs configured.

To avoid this, we can stop waiting for the systemd service in the
dispatcher script. In fact, there's an argument that we shouldn't
wait since we need to be able to handle asynchronous execution
anyway for the slow image pull case (which was the entire reason the
script was split into a service the way it is).
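
As a rough illustration of the difference between waiting and not waiting, here is a minimal sketch that simulates it with a plain background job instead of systemd. The names slow_job, dispatch_sync and dispatch_async are hypothetical and do not come from the actual dispatcher script. With a real unit, `systemctl start --no-block` is the analogous way to queue a start without waiting for it; whether this change uses that flag is not stated here.

```shell
#!/bin/sh
# slow_job stands in for the systemd service (hypothetical name):
# it takes a while, then writes its result to the file given as $1.
slow_job() {
    sleep 2
    echo "resolv.conf updated" > "$1"
}

# Synchronous dispatcher: blocks until the job has finished,
# so a slow job directly delays the caller.
dispatch_sync() {
    slow_job "$1"
}

# Asynchronous dispatcher: starts the job in the background and
# returns immediately; the result shows up later.
dispatch_async() {
    slow_job "$1" &
}
```

The caller of dispatch_async gets control back right away, which is the point of the change, but it also has to cope with the result not being there yet.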

I have found a few possible issues with async execution, however:
* If we start the service with an empty $DHCP6_FQDN_FQDN value and
  then later get a new value for that, we may not correctly apply
  the new value if the service is still running because we only
  ever "systemctl start" the service, which is a no-op if the service
  is already running.
* Similarly, if new IP4/6_DOMAINS values come in on a later
  connection, that may not be reflected in the service either.

Even though these may sound like the same problem, I mention them
separately on purpose because the solutions are different:
* For the DHCP6 case, we can move that logic back into the dispatcher
  script so we will always set the hostname no matter what happens
  with the prepender code. One could argue that this should be in
  its own script anyway since it's largely unrelated to resolv.conf.
* For the domains case, we do need to restart the service since the
  domains are involved in resolv.conf generation. However, we do not
  want to restart the service every time since that may be unnecessary
  and if we restart in the middle of the image pull it could result
  in a corrupt image (the whole thing we were trying to avoid by
  running this as a service in the first place).

  To avoid problems with restarting the service when we don't want to,
  I've added logic that only restarts the service if there are
  changed env values AND the runtimecfg image has already been pulled.
  This should mean the worst case scenario is that we don't properly
  set the domains and resolv.conf is temporarily generated with an
  incorrect search line. This should be resolved the next time any
  event that triggers the dispatcher script happens.
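
The restart rule described above (restart only when the env values changed AND the runtimecfg image has already been pulled) can be sketched as follows. This is an illustration under assumptions, not the code from this change: decide_action, the saved-env file, and the yes/no image flag are all hypothetical names, and how the real script detects a pulled image is not shown here.

```shell
#!/bin/sh
# Hypothetical sketch of the restart decision.
# Arguments (illustrative, not taken from the actual script):
#   $1 - path to a file holding the env values saved by the previous run
#   $2 - the env values computed for the current dispatcher event
#   $3 - "yes" if the runtimecfg image is already present locally, else "no"
decide_action() {
    saved_env_file=$1
    new_env=$2
    image_pulled=$3

    # A missing file means there is no previous run to compare against.
    old_env=""
    if [ -f "$saved_env_file" ]; then
        old_env=$(cat "$saved_env_file")
    fi

    # Restart only when both conditions hold. Otherwise fall back to a
    # plain start, which leaves an already running service untouched
    # and so cannot interrupt an image pull in progress.
    if [ "$new_env" != "$old_env" ] && [ "$image_pulled" = "yes" ]; then
        echo restart
    else
        echo start
    fi
}
```

With changed values but no image yet, the sketch prints "start", which matches the worst case described above: the domains are temporarily stale until a later event triggers the dispatcher again.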
@openshift-ci-robot
Contributor

@openshift-cherrypick-robot: Jira Issue OCPBUGS-28909 has been cloned as Jira Issue OCPBUGS-29400. Will retitle bug to link to clone.

Jira Issue OCPBUGS-28910 has been cloned as Jira Issue OCPBUGS-29401. Will retitle bug to link to clone.
/retitle [release-4.14] OCPBUGS-29400,OCPBUGS-29401: Run resolv-prepender entirely async

In response to this:

This is an automated cherry-pick of #4161

/assign mkowalski

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot changed the title from "[release-4.14] OCPBUGS-28909,OCPBUGS-28910: Run resolv-prepender entirely async" to "[release-4.14] OCPBUGS-29400,OCPBUGS-29401: Run resolv-prepender entirely async" on Feb 13, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Feb 13, 2024
@openshift-ci-robot
Contributor

@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-29400, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.z) matches configured target version for branch (4.14.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-28909 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-28909 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0
  • bug has dependents

Requesting review from QA contact:
/cc @zhaozhanqi

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-29401, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.z) matches configured target version for branch (4.14.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-28910 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-28910 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0
  • bug has dependents

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Feb 13, 2024
@mkowalski
Contributor

/lgtm

@mkowalski
Contributor

/cc @cybertron
4.15 will have some time to stabilize anyway before this ever merges.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 13, 2024
@zhaozhanqi

@cybertron @mkowalski this needs the backport-risk-assessed label to merge

@mkowalski
Contributor

/label backport-risk-assessed
/hold

We need to give it a few weeks to stabilize on 4.15, then we can merge

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 19, 2024
Contributor

openshift-ci bot commented Feb 19, 2024

@mkowalski: Can not set label backport-risk-assessed: Must be member in one of these teams: [openshift-patch-managers]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cybertron
Member

/retitle [release-4.14] OCPBUGS-29400: Run resolv-prepender entirely async
/label backport-risk-assessed

Just removing the second bug as it's not necessary to track the backports in multiple places.

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Mar 1, 2024
@openshift-ci openshift-ci bot changed the title from "[release-4.14] OCPBUGS-29400,OCPBUGS-29401: Run resolv-prepender entirely async" to "[release-4.14] OCPBUGS-29400: Run resolv-prepender entirely async" on Mar 1, 2024
@openshift-ci-robot
Contributor

@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-29400, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.z) matches configured target version for branch (4.14.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-28909 is in the state Closed (Done-Errata), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-28909 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0, 4.15.z
  • bug has dependents

Requesting review from QA contact:
/cc @qiowang721

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested a review from qiowang721 March 1, 2024 22:00
@qiowang721

Built the image via cluster-bot and ran the pre-merge test; it passed.
/label qe-approved
/label cherry-pick-approve

Version:

% oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.ci.test-2024-03-04-024747-ci-ln-4rz3bdb-latest   True        False         37m     Cluster version is 4.14.0-0.ci.test-2024-03-04-024747-ci-ln-4rz3bdb-latest
% oc get csv -n openshift-nmstate
NAME                                              DISPLAY                       VERSION               REPLACES   PHASE
kubernetes-nmstate-operator.4.14.0-202402091039   Kubernetes NMState Operator   4.14.0-202402091039              Succeeded

Steps:

  1. install knmstate operator
  2. apply nncp to create 70 vlans based on bond on one worker node
  desiredState:
    interfaces:
      - description: vlan using bond1
        name: bond1.101
        state: up
        type: vlan
        vlan:
          base-iface: bond1
          id: 101
      - description: vlan using bond1
        name: bond1.102
        state: up
        type: vlan
        vlan:
          base-iface: bond1
          id: 102
      ... ...
  3. reboot the worker node
  4. check the boot time, it's < 2 mins
sh-5.1# systemd-analyze 
Startup finished in 1.438s (kernel) + 3.163s (initrd) + 1min 2.767s (userspace) = 1min 7.369s 
graphical.target reached after 1min 2.751s in userspace.
  5. check the node log, no error message about bringing up connection br-ex
sh-5.1# journalctl -b | grep "NM resolv.conf still empty of nameserver" | wc -l
5
sh-5.1# journalctl -b | grep "Cannot bring up connection br-ex after 10 attempts"
sh-5.1# journalctl -b | grep "configure-ovs exited with error"
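
The log checks above could be wrapped in a small helper along these lines. This is only a sketch: check_boot_log is a hypothetical name, and it assumes the boot log was first saved to a file (for example with journalctl -b redirected to a file). The two patterns are the error messages grepped for above; the "NM resolv.conf still empty of nameserver" message is deliberately not treated as a failure, since it appeared 5 times in a passing run.

```shell
#!/bin/sh
# Hypothetical helper: fail if the saved boot log ($1) contains either of
# the two error messages checked in the steps above.
check_boot_log() {
    log_file=$1
    status=0
    for pattern in \
        "Cannot bring up connection br-ex after 10 attempts" \
        "configure-ovs exited with error"
    do
        # -F matches the pattern as a fixed string, -q suppresses output.
        if grep -qF -- "$pattern" "$log_file"; then
            echo "FAIL: found '$pattern'"
            status=1
        fi
    done
    if [ "$status" -eq 0 ]; then
        echo "PASS"
    fi
    return "$status"
}
```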

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Mar 4, 2024
Contributor

openshift-ci bot commented Mar 4, 2024

@qiowang721: The label(s) /label cherry-pick-approve cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, no-qe, downstream-change-needed, rebase/manual, cluster-config-api-changed, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

@qiowang721

/label cherry-pick-approved

Contributor

openshift-ci bot commented Mar 4, 2024

@qiowang721: Can not set label cherry-pick-approved: Must be member in one of these teams: []

@zhaozhanqi

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Mar 4, 2024
@cybertron
Member

/retest-required

Not relevant to AWS job.

@cybertron
Member

/hold cancel
/assign @yuqi-zhang

This has been in for weeks with no problems reported. I think we're good to move forward.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2024
Contributor

openshift-ci bot commented Apr 1, 2024

@cybertron: GitHub didn't allow me to assign the following users: yuqi-zhang.

Note that only openshift members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Contributor

openshift-ci bot commented Apr 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski, openshift-cherrypick-robot, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 22c9320 and 2 for PR HEAD 85e8466 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ffd28d3 and 1 for PR HEAD 85e8466 in total

Contributor

openshift-ci bot commented Apr 3, 2024

@openshift-cherrypick-robot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                        Commit    Details   Required   Rerun command
ci/prow/okd-scos-e2e-aws-ovn     85e8466   link      false      /test okd-scos-e2e-aws-ovn
ci/prow/e2e-gcp-ovn-rt-upgrade   85e8466   link      false      /test e2e-gcp-ovn-rt-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cybertron
Member

/retest-required

This doesn't affect aws.

@openshift-merge-bot openshift-merge-bot bot merged commit 42218f9 into openshift:release-4.14 Apr 4, 2024
@openshift-ci-robot
Contributor

@openshift-cherrypick-robot: Jira Issue OCPBUGS-29400: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-29400 has been moved to the MODIFIED state.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.14.0-202404041612.p0.g42218f9.assembly.stream.el8 for distgit ose-machine-config-operator.
All builds following this will include this PR.

@cybertron
Member

/cherry-pick release-4.13

@openshift-cherrypick-robot
Author

@cybertron: new pull request created: #4314

@openshift-merge-robot
Contributor

Fix included in accepted release 4.14.0-0.nightly-2024-04-12-112012
