OCPBUGS-18534: monitor test fix to wait before connecting to a non-existent dns on PowerVS and IBMCloud platforms #28739
Conversation
@jcho02: This pull request references Jira Issue OCPBUGS-18534, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hi @jcho02. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/lgtm
/test e2e-metal-ipi-ovn-ipv6
/test e2e-gcp-ovn
/lgtm
Job Failure Risk Analysis for sha: 377e5dc
/retest-required
/lgtm
```go
	cmd.Stdout = out
	cmd.Stderr = errOut

	if err := cmd.Run(); err != nil {
```
I'm not super familiar with the run command, but it looks like the expectation is for this to run synchronously. My suggestion would be to set an explicit timeout to cover a case where scheduling the debug pod gets stuck in limbo for whatever reason. If we have to cancel the command upon said timeout, it should be safe to fail out, since being unable to schedule for whatever reason is something of concern. (But there should probably be an error message indicating what happened.)
Oh, great point - let me coordinate with Jason.
Addressed in latest commit
Overall I'm happy with this patch. I left a note about a potential concern we may want to address. In order for this to be approved, you'll need someone from this list to approve it: https://github.com/openshift/origin/blob/master/OWNERS#L12-L18. If he has time, @deads2k has provided very helpful feedback on our changes to disruption tests before. Otherwise, consider asking in Slack for a review in #forum-ocp-release-oversight. I think a lot of the folks on that list are on the TRT team.
/lgtm
Done
/retest-required
```go
		}
	}
	return nil
}
```
Not trying to be difficult, but it looks like Poll is deprecated. Also, I think if you pass in the context from the StartCollection call and use it in PollUntilContextTimeout, you won't block if that ctx gets canceled, and you can set an explicit max time for your checkHostnameReady function to block. I haven't tried it, but the patch is attached if you want to use it.
```go
func checkHostnameReady(ctx context.Context, tcpService *corev1.Service, nodeTgt string) error {
	oc := exutil.NewCLIForMonitorTest(tcpService.GetObjectMeta().GetNamespace())
	err := wait.PollUntilContextTimeout(ctx, 15*time.Second, 60*time.Minute, true, func(ctx context.Context) (bool, error) {
		logrus.Debug("Checking load balancer host is active")
		stdOut, _, err := oc.AsAdmin().WithoutNamespace().RunInMonitorTest("debug").Args(nodeTgt, "--", "dig", "+short", tcpService.Status.LoadBalancer.Ingress[0].Hostname).Outputs()
		if err != nil {
			return false, nil
		}
		output := strings.TrimSpace(stdOut)
		if output != "" {
			return true, nil
		}
		logrus.Debug("Waiting for the load balancer to become active")
		return false, nil
	})
	return err
}
```
Sure - we can look at it
Signed-off-by: Jason Cho <[email protected]>
```go
	err error
)

wait.PollUntilContextTimeout(ctx, 15*time.Second, 60*time.Minute, true, func(ctx context.Context) (bool, error) {
```
Just double-checking: is this a conscious choice not to capture the err returned from PollUntilContextTimeout (which could be context closed, timeout, etc.) and to only return the last err, if any, returned from the oc.AsAdmin call?
Yes, it was intentional not to capture the err.
I have to admit I'm curious. So this allows for a path where PollUntilContextTimeout returns an error but checkHostnameReady does not, which lets StartCollection proceed and presumably error out down the line?
Yes, it will error out here right after the checkHostnameReady call, since the service LoadBalancer is not verified:
```go
fmt.Fprintf(os.Stderr, "hitting pods through the service's LoadBalancer\n")
timeout := 10 * time.Minute
```
Ahh, thanks for the explanation!
```go
// NewCLIForMonitorTest initializes the CLI and Kube framework helpers
// without a namespace. Should be called outside of a Ginkgo .It()
// function.
func NewCLIForMonitorTest(project string) *CLI {
```
So you needed new non-Ginkgo versions of this and Run to support running in the monitor test environment?
Thanks for setting those up.
Correct
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jcho02, neisw, prb112, Shilpa-Gokul The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/override ci/prow/e2e-gcp-ovn-builds e2e-gcp-ovn-builds is known to be failing outside of this pr and is being investigated
@neisw: Overrode contexts on behalf of neisw: ci/prow/e2e-gcp-ovn-builds In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required |
@jcho02: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Merged 97cb6b3 into openshift:master
@jcho02: Jira Issue OCPBUGS-18534: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-18534 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cherry-pick release-4.14 release-4.15 |
@prb112: new pull request created: #28775 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.15 |
@prb112: new pull request created: #28776 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[ART PR BUILD NOTIFIER] This PR has been included in build openshift-enterprise-tests-container-v4.17.0-202405071820.p0.g97cb6b3.assembly.stream.el9 for distgit openshift-enterprise-tests.
Fix included in accepted release 4.16.0-0.nightly-2024-05-08-222442 |
OCPBUGS-18534
Monitor test changes to fix a load balancer issue on PowerVS platforms.
The test checks whether the platform is a PowerVS or IBMCloud cluster and, if so, runs checkHostnameReady, which debugs a master node of the cluster and verifies that the load balancer hostname resolves before sending traffic through the load balancer.
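The dig-based readiness check comes down to treating non-empty `dig +short` output as evidence that the load balancer's DNS record resolves. A minimal sketch of that decision (hostnameResolved is an illustrative name, not a function in the PR):

```go
package main

import (
	"fmt"
	"strings"
)

// hostnameResolved interprets the output of `dig +short <hostname>`:
// a non-empty result means the DNS record exists and resolves, so it
// is safe to start hitting pods through the load balancer.
func hostnameResolved(digOutput string) bool {
	return strings.TrimSpace(digOutput) != ""
}

func main() {
	fmt.Println(hostnameResolved("203.0.113.7\n")) // prints "true"
	fmt.Println(hostnameResolved("   \n"))         // prints "false"
}
```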