OCPBUGS-18534: monitor test fix to wait before connecting to a non-existent dns on PowerVS and IBMCloud platforms #28739
Conversation
@jcho02: This pull request references Jira Issue OCPBUGS-18534, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hi @jcho02. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/lgtm
/test e2e-metal-ipi-ovn-ipv6
/test e2e-gcp-ovn
/lgtm
Job Failure Risk Analysis for sha: 377e5dc
/retest-required
/lgtm
```go
	cmd.Stdout = out
	cmd.Stderr = errOut

	if err := cmd.Run(); err != nil {
```
I'm not super familiar with the run command, but it looks like the expectation is for this to run synchronously. My suggestion would be to set an explicit timeout to cover a case where scheduling the debug pod gets stuck in limbo for whatever reason. If we have to cancel the command upon said timeout, it should be safe to fail out, since being unable to schedule for whatever reason is something of concern. (But there should probably be an error message indicating what happened.)
Oh, great point - let me coordinate with Jason.
Addressed in latest commit
Overall I'm happy with this patch. I left a note about a potential concern we may want to address. In order for this to be approved, you'll need someone from this list to approve it: https://github.com/openshift/origin/blob/master/OWNERS#L12-L18. If he has time, @deads2k has provided very helpful feedback on our changes to disruption tests before. Otherwise, consider asking in Slack for a review in #forum-ocp-release-oversight. I think a lot of the folks on that list are on the TRT team.
/lgtm
Done
/retest-required
```go
		}
	}
	return nil
}
```
Not trying to be difficult, but it looks like Poll is deprecated. Also, I think if you pass in the context from the StartCollection call and use it in PollUntilContextTimeout, you won't block if that ctx gets canceled, and you can set an explicit max time for your checkHostnameReady function to block. I haven't tried it, but the patch is attached if you want to use it.
```go
func checkHostnameReady(ctx context.Context, tcpService *corev1.Service, nodeTgt string) error {
	oc := exutil.NewCLIForMonitorTest(tcpService.GetObjectMeta().GetNamespace())
	err := wait.PollUntilContextTimeout(ctx, 15*time.Second, 60*time.Minute, true, func(ctx context.Context) (bool, error) {
		logrus.Debug("Checking load balancer host is active")
		stdOut, _, err := oc.AsAdmin().WithoutNamespace().RunInMonitorTest("debug").Args(nodeTgt, "--", "dig", "+short", tcpService.Status.LoadBalancer.Ingress[0].Hostname).Outputs()
		if err != nil {
			return false, nil
		}
		output := strings.TrimSpace(stdOut)
		if output != "" {
			return true, nil
		}
		logrus.Debug("Waiting for the load balancer to become active")
		return false, nil
	})
	return err
}
```
Sure - we can look at it
Signed-off-by: Jason Cho <[email protected]>
```go
	err error
)

wait.PollUntilContextTimeout(ctx, 15*time.Second, 60*time.Minute, true, func(ctx context.Context) (bool, error) {
```
Just double-checking: is this a conscious choice not to capture the err returned from PollUntilContextTimeout (which could be context closed, timeout, etc.) and to only return the last err, if any, returned from the oc.AsAdmin call?
Yes, it was intentional not to capture the err.
I have to admit I'm curious. So this allows for a path where PollUntilContextTimeout returns an error but checkHostnameReady does not, which lets StartCollection proceed and presumably error out down the line?
Yes, it will error out here right after the checkHostnameReady call, since the service LoadBalancer is not verified:
```go
fmt.Fprintf(os.Stderr, "hitting pods through the service's LoadBalancer\n")
timeout := 10 * time.Minute
```
Ahh, thanks for the explanation!
```go
// NewCLIForMonitorTest initializes the CLI and Kube framework helpers
// without a namespace. Should be called outside of a Ginkgo .It()
// function.
func NewCLIForMonitorTest(project string) *CLI {
```
So you needed new non-Ginkgo versions of this and Run to support running in the monitor test environment?
Thanks for setting those up.
Correct
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jcho02, neisw, prb112, Shilpa-Gokul The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/override ci/prow/e2e-gcp-ovn-builds e2e-gcp-ovn-builds is known to be failing outside of this pr and is being investigated
@neisw: Overrode contexts on behalf of neisw: ci/prow/e2e-gcp-ovn-builds In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required |
@jcho02: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Merged 97cb6b3 into openshift:master
@jcho02: Jira Issue OCPBUGS-18534: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-18534 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cherry-pick release-4.14 release-4.15 |
@prb112: new pull request created: #28775 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.15 |
@prb112: new pull request created: #28776 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[ART PR BUILD NOTIFIER] This PR has been included in build openshift-enterprise-tests-container-v4.17.0-202405071820.p0.g97cb6b3.assembly.stream.el9 for distgit openshift-enterprise-tests.
Fix included in accepted release 4.16.0-0.nightly-2024-05-08-222442 |
OCPBUGS-18534
Monitor test changes to fix a load balancer issue on PowerVS platforms.
The test checks whether the platform is a PowerVS or IBMCloud cluster and, if so, runs checkHostnameReady, which debugs a master node of the cluster and verifies that the load balancer hostname resolves before sending traffic through the load balancer.
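The dig-based readiness check comes down to treating non-empty `dig +short` output as evidence that the load balancer's DNS record resolves. A minimal sketch of that decision (hostnameResolved is an illustrative name, not a function in the PR):

```go
package main

import (
	"fmt"
	"strings"
)

// hostnameResolved interprets the output of `dig +short <hostname>`:
// a non-empty result means the DNS record exists and resolves, so it
// is safe to start hitting pods through the load balancer.
func hostnameResolved(digOutput string) bool {
	return strings.TrimSpace(digOutput) != ""
}

func main() {
	fmt.Println(hostnameResolved("203.0.113.7\n")) // prints "true"
	fmt.Println(hostnameResolved("   \n"))         // prints "false"
}
```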