Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-18534: monitor test fix to wait before connecting to a non-existent dns on PowerVS and IBMCloud platforms #28739

Merged
merged 2 commits into from May 7, 2024

Conversation

jcho02
Copy link
Contributor

@jcho02 jcho02 commented Apr 24, 2024

OCPBUGS-18534
Monitor test changes to fix loadbalancer issue in PowerVS platforms.

Test checks if platform is either a PowerVS or IBMCloud cluster and then runs function checkHostnameReady where it debugs the master node of the cluster and verifies the hostname is active before hitting through the load balancer

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 24, 2024
@openshift-ci-robot
Copy link

@jcho02: This pull request references Jira Issue OCPBUGS-18534, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

OCPBUGS-18534
Monitor test changes to fix loadbalancer issue in PowerVS platforms.

Test checks if platform is either a PowerVS or IBMCloud cluster and then runs function checkHostnameReady where it debugs the master node of the cluster and verifies the hostname is active before hitting through the load balancer

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 24, 2024
Copy link
Contributor

openshift-ci bot commented Apr 24, 2024

Hi @jcho02. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from p0lyn0mial and stlaz April 24, 2024 19:47
@Shilpa-Gokul
Copy link
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 25, 2024
@Shilpa-Gokul
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 25, 2024
@Shilpa-Gokul
Copy link
Contributor

/test e2e-metal-ipi-ovn-ipv6

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2024
@Shilpa-Gokul
Copy link
Contributor

/test e2e-gcp-ovn

@Shilpa-Gokul
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2024
@openshift-trt-bot
Copy link

Job Failure Risk Analysis for sha: 377e5dc

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-sdn IncompleteTests
Tests for this run (18) are below the historical average (1107): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@prb112
Copy link
Contributor

prb112 commented Apr 30, 2024

/retest-required

@prb112
Copy link
Contributor

prb112 commented Apr 30, 2024

/lgtm

cmd.Stdout = out
cmd.Stderr = errOut

if err := cmd.Run(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super familiar with the run command, but it looks like the expectation is for this to run synchronously. My suggestion would be to set an explicit timeout to cover a case where scheduling the debug pod gets stuck in limbo for whatever reason. If we have to cancel the command upon said timeout, it should be safe to fail out, since being unable to schedule for whatever reason is something of concern. (But there should probably an error message indicating what happened).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh great - point, let me coordinate with Jason

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

@jaypoulz
Copy link
Contributor

Overall I'm happy with this patch. I left a note about a potential concern we may want to address. In order for this to be approved, you'll need someone to approve it from this list: https://github.com/openshift/origin/blob/master/OWNERS#L12-L18.

If he has time, @deads2k has helped provide very helpful feedback for our changes to disruption tests before. Otherwise, consider asking in slack for a review from #forum-ocp-release-oversight. I think a lot of the folks on this list are on the TRT team.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 1, 2024
@prb112
Copy link
Contributor

prb112 commented May 1, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2024
@prb112
Copy link
Contributor

prb112 commented May 1, 2024

If he has time, @deads2k has helped provide very helpful feedback for our changes to disruption tests before. Otherwise, consider asking in slack for a review from #forum-ocp-release-oversight. I think a lot of the folks on this list are on the TRT team.

Done

@prb112
Copy link
Contributor

prb112 commented May 3, 2024

/retest-required

}
}
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not trying to be difficult. Looks like Poll is deprecated. Also I think if you pass in the context from the StartCollection call and use it in PollUntilContextTimeout you won't block if that ctx gets canceled and you can set an explicit max time for your checkHostnameReady function to block for. I haven't tried it but the patch is attached if you want to.

func checkHostnameReady(ctx context.Context, tcpService *corev1.Service, nodeTgt string) error {
	oc := exutil.NewCLIForMonitorTest(tcpService.GetObjectMeta().GetNamespace())

	err := wait.PollUntilContextTimeout(ctx, 15*time.Second, 60*time.Minute, true, func(ctx context.Context) (bool, error) {
		logrus.Debug("Checking load balancer host is active")
		stdOut, _, err := oc.AsAdmin().WithoutNamespace().RunInMonitorTest("debug").Args(nodeTgt, "--", "dig", "+short", tcpService.Status.LoadBalancer.Ingress[0].Hostname).Outputs()
		if err != nil {
			return false, nil
		}
		output := strings.TrimSpace(stdOut)
		if output != "" {
			return true, nil
		}

		logrus.Debug("Waiting for the load balancer to become active")
		return false, nil
	})

	return err
}

origin_PollUntilContextTimeout.patch.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - we can look at it

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 6, 2024
err error
)

wait.PollUntilContextTimeout(ctx, 15*time.Second, 60*time.Minute, true, func(ctx context.Context) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking, this is a conscious choice not to capture the err returned from PollUntilContextTime ( could be context closed or timeout, etc.) and only return the last err, if any, returned from oc.AsAdmin call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it was intentional to not capture the err

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit I'm curious. So this allows for a path where PollUntilContextTimeout returns an error but checkHostNameReady does not and that will allow StartCollection to proceed and presumably error out down the line?

Copy link
Contributor Author

@jcho02 jcho02 May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it will error out here right after the checkHostNameReady function call since the service LoadBalancer is not verified

fmt.Fprintf(os.Stderr, "hitting pods through the service's LoadBalancer\n")
timeout := 10 * time.Minute

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, thanks for the explanation!

// NewCLIForMonitorTest initializes the CLI and Kube framework helpers
// without a namespace. Should be called outside of a Ginkgo .It()
// function.
func NewCLIForMonitorTest(project string) *CLI {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you needed new non ginko versions of this and Run to support running in the monitor test environment?

Thanks for setting those up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

@neisw
Copy link
Contributor

neisw commented May 6, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2024
Copy link
Contributor

openshift-ci bot commented May 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcho02, neisw, prb112, Shilpa-Gokul

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2024
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 15c3521 and 2 for PR HEAD 60e8717 in total

@neisw
Copy link
Contributor

neisw commented May 6, 2024

/override ci/prow/e2e-gcp-ovn-builds

e2e-gcp-ovn-builds is known to be failing outside of this pr and is being investigated

Copy link
Contributor

openshift-ci bot commented May 6, 2024

@neisw: Overrode contexts on behalf of neisw: ci/prow/e2e-gcp-ovn-builds

In response to this:

/override ci/prow/e2e-gcp-ovn-builds

e2e-gcp-ovn-builds is known to be failing outside of this pr and is being investigated

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 75f7e06 and 1 for PR HEAD 60e8717 in total

@prb112
Copy link
Contributor

prb112 commented May 7, 2024

/retest-required

Copy link
Contributor

openshift-ci bot commented May 7, 2024

@jcho02: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node-serial 60e8717 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn-single-node-upgrade 60e8717 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-openstack-ovn 60e8717 link false /test e2e-openstack-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 97cb6b3 into openshift:master May 7, 2024
20 of 23 checks passed
@openshift-ci-robot
Copy link

@jcho02: Jira Issue OCPBUGS-18534: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-18534 has been moved to the MODIFIED state.

In response to this:

OCPBUGS-18534
Monitor test changes to fix loadbalancer issue in PowerVS platforms.

Test checks if platform is either a PowerVS or IBMCloud cluster and then runs function checkHostnameReady where it debugs the master node of the cluster and verifies the hostname is active before hitting through the load balancer

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@prb112
Copy link
Contributor

prb112 commented May 7, 2024

/cherry-pick release-4.14 release-4.15

@openshift-cherrypick-robot

@prb112: new pull request created: #28775

In response to this:

/cherry-pick release-4.14 release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@prb112
Copy link
Contributor

prb112 commented May 7, 2024

/cherry-pick release-4.15

@openshift-cherrypick-robot

@prb112: new pull request created: #28776

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-tests-container-v4.17.0-202405071820.p0.g97cb6b3.assembly.stream.el9 for distgit openshift-enterprise-tests.
All builds following this will include this PR.

@openshift-merge-robot
Copy link
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-05-08-222442

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants