TRT-1576: Fail if operator has Available=False unless in upgrade window #28735
base: master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DennisPeriquet

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

This will see if my new exception allows the upgrade job to pass despite the single storage operator replica.
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/272b5a20-0187-11ef-95a0-20b3d6d376a7-0
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

retry because the last one didn't really run
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/61bc6960-0194-11ef-8313-791cce82a878-0
Job Failure Risk Analysis for sha: 63d0936
Force-pushed from 63d0936 to 3014822 (Compare)
Job Failure Risk Analysis for sha: 3014822
Job Failure Risk Analysis for sha: d950634
Job Failure Risk Analysis for sha: 2e4493a
/test unit
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8a3d2950-0627-11ef-99cb-168bfde7d9b7-0
Job Failure Risk Analysis for sha: 80a02e7
/test unit
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6ff37c20-0690-11ef-86e4-c1c128b91d20-0
@DennisPeriquet: This pull request references TRT-1576 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from 696a8b8 to efde445 (Compare)
Job Failure Risk Analysis for sha: efde445
Force-pushed from 657ec8b to 89a3143 (Compare)
/test e2e-agnostic-ovn-cmd
/test verify
/test e2e-aws-ovn-cgroupsv2
Job Failure Risk Analysis for sha: b8aec3c
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4d404fc0-0b34-11ef-921f-8306786e2a9d-0
re: the last /payload with vsphere:

Those two events happened within the upgrade window (but the logs indicate no replicas, which I'm betting is why the test failed).

I'm not clear on why that run has an …

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata.name'
ci-op-6wykcgk2-d2645-7nlts-master-0
ci-op-6wykcgk2-d2645-7nlts-master-1
ci-op-6wykcgk2-d2645-7nlts-master-2
ci-op-6wykcgk2-d2645-7nlts-worker-0-6bdsp
ci-op-6wykcgk2-d2645-7nlts-worker-0-8c5wm
ci-op-6wykcgk2-d2645-7nlts-worker-0-kxfhn

And the cluster was configured for highly-available infrastructure (which includes the registry):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-must-gather/artifacts/must-gather.tar | tar -xOz registry-apps-build02-vmc-ci-openshift-org-ci-op-6wykcgk2-stable-sha256-e7b33149e705570ebcdcebe24c57af8336229175099fb5d53100330fd61015f1/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml | yaml2json | jq -r .status.infrastructureTopology
HighlyAvailable

And yet:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-extra/artifacts/deployments.json | jq -c '.items[] | select(.metadata.name == "image-registry").spec | {replicas, strategy}'
{"replicas":1,"strategy":{"type":"Recreate"}}

I don't think the registry operator should be trying to wake the admin from sleep with an …

[edit: Ah, looks like the 1-replicas may be expected, and the …]
	UpgradeStartedReason  IntervalReason = "UpgradeStarted"
	UpgradeVersionReason  IntervalReason = "UpgradeVersion"
	UpgradeRollbackReason IntervalReason = "UpgradeRollback"
	UpgradeCompleteReason IntervalReason = "UpgradeComplete"
Could you plug these into the spots where they're used in the code? test/e2e/upgrade/upgrade.go looks to be where they're recorded; pkg/monitortestlibrary/platformidentification/upgrade.go is also good to hit, as is pkg/monitortests/node/legacynodemonitortests/kubelet.go.
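For illustration, a minimal sketch of the kind of swap being asked for; the call site below is hypothetical, only the consts come from this PR:

```go
// Hypothetical recording site in test/e2e/upgrade/upgrade.go: instead of
// re-typing the raw string everywhere...
reason := monitorapi.IntervalReason("UpgradeStarted")

// ...reference the shared const, so the spelling cannot drift between
// the code that records the event and the code that consumes it.
reason = monitorapi.UpgradeStartedReason
```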
	// cat e2e-events_20240502-205107.json | jq '.items[] | \
	//   select(.source == "KubeEvent" and .locator.keys.clusterversion? == "cluster") | \
	//   "\(.from) \(.to) \(.message.reason)"'
	func make_standard_upgrade_event_list(events []string) monitorapi.Intervals {
Code style nit: that should be makeStandardUpgradeEventList. Possibly drop the word standard.
Let's not pass in strings that we then parse into timestamps and interval reasons when we're fully in control of the inputs. I would either (a) use the IntervalBuilder to construct a minimal interval, or (b) make a buildUpgradeInterval(reason, timestamp) function that returns an interval.

Likely you don't need this function anymore; your vars below would be something like:

standardEventList := monitorapi.Intervals{
	buildUpgradeInterval(monitorapi.UpgradeCompleteReason, time.Date(2024, 5, 1, 12, 51, 9, 0, time.UTC)),
}
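A sketch of what that helper could look like; the Interval field layout is assumed from the fragments quoted in this diff (event.From, event.To, event.Message.Reason):

```go
// buildUpgradeInterval returns a minimal interval carrying only the
// reason and timestamps that the upgrade-window logic inspects.
func buildUpgradeInterval(reason monitorapi.IntervalReason, at time.Time) monitorapi.Interval {
	return monitorapi.Interval{
		// Field names assumed from usage elsewhere in this PR.
		Message: monitorapi.Message{Reason: reason},
		From:    at,
		To:      at,
	}
}
```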
	eventInterval monitorapi.Interval
}
test1_outside := monitorapi.Interval{ |
Please put all of these inputs with the tests that use them. If they're reused across multiple tests then they would make sense here, but otherwise probably easier to read if they're with the test definition.
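For example, a hedged sketch of the shape being suggested; the interval values here are illustrative, not taken from the PR:

```go
{
	name: "single upgrade window, interval not within",
	args: args{
		eventList: standardEventList,
		// The input lives next to the test that uses it.
		eventInterval: buildUpgradeInterval(monitorapi.UpgradeStartedReason,
			time.Date(2024, 5, 1, 10, 0, 0, 0, time.UTC)),
	},
	want: false,
},
```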
	want bool
}{
	{
		name: "Test 1a: single upgrade window, interval not within",
I would ditch the identifiers (Test 1a) on the tests; it's not a convention we use in the code base afaik, and it will be weird to maintain as we inject and remove tests. Let the description be the identifier.
}
for _, tt := range tests {
	t.Run(tt.name, func(t *testing.T) {
		if got := isInUpgradeWindow(tt.args.eventList, tt.args.eventInterval); got != tt.want {
This double-checks got against tt.want and then double-prints the output. It can just be:
got := isInUpgradeWindow(tt.args.eventList, tt.args.eventInterval)
assert.Equal(t, tt.want, got, "unexpected result from isInUpgradeWindow")
Testify will show you the difference automatically.
}
replicaCount, _ := checkReplicas(namespace, operator, clientConfig)
if replicaCount == 1 {
	return fmt.Sprintf("%s has only single replica, but Available=False is within an upgrade window and is for less than 10 minutes", operator), nil
Is this single replica a thing for storage? Is there a bug that should be referenced like the above?
}

reason := string(event.Message.Reason)
if reason == "UpgradeStarted" || reason == "UpgradeRollback" {
Use the consts you defined.
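For example, a sketch assuming the consts live alongside IntervalReason in monitorapi:

```go
// Compare against the shared consts instead of re-typed string literals.
reason := event.Message.Reason
if reason == monitorapi.UpgradeStartedReason || reason == monitorapi.UpgradeRollbackReason {
	// open a new upgrade window
}
```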
		From: event.From,
		To:   event.To,
	}
}
Log here: if currentWindow was nil, we should log a warning that we saw an upgrade complete that we didn't see start, which would be a very strange case.
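Something like this sketch; the logging call is an assumption (use whatever logger this package already imports):

```go
if currentWindow == nil {
	// Very strange case: an upgrade completed that we never saw start.
	logrus.Warningf("saw %s without a preceding %s; ignoring",
		monitorapi.UpgradeCompleteReason, monitorapi.UpgradeStartedReason)
	continue
}
```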
	endInterval *monitorapi.Interval
}

var upgradeWindows []*upgradeWindowHolder
I'll leave this for you to decide, but I think an actual interval for the overall upgrade window, built with the logic here, would be generally useful. That would mean a new minimal monitortest that just creates some new intervals with a new Source like UpgradeWindow and sets the from/to appropriately. Then this function could scan all intervals for the UpgradeWindows, see if our interval is within one, and we could chart the overall upgrade start/end times.
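A hedged sketch of what the consuming side could then look like; SourceUpgradeWindow is a hypothetical new Source, and the containment check is illustrative:

```go
// isInUpgradeWindow reports whether interval falls inside any
// previously recorded UpgradeWindow interval.
func isInUpgradeWindow(allIntervals monitorapi.Intervals, interval monitorapi.Interval) bool {
	for _, window := range allIntervals {
		if window.Source != monitorapi.SourceUpgradeWindow { // hypothetical Source
			continue
		}
		if !interval.From.Before(window.From) && !interval.To.After(window.To) {
			return true
		}
	}
	return false
}
```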
		From: event.From,
		To:   event.To,
	}
}
I'm not sure your assumption is correct: did you confirm that UpgradeRollback is not possible after UpgradeComplete? I would expect it to be possible, but I don't know for sure. If so, you'd want to handle currentWindow == nil here and open a new one.
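i.e. a sketch of the defensive handling being suggested, with field names taken from the fragments above:

```go
if currentWindow == nil {
	// A rollback arriving after UpgradeComplete re-opens a window.
	currentWindow = &upgradeWindowHolder{
		startInterval: &monitorapi.Interval{From: event.From, To: event.To},
	}
	upgradeWindows = append(upgradeWindows, currentWindow)
}
```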
Job Failure Risk Analysis for sha: 3c3e6db
Force-pushed from 3c3e6db to fe7305c (Compare)
@DennisPeriquet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Job Failure Risk Analysis for sha: fe7305c
For this test:

[bz-%v] clusteroperator/%v should not change condition/Available

Once the PR where the storage operator stops reporting Available status merges, we can remove the exception for it.