TRT-1576: Fail if operator has Available=False unless in upgrade window #28735
base: master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DennisPeriquet

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

This will see if my new exception allows the upgrade job to pass despite the single storage operator replica.
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/272b5a20-0187-11ef-95a0-20b3d6d376a7-0
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

retry because the last one didn't really run
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/61bc6960-0194-11ef-8313-791cce82a878-0
Job Failure Risk Analysis for sha: 63d0936
Force-pushed from 63d0936 to 3014822 (Compare)
Job Failure Risk Analysis for sha: 3014822
Job Failure Risk Analysis for sha: d950634
Job Failure Risk Analysis for sha: 2e4493a
/test unit
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8a3d2950-0627-11ef-99cb-168bfde7d9b7-0
Job Failure Risk Analysis for sha: 80a02e7
/test unit
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6ff37c20-0690-11ef-86e4-c1c128b91d20-0
@DennisPeriquet: This pull request references TRT-1576 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from 696a8b8 to efde445 (Compare)
Job Failure Risk Analysis for sha: efde445
Force-pushed from 657ec8b to 89a3143 (Compare)
/test e2e-agnostic-ovn-cmd
/test verify
/test e2e-aws-ovn-cgroupsv2
Job Failure Risk Analysis for sha: b8aec3c
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade
@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4d404fc0-0b34-11ef-921f-8306786e2a9d-0
re: the last /payload with vsphere:

Those two events happened within the upgrade window (but the logs indicate no replicas, which I'm betting is why the test failed).

I'm not clear on why that run has an …

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata.name'
ci-op-6wykcgk2-d2645-7nlts-master-0
ci-op-6wykcgk2-d2645-7nlts-master-1
ci-op-6wykcgk2-d2645-7nlts-master-2
ci-op-6wykcgk2-d2645-7nlts-worker-0-6bdsp
ci-op-6wykcgk2-d2645-7nlts-worker-0-8c5wm
ci-op-6wykcgk2-d2645-7nlts-worker-0-kxfhn

And the cluster was configured for highly-available infrastructure (which includes the registry):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-must-gather/artifacts/must-gather.tar | tar -xOz registry-apps-build02-vmc-ci-openshift-org-ci-op-6wykcgk2-stable-sha256-e7b33149e705570ebcdcebe24c57af8336229175099fb5d53100330fd61015f1/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml | yaml2json | jq -r .status.infrastructureTopology
HighlyAvailable

And yet:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-extra/artifacts/deployments.json | jq -c '.items[] | select(.metadata.name == "image-registry").spec | {replicas, strategy}'
{"replicas":1,"strategy":{"type":"Recreate"}}

I don't think the registry operator should be trying to wake the admin from sleep with an …

[edit: Ah, looks like the 1-replicas may be expected, and the …]
	UpgradeStartedReason  IntervalReason = "UpgradeStarted"
	UpgradeVersionReason  IntervalReason = "UpgradeVersion"
	UpgradeRollbackReason IntervalReason = "UpgradeRollback"
	UpgradeCompleteReason IntervalReason = "UpgradeComplete"
Could you plug these into the spots where they're used in the code? test/e2e/upgrade/upgrade.go looks to be where they're recorded; pkg/monitortestlibrary/platformidentification/upgrade.go is also good to hit, as is pkg/monitortests/node/legacynodemonitortests/kubelet.go.
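For illustration, a minimal sketch of the kind of swap being asked for; the call site below is hypothetical, only the consts come from this PR:

```go
// Hypothetical recording site in test/e2e/upgrade/upgrade.go: instead of
// re-typing the raw string everywhere...
reason := monitorapi.IntervalReason("UpgradeStarted")

// ...reference the shared const, so the spelling cannot drift between
// the code that records the event and the code that consumes it.
reason = monitorapi.UpgradeStartedReason
```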
	// cat e2e-events_20240502-205107.json | jq '.items[] | \
	//   select(.source == "KubeEvent" and .locator.keys.clusterversion? == "cluster") | \
	//   "\(.from) \(.to) \(.message.reason)"'
	func make_standard_upgrade_event_list(events []string) monitorapi.Intervals {
Code style nit: that should be makeStandardUpgradeEventList. Possibly drop the word standard.
Let's not pass in strings that we then parse into timestamps and interval reasons when we're fully in control of the inputs. I would either (a) use the IntervalBuilder to construct a minimal interval, or (b) make a buildUpgradeInterval(reason, timestamp) function that returns an interval.

Likely you don't need this function anymore; your vars below would be something like:

standardEventList := monitorapi.Intervals{
	buildUpgradeInterval(monitorapi.UpgradeCompleteReason, time.Date(2024, 5, 1, 12, 51, 9, 0, time.UTC)),
}
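A sketch of what that helper could look like; the Interval field layout is assumed from the fragments quoted in this diff (event.From, event.To, event.Message.Reason):

```go
// buildUpgradeInterval returns a minimal interval carrying only the
// reason and timestamps that the upgrade-window logic inspects.
func buildUpgradeInterval(reason monitorapi.IntervalReason, at time.Time) monitorapi.Interval {
	return monitorapi.Interval{
		// Field names assumed from usage elsewhere in this PR.
		Message: monitorapi.Message{Reason: reason},
		From:    at,
		To:      at,
	}
}
```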
	eventInterval monitorapi.Interval
}
test1_outside := monitorapi.Interval{ |
Please put all of these inputs with the tests that use them. If they're reused across multiple tests then they would make sense here, but otherwise probably easier to read if they're with the test definition.
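For example, a hedged sketch of the shape being suggested; the interval values here are illustrative, not taken from the PR:

```go
{
	name: "single upgrade window, interval not within",
	args: args{
		eventList: standardEventList,
		// The input lives next to the test that uses it.
		eventInterval: buildUpgradeInterval(monitorapi.UpgradeStartedReason,
			time.Date(2024, 5, 1, 10, 0, 0, 0, time.UTC)),
	},
	want: false,
},
```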
	want bool
}{
	{
		name: "Test 1a: single upgrade window, interval not within",
I would ditch the identifiers (Test 1a) on the tests; it's not a convention we use in the code base afaik, and it will be weird to maintain as we inject and remove tests. Let the description be the identifier.
}
for _, tt := range tests {
	t.Run(tt.name, func(t *testing.T) {
		if got := isInUpgradeWindow(tt.args.eventList, tt.args.eventInterval); got != tt.want {
This double-checks got against tt.want and then double-prints the output. It can just be:
got := isInUpgradeWindow(tt.args.eventList, tt.args.eventInterval)
assert.Equal(t, tt.want, got, "unexpected result from isInUpgradeWindow")
Testify will show you the difference automatically.
}
replicaCount, _ := checkReplicas(namespace, operator, clientConfig)
if replicaCount == 1 {
	return fmt.Sprintf("%s has only single replica, but Available=False is within an upgrade window and is for less than 10 minutes", operator), nil
Is this single replica a thing for storage? Is there a bug that should be referenced like the above?
}

reason := string(event.Message.Reason)
if reason == "UpgradeStarted" || reason == "UpgradeRollback" {
Use the consts you defined.
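For example, a sketch assuming the consts live alongside IntervalReason in monitorapi:

```go
// Compare against the shared consts instead of re-typed string literals.
reason := event.Message.Reason
if reason == monitorapi.UpgradeStartedReason || reason == monitorapi.UpgradeRollbackReason {
	// open a new upgrade window
}
```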
		From: event.From,
		To:   event.To,
	}
}
Log here: if currentWindow was nil, we should log a warning that we saw an upgrade complete that we didn't see start, which would be a very strange case.
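Something like this sketch; the logging call is an assumption (use whatever logger this package already imports):

```go
if currentWindow == nil {
	// Very strange case: an upgrade completed that we never saw start.
	logrus.Warningf("saw %s without a preceding %s; ignoring",
		monitorapi.UpgradeCompleteReason, monitorapi.UpgradeStartedReason)
	continue
}
```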
	endInterval *monitorapi.Interval
}

var upgradeWindows []*upgradeWindowHolder
I'll leave this for you to decide, but I think an actual interval for the overall upgrade window, built with the logic here, would be generally useful. That would mean a new minimal monitortest that just creates some new intervals with a new Source like UpgradeWindow and sets the from/to appropriately. Then this function could scan all intervals for the UpgradeWindows, see if our interval is within one, and we could chart the overall upgrade start/end times.
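A hedged sketch of what the consuming side could then look like; SourceUpgradeWindow is a hypothetical new Source, and the containment check is illustrative:

```go
// isInUpgradeWindow reports whether interval falls inside any
// previously recorded UpgradeWindow interval.
func isInUpgradeWindow(allIntervals monitorapi.Intervals, interval monitorapi.Interval) bool {
	for _, window := range allIntervals {
		if window.Source != monitorapi.SourceUpgradeWindow { // hypothetical Source
			continue
		}
		if !interval.From.Before(window.From) && !interval.To.After(window.To) {
			return true
		}
	}
	return false
}
```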
		From: event.From,
		To:   event.To,
	}
}
I'm not sure your assumption is correct: did you confirm that UpgradeRollback is not possible after UpgradeComplete? I would expect it to be possible, but I don't know for sure. If so, you'd want to handle currentWindow == nil here and open a new one.
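i.e. a sketch of the defensive handling being suggested, with field names taken from the fragments above:

```go
if currentWindow == nil {
	// A rollback arriving after UpgradeComplete re-opens a window.
	currentWindow = &upgradeWindowHolder{
		startInterval: &monitorapi.Interval{From: event.From, To: event.To},
	}
	upgradeWindows = append(upgradeWindows, currentWindow)
}
```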
Job Failure Risk Analysis for sha: 3c3e6db
Force-pushed from 3c3e6db to fe7305c (Compare)
@DennisPeriquet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Job Failure Risk Analysis for sha: fe7305c
For this test:

[bz-%v] clusteroperator/%v should not change condition/Available

Once the PR where the storage operator stops reporting Available status merges, we can remove the exception for it.