-
Notifications
You must be signed in to change notification settings - Fork 110
OCPBUGS-28647: tuned: operand: add support for deferred updates #1019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-28647: tuned: operand: add support for deferred updates #1019
Conversation
@ffromani: This pull request references Jira Issue OCPBUGS-28647, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/jira refresh |
@ffromani: This pull request references Jira Issue OCPBUGS-28647, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ffromani The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ab08d68
to
7a63c46
Compare
7a63c46
to
30cdab0
Compare
30cdab0
to
5376eac
Compare
f0b7ece
to
2314db4
Compare
/retest |
stil looks as infra issue to me. |
04635db
to
f0873a9
Compare
the PR is now reviewable |
@yanirq in Martin's presence, I think a high-level assessment if this does what we set out to do would be useful. What I have in mind how is this going to be used. Are we going to rely on users to set the deferred annotation in patch profiles? Or do we expect PPC to set this? If the latter, i believe a follow-up PR(s) will be necessary. |
yes, this is the infra work. We will need followup PR(s) for the performanceprofile to consume this work, because performanceprofile is a higher level construct building on NTO. |
regarding the behavior of the change, we have quite some decent e2e test coverage which ensure and document the observable behavior. More reviews are welcome (and have always be welcome in the past months) |
I guess we are waiting for the |
Sometimes these go away automatically, sometimes "slash test all" helps. Tried |
/test all |
it seems the label requirement wasn't lifted yet :\ |
/label acknowledge-critical-fixes-only |
@ffromani If we're on a hurry most of my comments can be postpone for future improvement. |
/hold cancel got another review |
@ffromani: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@ffromani: Jira Issue OCPBUGS-28647: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-28647 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
xref: openshift#1019 (comment) Signed-off-by: Francesco Romani <[email protected]>
WIP xref: openshift#1019 (comment) Signed-off-by: Francesco Romani <[email protected]>
Simplify the flow aligning code to the left. No intended changes in behavior. xref: openshift#1019 (comment) Signed-off-by: Francesco Romani <[email protected]>
xref: openshift#1019 (comment) Signed-off-by: Francesco Romani <[email protected]>
Simplify the flow aligning code to the left. No intended changes in behavior. xref: openshift#1019 (comment) Signed-off-by: Francesco Romani <[email protected]>
followup: #1118 |
Fix the runtime directory to be forward compatible xref: openshift#1019 (comment) Signed-off-by: Francesco Romani <[email protected]>
* tuned: controller: pack profilesExtract return values in struct xref: #1019 (comment) Signed-off-by: Francesco Romani <[email protected]> * tuned: controller: align to left Simplify the flow aligning code to the left. No intended changes in behavior. xref: #1019 (comment) Signed-off-by: Francesco Romani <[email protected]> * tuned: controller: narrow down reapplySysctl Simplify the code with no intended changes in behavior. Signed-off-by: Francesco Romani <[email protected]> * e2e: deferred: add ginkgo labels and update the tags to match. This way we can easily skip the flaky tests The naming schema is made consistent with performanceprofile labels: test/e2e/performanceprofile/functests/utils/label/label.go note the recommended kube model is to use `CapitalizedNouns`, while we do `lowercase-dash-separated`. Signed-off-by: Francesco Romani <[email protected]> * e2e: skip flaky tests in nightly runs nightlies call `make test-e2e` so skip the flaky tests here. Presubmit lanes (= running pre-merge) should still run all tests Signed-off-by: Francesco Romani <[email protected]> * tuned: controller: fix run directory Fix the runtime directory to be forward compatible xref: #1019 (comment) Signed-off-by: Francesco Romani <[email protected]> --------- Signed-off-by: Francesco Romani <[email protected]>
Users are able to make tuning changes through NTO/TuneD by updating the CRs on a running system.
When the tuning change is done, the NTO causes the tuned process to restarts, whose unwanted side effect is the tuned process rolling back ALL the tuning and then reapplying it.
In some key use cases, this disrupts the workload behavior.
We add an annotation that cause the update not to be applied immediately, but to be deferred till the next node restart. The NTO operator will continue handle the tuned objects as it does today.
This includes unpacking all profiles on the node.
The deferred annotation would be deduplicated from the
tuned
object to all the containedtunedprofiles
. If the recommended profile computed for a node is originated from atuned
object which has the annotation, then the deferred annotation will be "sticky", and the resulting update will be deferred.However, if there is a deferred updated pending but the tuned objects are edited to remove the annotation, then the update will become
immediate
(like they were before this feature and like they are by default without any annotation), and they be applied overwriting the pending deferred update, which will then be lost.This applies in the case of the recommended profile will be computed from a non-deferred update.
The deferred updates can stack, meaning a deferred update can be applied on top of a previous deferred update; the latter will just overwrite the former as it never was received. An immediate update can overwrite a deferred update, and clear it.
Like the immediate update, the NTO operand will take care of applying the deferred updates, as such:
The NTO operand will gain another chance to process tuned objects at startup, right before the main loops start.
If a tuned object is detected and it has NOT the deferred annotation:
If a tuned object is detected and it DOES HAVE the deferred annotation:
IOW, the startup reconcile code will engage IFF the rendered tuned has the deferred annotation and the filesystem flag does exist. Likewise, the regular reconcile loop will only set the filesystem flag if detects a rendered tuned with the deferred annotation.
The default is obviously kept to immediate updates, for backward compatibility.