
OCPBUGS-28647: tuned: operand: add support for deferred updates #1019

Open

ffromani wants to merge 16 commits into master from tuned-deferred-updates

Conversation

@ffromani ffromani commented Apr 4, 2024

Users can make tuning changes through NTO/TuneD by updating the CRs on a running system.
When a tuning change is made, the NTO sends a SIGHUP to the tuned process, which results in the tuned process rolling back ALL the tuning and then reapplying it.
In some key use cases, this disrupts the workload behavior.

We add an annotation that causes the update not to be applied immediately, but to be deferred until the next tuned restart. The NTO operator will continue merging the tuned objects into the special rendered object as it does today.
The deferred annotation will be "sticky": if even just one of the current tuned objects (e.g. just the PerformanceProfile one) has the annotation, the rendered tuned will have it too.

However, if a deferred update is pending but the tuned objects are edited in such a way that the final rendered update is immediate (= no deferred annotation), then the immediate update will be applied, overwriting the pending deferred update, which is then lost.

Deferred updates can stack, meaning a deferred update can be applied on top of a previous deferred update; the latter simply overwrites the former, as if the former had never been received. An immediate update can overwrite a deferred update, and clear it.
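
For illustration, here is a minimal Go sketch of the sticky-merge rule. The annotation key and helper names are hypothetical (the PR's real helper, util.HasDeferredUpdateAnnotation, appears in the review below):

// deferredAnnotation is a hypothetical key; the actual key used by the PR may differ.
package main

import "fmt"

const deferredAnnotation = "tuned.openshift.io/deferred"

// hasDeferredAnnotation reports whether a Tuned object's annotations carry the deferred key.
func hasDeferredAnnotation(annotations map[string]string) bool {
	_, ok := annotations[deferredAnnotation]
	return ok
}

// renderedIsDeferred applies the sticky rule: the rendered object is deferred
// if at least one of the source Tuned objects carries the annotation.
func renderedIsDeferred(sources []map[string]string) bool {
	for _, anns := range sources {
		if hasDeferredAnnotation(anns) {
			return true
		}
	}
	return false
}

func main() {
	sources := []map[string]string{
		{},                       // default Tuned object: no annotation
		{deferredAnnotation: ""}, // e.g. the PerformanceProfile-derived object
	}
	fmt.Println(renderedIsDeferred(sources)) // true: one annotated source is enough
}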

Like immediate updates, the NTO operand will take care of applying the deferred updates, as follows:

The NTO operand will gain another chance to process tuned objects at startup, right before the main loops start.

If a tuned object is detected and it does NOT have the deferred annotation:

  • the regular reconcile code will work as today
  • the startup reconcile code will not engage

If a tuned object is detected and it DOES HAVE the deferred annotation:

  • the regular reconcile code will set a filesystem flag, move the status to DEGRADED, and ignore the object.
  • at the first subsequent operand restart, the startup reconcile code will notice the filesystem flag, apply the rendered tuned object (granted it DOES have the deferred flag, otherwise it will log and skip), and clear the filesystem flag. The DEGRADED state will be cleared.

In other words, the startup reconcile code will engage IFF the rendered tuned has the deferred annotation and the filesystem flag exists. Likewise, the regular reconcile loop will only set the filesystem flag if it detects a rendered tuned with the deferred annotation.

The default remains immediate updates, for backward compatibility.
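
A sketch of the startup/regular reconcile split described above; the flag file path and function names are made up for illustration, not taken from the PR:

package main

import (
	"fmt"
	"os"
)

// deferredFlagPath is a hypothetical location for the filesystem flag.
const deferredFlagPath = "/var/lib/ocp-tuned/deferred-update-pending"

func deferredFlagExists() bool {
	_, err := os.Stat(deferredFlagPath)
	return err == nil
}

// regularReconcile only sets the flag (and moves the Profile status to
// DEGRADED) when the rendered tuned carries the deferred annotation;
// otherwise it applies the update immediately, as today.
func regularReconcile(renderedIsDeferred bool) {
	if renderedIsDeferred {
		_ = os.WriteFile(deferredFlagPath, nil, 0o644)
		fmt.Println("update deferred; status moved to DEGRADED")
		return
	}
	fmt.Println("update applied immediately")
}

// startupReconcile engages IFF the rendered tuned is deferred AND the
// filesystem flag exists: it applies the pending update and clears the flag,
// which in turn allows the DEGRADED condition to be cleared.
func startupReconcile(renderedIsDeferred bool) {
	if !renderedIsDeferred {
		if deferredFlagExists() {
			fmt.Println("flag set but no deferred annotation: log and skip")
		}
		return
	}
	if !deferredFlagExists() {
		return // nothing pending
	}
	fmt.Println("applying pending deferred update")
	_ = os.Remove(deferredFlagPath)
}

func main() {
	regularReconcile(true) // defers the update, sets the flag
	startupReconcile(true) // on the next operand restart: applies and clears the flag
}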

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 4, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2024
@openshift-ci-robot

@ffromani: This pull request references Jira Issue OCPBUGS-28647, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this: (the PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

ffromani commented Apr 4, 2024

/jira refresh

@openshift-ci openshift-ci bot requested review from MarSik and yanirq April 4, 2024 10:15
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 4, 2024
@openshift-ci-robot

@ffromani: This pull request references Jira Issue OCPBUGS-28647, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gsr-shanks

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from gsr-shanks April 4, 2024 10:15

openshift-ci bot commented Apr 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ffromani ffromani changed the title WIP: OCPBUGS-28647: tuned: operand: add support for deferred updates WIP: POC: OCPBUGS-28647: tuned: operand: add support for deferred updates Apr 4, 2024
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 4, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 20, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 30, 2024
@ffromani ffromani force-pushed the tuned-deferred-updates branch 4 times, most recently from f0b7ece to 2314db4 Compare May 8, 2024 17:19
ffromani commented May 9, 2024

/retest

ffromani commented May 9, 2024

 ERRO[2024-05-09T07:59:44Z] 
  * could not run steps: step e2e-gcp-pao failed: "e2e-gcp-pao" pre steps failed: "e2e-gcp-pao" pod "e2e-gcp-pao-ipi-install-install" failed: could not watch pod: the pod ci-op-t8wxg3mq/e2e-gcp-pao-ipi-install-install failed after 1h0m10s (failed containers: test): ContainerFailed one or more containers exited

still looks like an infra issue to me.

@ffromani ffromani force-pushed the tuned-deferred-updates branch 2 times, most recently from 04635db to f0873a9 Compare May 10, 2024 09:29
@ffromani ffromani changed the title WIP: POC: OCPBUGS-28647: tuned: operand: add support for deferred updates OCPBUGS-28647: tuned: operand: add support for deferred updates May 10, 2024
@ffromani

The PR is now reviewable.

add support to report the deferred status in
conditions.

Signed-off-by: Francesco Romani <[email protected]>
TunedProfiles -> tunedProfiles
trivial rename. The function was not used elsewhere anyway.

Signed-off-by: Francesco Romani <[email protected]>
rework profilesExtract and helper functions
to make them more testable, enable round-tripping,
and add helpers which will be used in the deferred
update functionality.

Signed-off-by: Francesco Romani <[email protected]>
Simplify the interface by packing the many return values into
a single struct, to enable further extensibility.
No intended changes in behavior.

Signed-off-by: Francesco Romani <[email protected]>
Gather the node name once and use it in the lifecycle of
the controller - it is not supposed to change anyway.
Bubble up the client creation, so we don't need to
keep around the kubeconfig.

Signed-off-by: Francesco Romani <[email protected]>
Align to left for readability with no intended changes in behavior.

Signed-off-by: Francesco Romani <[email protected]>
Read extracted profiles on startup and reconstruct
the node state, to be used to implement the
deferred updates feature.
Please note all the operations are non-destructive
and read-only, so they are quite safe.

Signed-off-by: Francesco Romani <[email protected]>
The deferred updates feature needs to perform specific
actions when the tuned daemon is reloaded (and only then).
Add a hook point with specific code for that.

Signed-off-by: Francesco Romani <[email protected]>
Rework how the tuned controller updates its status to make
room for the deferred update feature.

Signed-off-by: Francesco Romani <[email protected]>
Now that all the pieces are in place, we can set
and use the deferred update flag to implement
the desired behavior.

Signed-off-by: Francesco Romani <[email protected]>
clarify the logs when the tuned controller reacts to a write event.

Signed-off-by: Francesco Romani <[email protected]>
Trigger an event when the recommend file is written,
and always do that even if it does not change.
This is needed to react properly when a deferred
update is un-deferred by editing the tuned object.

Without a trigger in this state the k8s object
status wouldn't be correctly updated.

Signed-off-by: Francesco Romani <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
MarSik commented May 22, 2024

@ffromani Will this apply the profile even when you just kill the tuned pod on a node? Or is there logic to wait for clean reboot (say using a /var/run/ file or its absence)?

@ffromani

@ffromani Will this apply the profile even when you just kill the tuned pod on a node? Or is there logic to wait for clean reboot (say using a /var/run/ file or its absence)?

we will apply the profile on each tuned restart, not on node reboot

jmencak commented May 22, 2024

/retest

1 similar comment
jmencak commented May 23, 2024

/retest

@ffromani

level=error msg=These cluster operators were not stable: [node-tuning]

I wonder if my changes are causing this CI failure. Testing on my cluster revealed no issue whatsoever, though.
I'll run a bisect with a test PR.

jmencak commented May 23, 2024

level=error msg=These cluster operators were not stable: [node-tuning]

I wonder if my changes are causing this CI failure. Testing on my cluster revealed no issue whatsoever, though. I'll run a bisect with a test PR.

Good find. Yeah, I suspect updates or flips with ClusterOperator/node-tuning. I've noticed in the past that simple PRs to this repo passed, whereas this bigger change was failing multiple tests.

@ffromani

level=error msg=These cluster operators were not stable: [node-tuning]

I wonder if my changes are causing this CI failure. Testing on my cluster revealed no issue whatsoever, though. I'll run a bisect with a test PR.

#1065


openshift-ci bot commented May 23, 2024

@ffromani: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                             Commit   Details  Required  Rerun command
ci/prow/e2e-aws-operator              5672523  link     true      /test e2e-aws-operator
ci/prow/e2e-aws-ovn                   5672523  link     true      /test e2e-aws-ovn
ci/prow/e2e-gcp-pao                   5672523  link     true      /test e2e-gcp-pao
ci/prow/e2e-gcp-pao-workloadhints     5672523  link     true      /test e2e-gcp-pao-workloadhints
ci/prow/e2e-gcp-pao-updating-profile  5672523  link     true      /test e2e-gcp-pao-updating-profile
ci/prow/e2e-hypershift                5672523  link     true      /test e2e-hypershift
ci/prow/e2e-aws-ovn-techpreview       5672523  link     true      /test e2e-aws-ovn-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ffromani

/hold

let's stop retesting until #1065 sheds some light here

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 23, 2024
jmencak commented May 23, 2024

@ffromani Will this apply the profile even when you just kill the tuned pod on a node? Or is there logic to wait for clean reboot (say using a /var/run/ file or its absence)?

we will apply the profile on each tuned restart, not on node reboot

From my testing, yes, if you delete the TuneD pod, the current code as is will apply the new profile even though it has the deferred annotation set. It will, however, refuse to update the current profile when you create a new Tuned CR with the deferred annotation set.

So far, I've noticed one issue. If I reboot the node and the system has a profile with deferred annotation set, I still see:

$ oc get co/node-tuning
NAME          VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
node-tuning   4.16.0-0.nightly-2024-05-21-221942   True        True          False      16m     Waiting for 1/1 Profiles to be applied

even though the profile has been successfully applied.

@ffromani

So far, I've noticed one issue. If I reboot the node and the system has a profile with deferred annotation set, I still see:

$ oc get co/node-tuning
NAME          VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
node-tuning   4.16.0-0.nightly-2024-05-21-221942   True        True          False      16m     Waiting for 1/1 Profiles to be applied

even though the profile has been successfully applied.

thanks, this is new. I'll check once I get rid of the CI woes

jmencak commented May 23, 2024

thanks, this is new. I'll check once I get rid of the CI woes

I wonder if that could be related. CI doesn't like things like CO objects progressing, etc. Another thing is whether we want to block the update until the next reboot, i.e. what @MarSik hinted at. Should the pod deletion apply the deferred update?

@ffromani

thanks, this is new. I'll check once I get rid of the CI woes

I wonder if that could be related. CI doesn't like things like CO objects progressing, etc. Another thing is whether we want to block the update until the next reboot, i.e. what @MarSik hinted at. Should the pod deletion apply the deferred update?

This needs to be verified (by yours truly), but the failure seems to be in the early stage of CI setup, well before the e2e tests run. And we create profiles with the deferred annotation only in the newly added e2e tests.


@jmencak jmencak left a comment


This is great work. I like the changes. I found only a few nits so far, but I haven't finished the full review; it is a lot of code. Sharing a part of it.

return profileData == string(content)
}

// ProfilesExtract extracts TuneD daemon profiles to tunedProfilesDirCustom directory.
// Returns:
// - True if the data in the to-be-extracted recommended profile or the profiles being
// included from the current recommended profile have changed.
// - If the data changed, the fingerprint of the new profile, or "" otherwise

SuperNit: Let's be consistent.

s/or "" otherwise/or "" otherwise./

Similarly elsewhere, for example in the Daemon/Change struct. I try to add punctuation when a line starts a comment and feels like a sentence/paragraph. I typically don't add it for very short comments that don't feel like a sentence. These are typically on the same line as the code.

recommendedProfileDeps[recommendedProfile] = true
}
extracted := map[string]bool{} // TuneD profile names present in TuneD CR and successfully extracted to tunedProfilesDirCustom
if len(recommendedProfile) == 0 {

This is a "heritage" from my previous code, but I believe that this check should be moved to tuned_parser.go . Then, we can probably remove this function completely and just move the assignment

        // Add the recommended profile itself.
        recommendedProfileDeps[recommendedProfile] = true

If you agree, then there'll be little point in having profilesExtractPathRecommends() either, and we can have the whole code in ProfilesExtract() as it was before, with no need to document/comment/explain the new functions.

return change, profilesFP, nil
}

func profilesFingerprint(profiles []tunedv1.TunedProfile, recommendedProfile string) string {

I believe this function deserves a comment. Something along the lines of:

// profilesFingerprint returns a hash of `recommendedProfile` name joined with the data sections of all TuneD profiles in the `profiles` slice.

func profilesFingerprint(profiles []tunedv1.TunedProfile, recommendedProfile string) string {
h := sha256.New()
h.Write([]byte(recommendedProfile))
for _, prof := range profiles {

Do we expect any issues with profile ordering? Will they always come ordered the same way, e.g. profile a prior to profile b and not vice versa the next time?
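
For illustration, a sketch of an order-independent variant that sorts profiles by name before hashing; this is not the PR's code, and the TunedProfile type is stubbed here (the real one lives in the tuned v1 API package):

package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// TunedProfile stubs the relevant fields of the real API type.
type TunedProfile struct {
	Name string
	Data string
}

// profilesFingerprintSorted hashes the recommended profile name plus the
// data sections of all profiles, sorted by name so the result does not
// depend on the order in which the profiles arrive.
func profilesFingerprintSorted(profiles []TunedProfile, recommendedProfile string) string {
	sorted := make([]TunedProfile, len(profiles))
	copy(sorted, profiles)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	h := sha256.New()
	h.Write([]byte(recommendedProfile))
	for _, prof := range sorted {
		h.Write([]byte(prof.Data))
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := TunedProfile{Name: "a", Data: "[main]\n"}
	b := TunedProfile{Name: "b", Data: "[sysctl]\n"}
	// Same fingerprint regardless of input order.
	fmt.Println(profilesFingerprintSorted([]TunedProfile{a, b}, "openshift-node") ==
		profilesFingerprintSorted([]TunedProfile{b, a}, "openshift-node"))
}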

return change, profilesFP, extracted, recommendedProfileDeps, nil
}

func profilesRepackPath(rfPath, profilesRootDir string) ([]tunedv1.TunedProfile, string, error) {

Why do we need rfPath? Can't we just use tunedRecommendFile? Similarly for profilesRootDir.

Is the name Perl-inspired, from the pack/unpack functions? I believe this function deserves a comment and possibly even a rename. Something along the lines of:

profiles(Re)[Pp]ackFromPath reconstructs the slice of TunedProfile objects out of the TuneD configuration files from `tunedProfilesDirCustom`.
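
To make the suggested comment concrete, a hedged sketch of what such a "repack from path" could look like: walking per-profile directories under a root dir and rebuilding the profile slice. The directory layout, file names, and the TunedProfile stub are assumptions, not the PR's actual code:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// TunedProfile stubs the relevant fields of the real API type.
type TunedProfile struct {
	Name string
	Data string
}

// profilesRepackFromPath rebuilds TunedProfile objects from extracted
// per-profile directories, assuming a layout of <root>/<profile>/tuned.conf.
func profilesRepackFromPath(profilesRootDir string) ([]TunedProfile, error) {
	entries, err := os.ReadDir(profilesRootDir)
	if err != nil {
		return nil, err
	}
	var profiles []TunedProfile
	for _, entry := range entries {
		if !entry.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(profilesRootDir, entry.Name(), "tuned.conf"))
		if err != nil {
			return nil, err
		}
		profiles = append(profiles, TunedProfile{Name: entry.Name(), Data: string(data)})
	}
	return profiles, nil
}

func main() {
	profiles, err := profilesRepackFromPath("/etc/tuned") // illustrative root
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, p := range profiles {
		fmt.Printf("repacked profile %q (%d bytes)\n", p.Name, len(p.Data))
	}
}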

return profiles, recommendedProfile, nil
}

func ProfilesRepack() ([]tunedv1.TunedProfile, string, error) {

Do we need to export this? Also, do we need such a "shallow" function?

klog.Error(err.Error())
return false // retry later
}
func (c *Controller) changeSyncerPostReload(change Change) bool {

I believe we need some comments to describe the general purpose of this function. Possibly even adjust its name to changeSyncerPostDaemonReload?

}

klog.Infof("changeSyncerPostReload(): current effective profile fingerprint %q -> %q", c.daemon.profileFingerprintEffective, profileFP)

Should we log this every time even though the fingerprint hasn't changed? I.e., does it bring any value or are we just polluting logs?

klog.Infof("changeSyncer(%s)", change.String()) // TODO v=2
defer klog.Infof("changeSyncer(%s) done", change.String()) // TODO v=2
// Sync internal status after a TuneD reload
_ = c.changeSyncerPostReload(change)

Do we need the return value for anything?


@jmencak jmencak left a comment


Adding a few more comments. In my view, the main things that need to be resolved:

  • CI issues
  • As agreed over Slack, the pod restart should not apply a new profile when deferred annotation is set.
  • co/node-tuning status

After that, I'll do another round of review and testing.

@@ -88,7 +88,7 @@ func InitializeStatusConditions() []tunedv1.ProfileStatusCondition {
// 'status' contains all the information necessary for creating a new slice of
// conditions apart from LastTransitionTime, which is set based on checking the
// old conditions.
func computeStatusConditions(status Bits, stderr string, conditions []tunedv1.ProfileStatusCondition) []tunedv1.ProfileStatusCondition {
func computeStatusConditions(status Bits, message string, conditions []tunedv1.ProfileStatusCondition) []tunedv1.ProfileStatusCondition {

An additional parameter was added to the function. Let's explain what it does by updating the comments above.

}

var message string
deferred := (change.deferred || util.HasDeferredUpdateAnnotation(profile.Annotations))

Do we need both or is util.HasDeferredUpdateAnnotation(profile.Annotations) enough?

klog.V(1).Infof("write event on: %s", fsEvent.Name)
// Notify the event processor that the TuneD daemon calculated new kernel command-line parameters.
klog.Infof("write event on: %s", fsEvent.Name)
// Notify the event processor that profiles where unpacked on the node or

This comment is slightly misleading. We're watching writes to the bootcmdline and recommend files, not unpacking profiles on the node. They are unpacked, but that's not why the write event was triggered.
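
For context, a minimal fsnotify sketch of this kind of write-event watching; the watched paths are illustrative, not the operand's actual configuration:

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the files whose writes drive the event processor.
	for _, path := range []string{
		"/etc/tuned/recommend.d/50-openshift.conf", // recommend file (illustrative path)
		"/run/tuned/bootcmdline",                   // bootcmdline (illustrative path)
	} {
		if err := watcher.Add(path); err != nil {
			log.Fatal(err)
		}
	}

	for event := range watcher.Events {
		if event.Op&fsnotify.Write != 0 {
			log.Printf("write event on: %s", event.Name)
			// Notify the event processor here.
		}
	}
}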

}

klog.Infof("changeSyncerPostReload(): current effective profile fingerprint %q -> %q", c.daemon.profileFingerprintEffective, profileFP)
c.daemon.profileFingerprintEffective = profileFP
c.daemon.status &= ^scDeferred // force clear even if it was never set. No biggie.

Can we drop the familiar "No biggie."?

} else if (status & scDeferred) != 0 {
tunedDegradedCondition.Status = corev1.ConditionTrue
tunedDegradedCondition.Reason = "TunedDeferredUpdate"
tunedDegradedCondition.Message = "Profile will be applied at the next daemon restart" + deferredMessage

We should probably be consistent here

s/next daemon restart/next daemon restart: /

@@ -0,0 +1,407 @@
package e2e

Do the tests in e2e/deferred/ actually run? I don't see the Makefile adjusted so that they run automatically.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.