Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: [nodeutilization]: allow to set/remove a soft taint from overutilized nodes #1625

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tiraboschi
Copy link

@tiraboschi tiraboschi commented Feb 6, 2025

Dynamically set/remove a soft taint (effect: PreferNoSchedule) to/from nodes that the descheduler identified as overutilized according to the nodeutilization plugin.
After #1555 the nodeutilization plugin can consume utilization data from actual kubernetes metrics and not only static reservations (requests).
On the other side, the default kube-scheduler is only going to ensure that Pod's specific resource requests can be satisfied.
The soft taint will simply gave an hint to the scheduler to try avoiding nodes that looks overutilized at
descheduler eyes.
Since it's just a soft taint (do it just if possible) there is no need to restrict it only to a certain amount of nodes in the cluster or be strict on error handling.
With a fine-grained ValidatingAdmissionPolicy (GA in k8s 1.30)
we can enforce that the descheduler-sa is only allowed
to apply/remove soft-taints (enforcing PreferNoSchedule effect)
and with a predefined key-prefix (eventually customizable with a parameter).

TODO: add more unit and e2e tests

Fixes: #1626

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 6, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @tiraboschi!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 6, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @tiraboschi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested review from a7i and damemi February 6, 2025 16:50
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign seanmalloy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@googs1025
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2025
@tiraboschi tiraboschi force-pushed the softtainter branch 2 times, most recently from 61f2819 to 00d0c86 Compare February 11, 2025 15:07
@googs1025
Copy link
Member

googs1025 commented Feb 12, 2025

/cc
I will try to understand this pr

@k8s-ci-robot
Copy link
Contributor

@googs1025: GitHub didn't allow me to request PR reviews from the following users: I, will, try, to, understand, this, pr.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc I will try to understand this pr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@@ -291,19 +291,28 @@ like `kubectl top`) may differ from the calculated consumption, due to these com
actual usage metrics. Metrics-based descheduling can be enabled by setting `metricsUtilization.metricsServer` field.
In order to have the plugin consume the metrics the metric collector needs to be configured as well.
See `metricsCollector` field at [Top Level configuration](#top-level-configuration) for available options.
That gap can only be reduced using the SoftTainter to dinamically apply/remove soft taints (`effect: PreferNoSchedule`) to/from nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if setting taint to node has left the descheduler itself 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the key is on Soft (effect: PreferNoSchedule) taints only. I tend to this that having the descheduler setting also hard taints is a bad idea.
In the last update I also added a ValidatingAdmissionPolicy (GA in k8s 1.30) with a few rules to enforce that the ServiceAcoount used by the scheduler is only allowed to set/delete/amend its own taints (by a fixed key prefix) and only if with effect: PreferNoSchedule.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I am concerned about is whether this is not a feature of the descheduler. I understand that after rescheduling, new pods may still be scheduled to this node again. But we need to think whether adding taints to the descheduler is a good choice. Maybe we should use a custom component to accomplish this? 🤔 I don't seem to see other plugins that have similar operations for modifying node fields or taints (maybe I'm wrong, I haven't delved into all the plugins).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar issue from : kubernetes/node-problem-detector#565

There are similar discussions in the NDP project. I understand that the two projects are different, but when we introduce similar feature, do we have to think clearly or consider it more general? 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the NPD front the discussion was more about hard taints "effect": "NoSchedule" to filter out problematic nodes and yes, in that case all the concerns are valid since, for instance, hard tainting a relevant amount of nodes could lead the cluster in a position where it could not have enough resource to run expected pods.
Here instead we are only talking about soft taints as an hint for the scheduler to avoid overutilized nodes if possible.
Please notice that in the scheduler (intolerable) soft taints are simply counted for the scoring step:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/tainttoleration/taint_toleration.go#L198
But they are not going to affect the filtering step.
And since the soft taints are already counted to weight a node, neither having two independent controllers setting them is really an issue since a node with more than one soft taint will symple receive a worst score than a node with just a single intolerable soft taint.
We can eventually also think about leveraging that counting mechanism also here applying a variable number of soft taints according to how much a node is over utilized to get nodes slightly overutilized preferred against nodes that are considerably over utilized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is: Does this deviate from the original intention of the descheduler? But I understand that the feature is reasonable. 🤔

// IsNodeSoftTainted checks if the node is already tainted
// with the provided key and value and PreferNoSchedule effect.
// If a match is found, it returns true.
func IsNodeSoftTainted(node *v1.Node, taintKey, taintValue string) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsNodeSoftTainted AddOrUpdateSoftTaint RemoveSoftTaint maybe we can remove Soft

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just to emphasize that the descheduler is only supposed to add Soft taints as an hint for the scheduler to avoid overutilized nodes if possible without really interfering with the scheduling process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got this. I missed TaintEffectPreferNoSchedule

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 12, 2025
@tiraboschi tiraboschi force-pushed the softtainter branch 4 times, most recently from 9cbd483 to f57c4e8 Compare February 17, 2025 14:01
…lized nodes

Dynamically set/remove a soft taint (effect: PreferNoSchedule)
to/from nodes that the descheduler identified as overutilized
according to the nodeutilization plugin.
After kubernetes-sigs#1555 the nodeutilization plugin can consume
utilization data from actual kubernetes metrics and
not only static reservations (requests).
On the other side, the default kube-scheduler is only
going to ensure that Pod's specific resource requests
can be satified.
The soft taint will simply gave an hint to the scheduler
to try avoiding nodes that looks overutilized at
the descheduler eyes.
Since it's just a soft taint (do it just if possible)
there is no need to restrict it only to a certain amount
of nodes in the cluster or be strict on error handling.
With a fine-grained `ValidatingAdmissionPolicy` (GA in k8s 1.30)
we can enforce that the `deschduler-sa` is only allowed
to apply/remove soft-taints (enforcing `PreferNoSchedule` effect)
and with a predefined key-prefix
(eventually customizable with a parameter).

TODO: add more unit and e2e tests

Signed-off-by: Simone Tiraboschi <[email protected]>
@JaneLiuL
Copy link
Member

JaneLiuL commented Mar 5, 2025

welcome for First-time contributor~~ look forward to see you more and more PR and contribute~~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: LowNodeUtilization can soft taint overutilized nodes
4 participants