Do not consider preemption.borrowWithinCohort in FairSharing preemptions #4165
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
/cc
/hold
If we decide the combination is useful, we may need to rewrite the PR. If we decide the combination is not useful, I would like to follow up with validation which prevents combining the settings.
One more consideration: this PR breaks the current FairSharing feature specification (before this PR, borrowWithinCohort could be involved in the FairSharing decision). Ideally, we want to look for another solution to resolve #3779. If we decide to go with this solution, we might want to prepare a mitigation path, since FairSharing has already been published as a Beta feature.
Changes look great to me. The next step is deciding whether we should take this approach.
I gave it another round of thinking. Actually, I think the root cause for the infinite loop is that the below-threshold check is OR-ed with the fair-sharing strategy, so candidates below the priority threshold bypass the strategy entirely. So, I have another idea: turn this check into a conjunction instead. This will make sure no workloads bypass the strategy, yet we still require the target workloads to be below the priority threshold. I think it is likely a valid use case to safeguard high-priority workloads from fair sharing, and the field allows for that.

EDIT: I would be happy to move forward with the approach in this comment (unless you see some counter-examples) and backport it, regardless of the long-term plans for combining the configurations.
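A minimal Go sketch of that idea, under the assumption above about the shape of the check; all names here (`eligibleBefore`, `belowThreshold`, `strategyAllows`) are illustrative, not the actual Kueue identifiers:

```go
package preemption

// Hypothetical target-eligibility check; illustrative only.

// Before: a candidate below the borrowWithinCohort priority threshold is
// accepted unconditionally, bypassing the fair-sharing strategy. Two queues
// can then keep preempting each other's below-threshold workloads, which is
// one way the infinite preemption loop can arise.
func eligibleBefore(belowThreshold bool, strategyAllows func() bool) bool {
	return belowThreshold || strategyAllows()
}

// After: the threshold becomes an additional filter, so every candidate must
// still satisfy the fair-sharing strategy (e.g. a dominantResourceShare
// comparison) before it can be preempted.
func eligibleAfter(belowThreshold bool, strategyAllows func() bool) bool {
	return belowThreshold && strategyAllows()
}
```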
My concern is that this makes fairSharing complicated, since the idea implies cooperation between fairSharing and borrowWithinCohort. But honestly, I'm not sure which solution is more understandable for users. So, here, it might be better to proceed with your (@mimowo) approach, and then consider the above approach (fairSharing with a filter mechanism) based on user feedback.
This protection is defined on the preempting CQ's side, which means that, for these high-priority workloads to be safe, the setting needs to be applied to all CQs. I think a better interface would be preemption protection defined by the CQ that wants to protect its workloads. Or, for that CQ to simply not borrow in a flavor where it doesn't want to risk preemptions.
This still leads to weird behavior when the workload wants to do reclaimWithinCohort. With this proposal, we would filter valid reclaimWithinCohort targets, so the semantics no longer match non-fair-sharing reclamation. We could add a condition to do reclaim only when not borrowing [1][2], but I'm opposed to introducing more complexity into an already complex algorithm. I think we should leave these decisions to the result of the fair share value, to keep the complexity manageable.

notes:
That was also my thought, but in a way FairSharing is already aware of other candidate-filtering configurations based on other preemption settings.
We already use such filtering. The notes are valid, and I agree this approach leads to some sub-optimal decisions given the dynamic nature of "borrowing" as we progress through candidates. However:
It depends how you look at it. IMO, the complexity would anyway be lower compared to the current "main" branch, because we would essentially commit your changes inside the strategy. In the end, I'm fine with the approach proposed in this PR, as long as we somehow confirm this does not break any business use cases, but I'm worried that confirming this might be difficult in practice given the OSS nature of the software. When we disable configuration, it is generally better to go through some deprecation (feature-gate) process, and in the short term have something which can be cherry-picked as a bugfix with marginal risk of breaking.
Discussed and reached consensus with @mimowo offline. The infinite preemption issue (which this PR fixes), and whether we limit preemptions to a threshold priority (potentially how users expect this parameter to work with FairSharing), are separate issues. Since the latter behavior didn't previously exist in FairSharing, I disagree with the approach of using borrowWithinCohort to introduce it.
@tenzen-y Please take a look at #4173. I think we can add a threshold there instead.
Yes, I agree that supporting the combination needs more discussion first. So, I'm happy to just make it a no-op. However, I would like to update the documentation for borrowWithinCohort accordingly. Ideally, we would also have some warnings in webhooks and on Kueue startup, but if this requires substantial groundwork then I'm ok with just the documentation update.
I think that could be done easily. I guess that we can just pass the Kueue Configuration to the CQ webhook server, and hold it in the webhook. The validation could be performed when a CQ is created or updated.
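A rough Go sketch of that plumbing; the types below (`clusterQueueValidator` and the simplified `ClusterQueueSpec`) are illustrative stand-ins, not the actual Kueue API or webhook code:

```go
package webhooks

import "errors"

// Illustrative, simplified types; the real Kueue API types differ.
type BorrowWithinCohort struct{ Policy string }
type Preemption struct{ BorrowWithinCohort *BorrowWithinCohort }
type ClusterQueueSpec struct{ Preemption *Preemption }

// clusterQueueValidator is a hypothetical webhook handler that holds the
// relevant piece of the Kueue Configuration, passed in at server startup.
type clusterQueueValidator struct {
	fairSharingEnabled bool
}

// validate runs on CQ create and update: when fair sharing is enabled,
// setting preemption.borrowWithinCohort is rejected.
func (v *clusterQueueValidator) validate(spec ClusterQueueSpec) error {
	if v.fairSharingEnabled && spec.Preemption != nil && spec.Preemption.BorrowWithinCohort != nil {
		return errors.New("preemption.borrowWithinCohort cannot be set when fair sharing is enabled")
	}
	return nil
}
```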
Added this validation. I only tested it in unit tests, not in integration tests, as that would have required some refactoring of the webhook integration tests to pass this new configuration. I can update these tests if you think it is worth the effort.
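For illustration, a unit test of the hypothetical validator sketched earlier might look like this (same illustrative names, not the actual test code):

```go
package webhooks

import "testing"

func TestBorrowWithinCohortRejectedUnderFairSharing(t *testing.T) {
	spec := ClusterQueueSpec{
		Preemption: &Preemption{BorrowWithinCohort: &BorrowWithinCohort{Policy: "LowerPriority"}},
	}
	// With fair sharing enabled, the combination must be rejected.
	v := &clusterQueueValidator{fairSharingEnabled: true}
	if err := v.validate(spec); err == nil {
		t.Error("expected validation error when combining borrowWithinCohort with fair sharing")
	}
	// Without fair sharing, borrowWithinCohort remains allowed.
	v = &clusterQueueValidator{fairSharingEnabled: false}
	if err := v.validate(spec); err != nil {
		t.Errorf("unexpected error without fair sharing: %v", err)
	}
}
```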
The strong validation makes me feel uncomfortable - I'm thinking about the migration from classical preemption to fair sharing. Consider the two migration options for users who use borrowWithinCohort:
Also, there is no easy way to experiment (enable / disable fair sharing to compare the impact); it would require scripting the enablement and disablement procedures to make it quick. I think ideally the validation would be "soft" warnings rather than "hard" errors. I would suggest splitting the PR into two:
So that we can make the decisions independently. I'm convinced we want to cherry-pick (1.), but (2.) makes me less comfortable. I think ideally this is not strong validation, but soft. We have two types of validation in core k8s: hard rejections and soft admission warnings. If you @tenzen-y and @gabesaba think strong validation is ok, I'm good to go, but even then I prefer two PRs, to be able to easily bail out of (revert) just the validation.
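To make the soft/hard distinction concrete: in controller-runtime terms, a hard rejection returns an error, while a soft check returns only admission warnings and still admits the object. A sketch, reusing the hypothetical `clusterQueueValidator` types from the earlier sketch:

```go
package webhooks

// Warnings mirrors controller-runtime's admission.Warnings type ([]string):
// returned warnings are surfaced to the client, but the object is admitted.
type Warnings []string

// validateSoft is the "soft" variant of the earlier validate method: instead
// of rejecting the ClusterQueue, it only warns that borrowWithinCohort is a
// no-op while fair sharing is enabled.
func (v *clusterQueueValidator) validateSoft(spec ClusterQueueSpec) (Warnings, error) {
	if v.fairSharingEnabled && spec.Preemption != nil && spec.Preemption.BorrowWithinCohort != nil {
		return Warnings{"preemption.borrowWithinCohort has no effect when fair sharing is enabled"}, nil
	}
	return nil, nil
}
```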
Split out to #4298. Let's discuss there whether we add soft or strong validation, and whether or not to cherry-pick that change.
I fully agree with you.
Removed the validations from this PR, so it only fixes the bug and updates the documentation. I also updated the release note.
/lgtm
@mimowo: once the present PR merges, I will cherry-pick it on top of the requested release branch.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gabesaba, mimowo.
LGTM label has been added. Git tree hash: f7ddced2404225cdff9d823bdb5ca949983d2ded
Leaving the hold label for @gabesaba to release.
/hold
until the new API comments appear in the CRD.
@gabesaba Could you investigate why the new API comments are not recorded in the CRD?
What type of PR is this?
/kind bug
What this PR does / why we need it:
preemption.borrowWithinCohort is being used to override dominantResourceShare, causing preemptions that do not match the scheduling order, resulting in the bug described here. We document preemption.borrowWithinCohort as applying only to the other two preemption policies, not to FairSharing (link).
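As a rough illustration of the resulting behavior (names and structure are hypothetical, not the actual Kueue scheduler code), the fix means fair-sharing target selection ignores the threshold short-circuit:

```go
package preemption

// isFairSharingTarget sketches, with illustrative names, how a preemption
// candidate is judged under fair sharing after this change: purely by
// comparing the dominant resource shares of the two ClusterQueues.
func isFairSharingTarget(candidateQueueShare, preemptorQueueShare float64) bool {
	// Pre-fix behavior (removed): a candidate below the borrowWithinCohort
	// priority threshold could be accepted as a target here even when its
	// queue's share was lower, breaking the scheduling order and enabling
	// the preemption loop described in #3779.
	return candidateQueueShare > preemptorQueueShare
}
```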
Which issue(s) this PR fixes:
Fixes #3779
Special notes for your reviewer:
Does this PR introduce a user-facing change?