What happened:
A workload can get stuck forever with Evicted=True if the external controller sets the state of the admission check to Retry while Evicted=True.
The scenario does not seem to happen consistently, but this is the root cause of the issue here: #3365 (comment). As a consequence the workload could not get re-admitted.
A workaround exists at the level of the external admission check: guard against setting the AC state to Retry while Evicted=True, as here.
In the failing scenario, Kueue then flips the AC state from Retry to Pending, but the workload remains stuck with Evicted=True forever. This is the final status:
Status:
  Admission Checks:
    Last Transition Time:  2024-11-14T18:09:17Z
    Message:               The workload is pending on Prefetch Admission Check
    Name:                  custom-ac
    State:                 Pending
  Conditions:
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               The workload has failed admission checks
    Observed Generation:   1
    Reason:                Pending
    Status:                False
    Type:                  QuotaReserved
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               At least one admission check is false
    Observed Generation:   1
    Reason:                AdmissionCheck
    Status:                True
    Type:                  Evicted
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               The workload backoff was finished
    Observed Generation:   1
    Reason:                BackoffFinished
    Status:                True
    Type:                  Requeued
Events:
  Type     Reason                      Age  From                       Message
  ----     ------                      ---- ----                       -------
  Normal   QuotaReserved               13m  kueue-admission            Quota reserved in ClusterQueue cluster-queue, wait time since queued was 0s
  Normal   EvictedDueToAdmissionCheck  13m  kueue-workload-controller  At least one admission check is false
  Warning  Pending                     13m  kueue-admission            The workload has failed admission checks
Some observations: the workload gets re-admitted when we manually set Evicted=False. I would expect Kueue to do this on its own.
What you expected to happen:
I think Kueue should be able to recover from the situation on its own, and finalize eviction of the workload, allowing it to get re-admitted.
How to reproduce it (as minimally and precisely as possible):
More details are in the linked issue, or @leipanhz can share them. In short, the external AC was setting Retry while Kueue was evicting the workload. I think we should be able to reproduce this with integration tests.
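For illustration, the workaround guard mentioned above (do not set the AC state to Retry while Evicted=True) might look roughly like the sketch below. This is not the actual code from the external controller: `Condition`, `isConditionTrue` and `canSetRetry` are hypothetical names, and `Condition` is a trimmed-down stand-in for `metav1.Condition` so the example stays self-contained. A real controller would more likely check the workload's `Status.Conditions` with `apimeta.IsStatusConditionTrue` and the `kueue.WorkloadEvicted` condition type.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// isConditionTrue reports whether the named condition is present with Status "True".
func isConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// canSetRetry is the guard (assumed name): the external controller should skip
// setting the admission check state to Retry while the workload is still Evicted=True.
func canSetRetry(conds []Condition) bool {
	return !isConditionTrue(conds, "Evicted")
}

func main() {
	// Conditions taken from the stuck workload status shown in this issue.
	stuck := []Condition{
		{Type: "QuotaReserved", Status: "False"},
		{Type: "Evicted", Status: "True"},
		{Type: "Requeued", Status: "True"},
	}
	fmt.Println(canSetRetry(stuck)) // false: eviction is still in progress

	// A workload that holds quota and is not being evicted.
	admitted := []Condition{
		{Type: "QuotaReserved", Status: "True"},
	}
	fmt.Println(canSetRetry(admitted)) // true
}
```

With such a guard, the reconciler would requeue instead of writing Retry whenever `canSetRetry` returns false. It only avoids the problem on the external controller side and does not make Kueue recover on its own.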
@mimowo Thanks for creating a ticket tracking this.
I observed some unexpected behavior after applying the workaround, so I am commenting here:
In the custom controller, the requeue interval after setting the state to "Retry" is 5 seconds. However, the log shows the reconciler trying to set the AC status from Pending to Retry 28 times within 2 seconds. It seems that although Kueue evicts the workload once the AC is in the Retry state, it un-evicts it and reserves quota right away, so the state goes back to Pending. The reconciler then sets it back to Retry, and the cycle repeats. It looks like a race condition.