Workload can get stuck indefinitely when using external AdmissionCheck #3543

Open
mimowo opened this issue Nov 15, 2024 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mimowo
Contributor

mimowo commented Nov 15, 2024

What happened:

A workload can get stuck forever with Evicted=True if the external controller sets the admission check state to Retry while the workload already has Evicted=True.

The scenario does not seem to reproduce consistently, but it is the root cause of the issue reported in #3365 (comment). As a consequence, the workload could not get re-admitted.

The issue has a workaround at the level of the external admission check: guard against setting the AC state to Retry while Evicted=True, as here.

Without that guard, Kueue flips the AC state from Retry back to Pending, but the workload stays stuck with Evicted=True forever. This is the final status:

Status:
  Admission Checks:
    Last Transition Time:  2024-11-14T18:09:17Z
    Message:               The workload is pending on Prefetch Admission Check
    Name:                  custom-ac
    State:                 Pending
  Conditions:
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               The workload has failed admission checks
    Observed Generation:   1
    Reason:                Pending
    Status:                False
    Type:                  QuotaReserved
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               At least one admission check is false
    Observed Generation:   1
    Reason:                AdmissionCheck
    Status:                True
    Type:                  Evicted
    Last Transition Time:  2024-11-14T18:08:57Z
    Message:               The workload backoff was finished
    Observed Generation:   1
    Reason:                BackoffFinished
    Status:                True
    Type:                  Requeued
Events:
  Type     Reason                      Age   From                       Message
  ----     ------                      ----  ----                       -------
  Normal   QuotaReserved               13m   kueue-admission            Quota reserved in ClusterQueue cluster-queue, wait time since queued was 0s
  Normal   EvictedDueToAdmissionCheck  13m   kueue-workload-controller  At least one admission check is false
  Warning  Pending                     13m   kueue-admission            The workload has failed admission checks

Some observations: the workload gets re-admitted when we manually set Evicted=False. I expect Kueue should do this on its own.
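
For illustration, the guard used as the workaround in the external controller could look roughly like the sketch below. It is not the code from the linked workaround: the `CheckState`, `Condition`, `isEvicted`, and `nextCheckState` names are hypothetical local stand-ins rather than the Kueue API, and the only rule it encodes is "do not write Retry while Evicted=True".

```go
package main

import "fmt"

// CheckState is a local stand-in for the admission check states
// named in this issue (hypothetical type, not the Kueue API).
type CheckState string

const (
	CheckStatePending CheckState = "Pending"
	CheckStateRetry   CheckState = "Retry"
	CheckStateReady   CheckState = "Ready"
)

// Condition is a simplified stand-in for a workload status condition.
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

// isEvicted reports whether the conditions contain Evicted=True.
func isEvicted(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Evicted" && c.Status == "True" {
			return true
		}
	}
	return false
}

// nextCheckState returns the state the external controller should write.
// It keeps the current state instead of Retry while the workload is
// still marked Evicted=True, which is the guard described above.
func nextCheckState(current, desired CheckState, conds []Condition) CheckState {
	if desired == CheckStateRetry && isEvicted(conds) {
		return current
	}
	return desired
}

func main() {
	evicted := []Condition{{Type: "Evicted", Status: "True"}}
	// Eviction still in progress: the Retry request is held back.
	fmt.Println(nextCheckState(CheckStatePending, CheckStateRetry, evicted))
	// No eviction in progress: Retry is written as requested.
	fmt.Println(nextCheckState(CheckStatePending, CheckStateRetry, nil))
}
```

In a real controller the held-back Retry would presumably be re-evaluated on a later reconcile, once the Evicted condition is no longer True.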

What you expected to happen:

I think Kueue should be able to recover from the situation on its own, and finalize eviction of the workload, allowing it to get re-admitted.

How to reproduce it (as minimally and precisely as possible):

More details are in the linked issue, or @leipanhz can share them, but basically the external AC was setting Retry while Kueue was evicting the workload. I think we should be able to reproduce this with integration tests.

@mimowo mimowo added the kind/bug Categorizes issue or PR as related to a bug. label Nov 15, 2024
@mimowo
Contributor Author

mimowo commented Nov 15, 2024

cc @mbobrovskyi @PBundyra

@mszadkow
Contributor

/assign

@leipanhz

@mimowo Thanks for creating a ticket tracking this.

I observed some unexpected behavior after applying the workaround, so I am commenting here:
In the custom controller, the requeue interval after setting the state to "Retry" is 5 seconds. However, the log shows the reconciler trying to set the AC state from Pending to Retry 28 times within 2 seconds. It seems that although Kueue evicts the workload once the AC is in the Retry state, it un-evicts it and reserves quota right away, so the state goes back to Pending, and then the reconciler sets it to Retry again. It looks like a race condition.
