
[BUG] deviceshare plugin not handle AddPod\RemovePod correctly #1959

Open
buptcozy opened this issue Mar 18, 2024 · 1 comment
@buptcozy
Contributor

What happened:

Generally, when the AddPod logic is executed, the pod being added may still be in a scheduling (nominated) state, so it does not yet exist in nodeDeviceCache's used map. As a result, when the framework executes RunFilterPluginsWithNominatedPods and calls AddPod for higher-priority pods, the plugin cannot reserve resources for those pods. In RDMA/VF/NVSwitch scenarios this can cause high-priority pods to fail assignment because some of the resources have already been assigned to lower-priority pods. So we reuse the "Reserve" logic to generate an assignment placement and save it in a nominator cache. We clear the nominator cache in both "Reserve" and "Unreserve", which means the cleanup happens whether the assignment succeeds or not, matching the nomination process of the upstream Kubernetes scheduler framework.
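
To make the intent concrete, here is a minimal sketch of the nominator-cache idea, written with hypothetical Go types (devicePlacement, nominator) rather than the actual koordinator code: the placement computed by the reused Reserve logic is stored per pod, and the same delete path is called from both Reserve and Unreserve so the cache is cleaned regardless of the outcome.

```go
package deviceshare

import "sync"

// devicePlacement is a hypothetical stand-in for a device assignment,
// e.g. which GPU/RDMA VF/NVSwitch minors a pod would receive on a node.
type devicePlacement map[string][]int // resource name -> device minors

// nominator caches tentative placements for nominated (preempting) pods so
// that later scheduling cycles can account for them while those pods wait in
// the backoff queue.
type nominator struct {
	sync.Mutex
	placements map[string]devicePlacement // pod UID -> placement
}

func (n *nominator) addNominatedPlacement(podUID string, p devicePlacement) {
	n.Lock()
	defer n.Unlock()
	n.placements[podUID] = p
}

// deleteNominatedPlacement is called from both Reserve and Unreserve, so the
// cache is cleared whether the assignment succeeds or fails, mirroring how the
// upstream scheduler clears its own nominated-pod bookkeeping.
func (n *nominator) deleteNominatedPlacement(podUID string) {
	n.Lock()
	defer n.Unlock()
	delete(n.placements, podUID)
}
```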

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • App version:
  • Kubernetes version (use kubectl version):
  • Install details (e.g. helm install args):
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version:
    • Kernel version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
buptcozy added the kind/bug (Create a report to help us improve) label on Mar 18, 2024
buptcozy pushed commits to buptcozy/koordinator that referenced this issue on Mar 18 and Mar 20, 2024
@ZiMengSheng
Contributor

ZiMengSheng commented Mar 25, 2024

Problem Description

Let's analyze the following example.

  1. PodA with low priority requests 8 GPUs and is scheduled to node1.
  2. PodB with high priority requests 4 GPUs and preempts PodA; it is expected to reserve GPUs 0-3, status.nominatedNodeName is updated to node1, and PodB enters the backoffQ.
  3. PodC with mid priority enters the scheduling cycle, requests 4 GPUs, and is scheduled to node1 without considering PodB's preemption result, so it may unexpectedly use GPUs 0-3 (see the sketch below).
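
A minimal sketch of the gap, assuming hypothetical pared-down types (nodeDevice, nodeDeviceCache) whose used map only tracks devices of pods that are already bound or assumed on the node; since the nominated PodB never appears there, the Filter pass for PodC still sees GPUs 0-3 as free:

```go
package deviceshare

// nodeDevice and nodeDeviceCache are hypothetical stand-ins, not the real
// plugin code: used only contains pods already bound or assumed on the node.
type nodeDevice struct {
	used map[string][]int // pod UID -> GPU minors actually held on the node
}

type nodeDeviceCache struct {
	nodes map[string]*nodeDevice // node name -> device state
}

// usedDevices illustrates the gap: the nominated-but-unbound PodB has no entry
// in used, so when the framework calls AddPod(PodB) while scheduling PodC,
// nothing is subtracted and GPUs 0-3 still look free to PodC's Filter.
func (c *nodeDeviceCache) usedDevices(nodeName, podUID string) []int {
	nd := c.nodes[nodeName]
	if nd == nil {
		return nil
	}
	return nd.used[podUID]
}
```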

Suggested Proposal

Let's walk through the design with the following examples.

Example1

  1. PodA with low priority requests 8 GPUs and is scheduled to node1.
  2. PodB with high priority requests 4 GPUs and preempts PodA; ReserveNominatedPod(PodB) is invoked to reserve PodB's nominated resource (GPUs 0-3), status.nominatedNodeName is updated to node1, and PodB enters the backoffQ.
  3. PodC with mid priority enters the scheduling cycle and requests 4 GPUs. In the Filter phase, the framework invokes RunPreFilterExtensionAddPod for higher-priority pods such as PodB. This is our chance to keep PodB's nominated resource reserved within the current scheduling cycle, so PodC cannot use GPUs 0-3 (see the sketch below).
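
A minimal sketch of how step 3 could look in the plugin's AddPod extension. The AddPod signature is the standard PreFilterExtensions one from k8s.io/kubernetes/pkg/scheduler/framework; the nominator cache, preFilterState, and the state key are hypothetical and not the actual koordinator implementation:

```go
package deviceshare

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	framework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// devicePlacement repeats the hypothetical type from the earlier sketch.
type devicePlacement map[string][]int

// nominatorCache is a hypothetical lookup of cached nominated placements.
type nominatorCache interface {
	getNominatedPlacement(podUID string) (devicePlacement, bool)
}

// preFilterState is a hypothetical per-cycle state that tracks devices already
// claimed by nominated pods on each node.
type preFilterState struct {
	nominatedDevices map[string]devicePlacement // node name -> claimed devices
}

func (s *preFilterState) Clone() framework.StateData { return s }

// Plugin is a pared-down stand-in for the deviceshare plugin.
type Plugin struct {
	nominator nominatorCache
}

// AddPod is called by the framework when it "adds" the higher-priority
// nominated PodB while scheduling PodC. The plugin pulls PodB's cached
// placement and marks those devices (e.g. GPUs 0-3) as occupied for the rest
// of this cycle, so PodC's Filter cannot take them.
func (p *Plugin) AddPod(ctx context.Context, cycleState *framework.CycleState,
	podToSchedule *corev1.Pod, podInfoToAdd *framework.PodInfo,
	nodeInfo *framework.NodeInfo) *framework.Status {
	placement, ok := p.nominator.getNominatedPlacement(string(podInfoToAdd.Pod.UID))
	if !ok {
		return nil // not a nominated pod we reserved for; nothing to do
	}
	s, err := cycleState.Read("PreFilterDeviceShare") // hypothetical state key
	if err != nil {
		return framework.AsStatus(err)
	}
	state, ok := s.(*preFilterState)
	if !ok {
		return framework.NewStatus(framework.Error, "unexpected PreFilter state type")
	}
	state.nominatedDevices[nodeInfo.Node().Name] = placement
	return nil
}
```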

Example2

  1. PodA with low priority requests 8 GPUs and is scheduled to node1.
  2. PodB with high priority requests 4 GPUs and preempts PodA; ReserveNominatedPod(PodB) is invoked to reserve PodB's nominated resource (GPUs 0-3), status.nominatedNodeName is updated to node1, and PodB enters the backoffQ.
  3. PodC with higher-than-PodB priority requests 4 GPUs and is scheduled to node1, where it is normally allocated GPUs 0-3. This overlaps with PodB's nominated resource, so we need to invalidate PodB's outdated nominated resource here, which makes the fix best-effort (see the sketch below).
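
A minimal sketch, with hypothetical names, of the best-effort invalidation in step 3: when a later pod's real allocation overlaps a cached nominated placement, that placement is stale and is dropped so it no longer blocks devices it will not actually receive:

```go
package deviceshare

// devicePlacement repeats the hypothetical type from the earlier sketches.
type devicePlacement map[string][]int // resource name -> device minors

// overlaps reports whether two placements claim any device minor in common.
func overlaps(a, b devicePlacement) bool {
	for res, minors := range a {
		taken := map[int]bool{}
		for _, m := range b[res] {
			taken[m] = true
		}
		for _, m := range minors {
			if taken[m] {
				return true
			}
		}
	}
	return false
}

// invalidateOutdated removes every nominated placement on the node that
// overlaps the allocation just made for the newly scheduled pod (PodC with
// higher priority in Example2). PodB will recompute its placement when it is
// retried, which is why the overall fix is best-effort.
func invalidateOutdated(nominated map[string]devicePlacement, allocated devicePlacement) {
	for podUID, p := range nominated {
		if overlaps(p, allocated) {
			delete(nominated, podUID)
		}
	}
}
```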

Scheduling Interpretability

  1. We need sufficient metrics or a debug service to help us diagnose pending pods and explain to users why a pod is pending (a possible sketch follows).
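
As one possibility, a sketch of how such diagnostics could be exposed with prometheus/client_golang; the metric name, labels, and helper are hypothetical and not what koordinator actually exports:

```go
package deviceshare

import (
	"github.com/prometheus/client_golang/prometheus"
)

// nominatedReservations counts how often a nominated pod's device placement
// was reserved, invalidated, or cleared, so the history behind a pending pod
// can be reconstructed from the time series.
var nominatedReservations = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "deviceshare_nominated_reservation_events_total",
		Help: "Nominated device reservation events by node and outcome.",
	},
	[]string{"node", "event"}, // event: reserved | invalidated | cleared
)

func init() {
	prometheus.MustRegister(nominatedReservations)
}

// recordNominatedEvent is called wherever the nominator cache changes.
func recordNominatedEvent(node, event string) {
	nominatedReservations.WithLabelValues(node, event).Inc()
}
```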

ZiMengSheng added this to the v1.6 milestone on May 7, 2024