-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-2834] [shim] Add non-YuniKorn allocation tracking logic #918
Conversation
6152e8c
to
5bdaff9
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #918 +/- ##
==========================================
+ Coverage 68.41% 68.65% +0.23%
==========================================
Files 70 70
Lines 7653 7675 +22
==========================================
+ Hits 5236 5269 +33
+ Misses 2208 2200 -8
+ Partials 209 206 -3 ☔ View full report in Codecov by Sentry. |
3c92e97
to
5a5c3ac
Compare
5a5c3ac
to
dd847af
Compare
I think we need to handle orphan pods for foreign pods similar to yunikorn ones. In other words, if a pod references a non-existent node, it needs to be "orphaned" and not sent to the core. If a node appears, any currently orphaned pods (foreign or otherwise) which reference the new node need to be adopted and registered with the core. |
Makes sense. I'll update the PR with these changes tomorrow. |
Also, I'm pretty sure the removal of the context lock broke the logic around ensuring orphan pod handling is done correctly (and is likely responsible for some of the resource mismatches in 1.6.0). This PR was where this was done: #859 I think we need to re-add the locks here. The context lock was not used to guard the context object itself, but to act as a "meta-lock" to ensure that we don't process node and pod events simultaneously. I believe the fix we did in 1.5.2 was sufficient, and we should not have removed the locking in #859. |
@craigcondit I looked at the code in more details, here are my findings:
Is there anything else? |
That seems pretty complete. I'd like to see some tests that verify correctness before and after each of these scenarios. The locking definitely needs to be there in the context regardless. |
Added tests for orphan pods. I didn't do anything with 4) because it's much safer to always send a release request. |
@pbacsko Can you rebase? YUNIKORN-2910 got merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, some minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM
What is this PR for?
Add foreign pod handling logic to the shim. Existing code which deals with
occupiedResources
will be removed in a follow-up PR.What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2791
How should this be tested?
Screenshots (if appropriate)
Questions: