TAS: support rank-ordering for Pods #3533
Comments
/assign
A related extension that we hear would be useful to our users is support for PodGroups that are managed and indexed by an external controller. I think the current mechanism (label-based lookup in TopologyUngater, here) creates an incentive to support this by adding k8s-reserved labels to custom controllers, which is not healthy. We need a TAS KEP extension for that, and @PBundyra tentatively agreed to work on it.
/assign @PBundyra
/assign @mbobrovskyi
What would you like to be added:
For Jobs that provide indexing (like batch/Job), Pods with consecutive indexes (ranks) should be placed as close to each other as possible in the topology tree.
The current implementation places Pods in an effectively arbitrary order (the order in which they show up in the API server).
Example: we have a Job with 10 Pods: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. We have 3 racks, each with 4 slots.
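To make the example concrete, here is a minimal sketch of what a rank-ordered placement could look like for 10 Pods on 3 racks with 4 slots each. This is not Kueue's implementation; the function names `assign_ranks_to_racks` and `cross_rack_pairs` are hypothetical, and the greedy "fill racks in order" strategy is only an assumption about one reasonable way to keep consecutive ranks together.

```python
from typing import List


def assign_ranks_to_racks(num_pods: int, rack_capacities: List[int]) -> List[List[int]]:
    """Hypothetical helper: fill racks in order so consecutive ranks share a rack."""
    if num_pods > sum(rack_capacities):
        raise ValueError("not enough slots for all pods")
    placement = []
    next_rank = 0
    for capacity in rack_capacities:
        # Take as many of the remaining ranks as this rack can hold.
        count = min(capacity, num_pods - next_rank)
        placement.append(list(range(next_rank, next_rank + count)))
        next_rank += count
    return placement


def cross_rack_pairs(placement: List[List[int]]) -> int:
    """Count pairs of consecutive ranks (r, r+1) that land in different racks."""
    rack_of = {rank: rack for rack, ranks in enumerate(placement) for rank in ranks}
    total = len(rack_of)
    return sum(1 for r in range(total - 1) if rack_of[r] != rack_of[r + 1])


placement = assign_ranks_to_racks(10, [4, 4, 4])
for rack, ranks in enumerate(placement):
    print(f"rack {rack}: {ranks}")
# rack 0: [0, 1, 2, 3]
# rack 1: [4, 5, 6, 7]
# rack 2: [8, 9]
print(cross_rack_pairs(placement))  # 2
```

With this placement only the pairs (3, 4) and (7, 8) cross a rack boundary, whereas an arbitrary assignment of ranks to slots can scatter neighbouring ranks across racks.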
Why is this needed:
To improve the performance of network communication between Pods. This is especially important for AI/ML frameworks, where the Pods exchange data in a ring structure (as in NCCL).
It is part of #3450