Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TAS: support rank-ordering for Pods #3533

Open
Tracked by #3599
mimowo opened this issue Nov 14, 2024 · 5 comments
Open
Tracked by #3599

TAS: support rank-ordering for Pods #3533

mimowo opened this issue Nov 14, 2024 · 5 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@mimowo
Copy link
Contributor

mimowo commented Nov 14, 2024

What would you like to be added:

For Jobs which provide indexing (like batch/Job) we should place Pods with consecutive indexes (ranks) should be placed as close as possible in the topology tree.

The current implementation places pods pretty much randomly (as they show up in the API server).

Example, we have a jobs with 10pods: 0,1,2,3,4,5,6,7,8,9. We have 3 racks, each with 4 slots.

  • Current possible ordering: [1,4,5,7][0,3,8,9][6,2] - suboptimal because communication 0-1,1-2, 2-3,3-4,5-6,6-7,7-8 cross the rack boundary and so will be slow
  • Wanted: [0,1,2,3][4,5,6,7][8,9] - optimal, only 0-9,3-4,7-8 cross the rank boundary

Why is this needed:

For improved performance of network communication between pods. This is especially important for AI/ML frameworks, where the pods exchange data in the ring structure (like in NCCL).

It is part of #3450

@mimowo mimowo added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 14, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Nov 14, 2024

/assign

@mimowo
Copy link
Contributor Author

mimowo commented Nov 14, 2024

@mimowo
Copy link
Contributor Author

mimowo commented Nov 20, 2024

An extension of this we hear would be useful for our users is to support PodGroups which are managed and indexed by an external controller.

I think that the current (label lookup-based in TopologyUngater, here) mechanism creates the incentive to support this by adding k8s-reserved labels to the custom controllers, which is not healthy.
To resolve this issue I propose API at the workload level in PodSetTopologyRequest, called podIndexLabel (or 2 more to also abstract JobSet, jobIdexLabel and replicatedJobCount). With this API the implemetation of the GenericJob interface will set the values depending on the framework.
For pod groups we could have a label reserved by kueue, like kueue.x-k8s.io/pod-group-index (or like that).

We need a TAS KEP extension for that, and @PBundyra agreed tentatively to work on it.

cc @tenzen-y @mwysokin @mwielgus

@mimowo
Copy link
Contributor Author

mimowo commented Nov 20, 2024

/assign @PBundyra
for the pod groups support and the generalized API

@mimowo
Copy link
Contributor Author

mimowo commented Nov 20, 2024

/assign @mbobrovskyi
for the Kubeflow indexes support. For now just lookup the pod index label in the TopologyUngater around here. It will be later generalized by the new API + e2e test for kubeflow indexing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

3 participants