-
Notifications
You must be signed in to change notification settings - Fork 615
[POC] Prototype multi-host indexing #3998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
8a74046
to
a6b94b3
Compare
@ryanaoleary PTAL when you get the chance. |
@@ -27,6 +27,10 @@ const ( | |||
NumWorkerGroupsKey = "ray.io/num-worker-groups" | |||
KubeRayVersion = "ray.io/kuberay-version" | |||
|
|||
// Labels for feature RayMultihostIndexing | |||
RayWorkerReplicaIndexKey = "ray.io/worker-group-replica-id" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be the same as their Raylet label equivalents?
ray.io/worker-group-replica-id
->ray.io/tpu-slice-name
ray.io/host-index
->ray.io/tpu-worker-id
What do you think @andrewsykim ?
@@ -1328,7 +1328,7 @@ func TestDefaultInitContainerImagePullPolicy(t *testing.T) { | |||
// set ray container imagePullPolicy | |||
worker.Template.Spec.Containers[utils.RayContainerIndex].ImagePullPolicy = tc.imagePullPolicy | |||
|
|||
podTemplateSpec := DefaultWorkerPodTemplate(ctx, *cluster, *worker.DeepCopy(), podName, fqdnRayIP, "6379") | |||
podTemplateSpec := DefaultWorkerPodTemplate(ctx, *cluster, *worker.DeepCopy(), podName, fqdnRayIP, "6379", "", 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test with replicaGrpName
set and numHostIndex
> 1
It'd be good to make clear the value of this PR. Currently host and replica indexing for multi-host workers occurs in a separate GKE webhook that injects these values as env vars and a k8s label. The env vars and This PR moves the logic for indexing KubeRay worker Pods that request TPU from the webhook to KubeRay itself. By assigning indices as k8s Pod labels directly from KubeRay when they are created, we avoid the necessity for complicated logic in the TPU webhook that tracks the state of multi-host replicas in a RayCluster using a PodInformer. Since these variables are already used in Ray core and libraries like Train to handle the multi-host case, it makes sense to consolidate the logic in KubeRay. Additionally, since KubeRay is aware of when Pods are deleted, it becomes easier to scale-down multi-host replicas atomically. Overall, this PR is consolidating logic that is currently spread across the TPU webhook, KubeRay, and Ray core. The next step after this PR would be to move the environment variable injection that occurs in the TPU webhook to Ray core when the Raylet is started on a node. The worker lifecycle would then look as follows for multi-host workers:
|
a6b94b3
to
6935b9e
Compare
Why are these changes needed?
Part of #3902. POC, Adds group indexing and host index to multihosted workers.
Related issue number
For: #3902
Checks