
Conversation


@chiayi chiayi commented Aug 28, 2025

Why are these changes needed?

Part of #3902. POC: adds replica group indexing and a host index to multi-host workers.

Related issue number

For: #3902

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@chiayi chiayi force-pushed the multihost-indexing branch 2 times, most recently from 8a74046 to a6b94b3 on August 28, 2025 at 17:32
chiayi commented Aug 28, 2025

@ryanaoleary PTAL when you get the chance.

@@ -27,6 +27,10 @@ const (
NumWorkerGroupsKey = "ray.io/num-worker-groups"
KubeRayVersion = "ray.io/kuberay-version"

// Labels for feature RayMultihostIndexing
RayWorkerReplicaIndexKey = "ray.io/worker-group-replica-id"
Should these be the same as their Raylet label equivalents?

  • ray.io/worker-group-replica-id -> ray.io/tpu-slice-name
  • ray.io/host-index -> ray.io/tpu-worker-id

What do you think @andrewsykim ?
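For concreteness, this is roughly how the constants block above might read if the Raylet-equivalent keys suggested in this comment were adopted; the RayWorkerHostIndexKey name is a placeholder, since the diff shown here only includes the replica-id constant:

// Labels for feature RayMultihostIndexing, aligned with the existing Raylet label keys
RayWorkerReplicaIndexKey = "ray.io/tpu-slice-name"
RayWorkerHostIndexKey    = "ray.io/tpu-worker-id" // placeholder constant name, not in the current diff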

@@ -1328,7 +1328,7 @@ func TestDefaultInitContainerImagePullPolicy(t *testing.T) {
// set ray container imagePullPolicy
worker.Template.Spec.Containers[utils.RayContainerIndex].ImagePullPolicy = tc.imagePullPolicy

podTemplateSpec := DefaultWorkerPodTemplate(ctx, *cluster, *worker.DeepCopy(), podName, fqdnRayIP, "6379")
podTemplateSpec := DefaultWorkerPodTemplate(ctx, *cluster, *worker.DeepCopy(), podName, fqdnRayIP, "6379", "", 0)
Can we add a test with replicaGrpName set and numHostIndex > 1?
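A minimal sketch of such a test, following the call shape shown in the diff above. The fixture name (testCluster), the host-index label key, and the assumption that the new arguments are the replica group name followed by the host index (and that both end up as labels on the returned PodTemplateSpec) are taken from this thread rather than the actual implementation:

func TestDefaultWorkerPodTemplateMultihostIndexing(t *testing.T) {
	ctx := context.Background()

	// Reuse whatever RayCluster fixture pod_test.go already defines;
	// testCluster is a placeholder name for it.
	cluster := testCluster.DeepCopy()
	worker := cluster.Spec.WorkerGroupSpecs[0]
	worker.NumOfHosts = 4

	podName := cluster.Name + "-worker-0"
	fqdnRayIP := "test-raycluster-head-svc.default.svc.cluster.local"

	// Non-empty replica group name and a host index greater than 1.
	podTemplateSpec := DefaultWorkerPodTemplate(ctx, *cluster, *worker.DeepCopy(), podName, fqdnRayIP, "6379", "test-group-replica-0", 2)

	assert.Equal(t, "test-group-replica-0", podTemplateSpec.Labels[utils.RayWorkerReplicaIndexKey])
	// Host-index key taken from the review comment above; the final constant may differ.
	assert.Equal(t, "2", podTemplateSpec.Labels["ray.io/host-index"])
}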

@ryanaoleary
Contributor

It'd be good to make clear the value of this PR.

Currently host and replica indexing for multi-host workers occurs in a separate GKE webhook that injects these values as env vars and a k8s label. The env vars and replicaIndex label are then read from within Ray to set Raylet labels and do things like atomically scale multi-host slices with autoscaling.

This PR moves the logic for indexing KubeRay worker Pods that request TPUs from the webhook into KubeRay itself. By assigning indices as k8s Pod labels directly from KubeRay when the Pods are created, we avoid the need for complicated logic in the TPU webhook that tracks the state of multi-host replicas in a RayCluster using a PodInformer. Since these values are already used in Ray core and libraries like Train to handle the multi-host case, it makes sense to consolidate the logic in KubeRay. Additionally, since KubeRay knows when Pods are deleted, it becomes easier to scale down multi-host replicas atomically. Overall, this PR consolidates logic that is currently spread across the TPU webhook, KubeRay, and Ray core.

The next step after this PR would be to move the environment variable injection that occurs in the TPU webhook to Ray core when the Raylet is started on a node. The worker lifecycle would then look as follows for multi-host workers:

  1. Ray sends a request to scale N workers to satisfy resource requests for TPU
  2. KubeRay scales up a multi-host (NumOfHosts > 1) replica with N Pods (1 Pod per host) and indexes each worker using k8s labels (see the sketch after this list)
  3. When the Raylet is started on each worker Pod, the information in the k8s Pod spec (the ray.io/tpu-worker-id and ray.io/tpu-slice-name labels) is used to set required JAX environment variables like TPU_WORKER_ID, TPU_WORKER_HOSTNAMES, and TPU_NAME, and the corresponding Raylet node labels for label-based scheduling.
  4. When a multi-host worker is deleted by KubeRay, we can check the ray.io/tpu-slice-name label to scale down the entire slice atomically.
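As a rough illustration of step 2 (not the PR's actual implementation), KubeRay could derive one label set per host in a multi-host replica: every Pod in the replica shares a replica ID and each Pod gets its own host index. The helper name is made up, and the literal label keys come from the diff and the review comments above; the final key names are still under discussion in this thread.

import "strconv"

// buildMultihostLabels sketches per-Pod labels for one multi-host replica
// (NumOfHosts > 1): a shared replica ID plus a unique host index per Pod.
func buildMultihostLabels(replicaID string, numOfHosts int32) []map[string]string {
	labelsPerHost := make([]map[string]string, 0, numOfHosts)
	for hostIndex := int32(0); hostIndex < numOfHosts; hostIndex++ {
		labelsPerHost = append(labelsPerHost, map[string]string{
			"ray.io/worker-group-replica-id": replicaID,                               // shared by all Pods in the slice
			"ray.io/host-index":              strconv.FormatInt(int64(hostIndex), 10), // unique per Pod within the slice
		})
	}
	return labelsPerHost
}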

@chiayi chiayi force-pushed the multihost-indexing branch from a6b94b3 to 6935b9e on August 29, 2025 at 17:56