
Conversation

rueian
Contributor

@rueian rueian commented Aug 26, 2025

In-place Pod Resizing (IPPR) Integration

Things that have been done:

  1. IPPR JSON configuration and validation, letting users enable the IPPR integration with Autoscaler v2.
  2. Resizing a pod's CPU and memory requests and limits to the configured maximums in a single step.

Configuration and Validation

Users can provide a ray.io/ippr annotation on their RayCluster CR to enable IPPR with Autoscaler v2:

{
  "groups": {
    "<groupName>": {
      "max-cpu":     string|number,  # K8s quantity (e.g. "2", "1500m")
      "max-memory":  string|integer, # K8s quantity (e.g. "8Gi", 2147483648)
      "resize-timeout": integer      # Seconds to wait for a pod resize to
                                     # complete before considering it timed out
    },
    ...
  }
}

Each <groupName> must match the name of a Ray worker group, and within each group, max-cpu, max-memory, and resize-timeout are all mandatory.
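
For example, a worker group named small-group capped at 14 CPUs and 32Gi of memory would carry an annotation like the following (the values here are illustrative):

{
  "groups": {
    "small-group": {
      "max-cpu": "14",
      "max-memory": "32Gi",
      "resize-timeout": 60
    }
  }
}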

Besides the above configuration, we also validate the following (sketched in code after the list):

  1. The corresponding worker groups don't set num-cpus or memory in their rayStartParams, since those would let Ray's logical resources drift out of sync with the pod's actual resources.
  2. Worker groups have cpu and memory resource requests specified in their container specs.
  3. The container's resizePolicy sets restartPolicy to NotRequired, so a resize doesn't restart the container.
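
A rough Python sketch of these checks (illustrative only, not the PR's actual implementation):

# Illustrative sketch of the validation rules above; the helper name and
# data shapes are assumed, not the PR's actual code.
def validate_ippr_group(group_name: str, spec: dict, worker_group: dict) -> None:
    for key in ("max-cpu", "max-memory", "resize-timeout"):
        if key not in spec:
            raise ValueError(f"{group_name}: '{key}' is mandatory")

    ray_start_params = worker_group.get("rayStartParams", {})
    for param in ("num-cpus", "memory"):
        if param in ray_start_params:
            raise ValueError(
                f"{group_name}: remove '{param}' from rayStartParams; it can "
                "desync Ray's logical resources from the pod's resources"
            )

    container = worker_group["template"]["spec"]["containers"][0]
    requests = container.get("resources", {}).get("requests", {})
    if "cpu" not in requests or "memory" not in requests:
        raise ValueError(f"{group_name}: cpu and memory requests are required")

    # K8s resizePolicy is a list of {resourceName, restartPolicy} entries.
    policies = {
        p["resourceName"]: p["restartPolicy"]
        for p in container.get("resizePolicy", [])
    }
    if any(policies.get(r) != "NotRequired" for r in ("cpu", "memory")):
        raise ValueError(
            f"{group_name}: resizePolicy.restartPolicy must be NotRequired"
        )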

Resize Behavior

The current implementation tries to resize existing nodes to the user-specified maximums in one step whenever there are pending tasks that would fit on those nodes after resizing. Gradual resizing and downsizing will follow in later PRs. The detailed behavior (modeled in the sketch after the list) is:

  1. After packing pending tasks onto the existing nodes at their current capacities, the autoscaler tries to fit the remaining pending tasks onto the nodes that have no resize in flight, this time using the maximum capacities specified in the ray.io/ippr annotation. For each node that can take some of the remaining tasks this way, the autoscaler sends a k8s resize request and records the resize status in a pod annotation, ray.io/ippr-status, at the end of the current reconciliation.
  2. If pending tasks still remain, the autoscaler falls back to the original horizontal scale-out, taking the maximum capacity of each worker type into account.
  3. At the beginning of the next reconciliation, the autoscaler determines the next step for each resize sent at the end of the previous reconciliation by inspecting its status. There are two cases:
    a) Finish the resize by adjusting the logical resources on the Raylet and updating its ray.io/ippr-status.
    b) Adjust the resize by queueing a new k8s resize request after a timeout or an error.
    Note that if the RPC that adjusts the logical resources on the Raylet fails, the autoscaler retries in the next reconciliation, since the corresponding ray.io/ippr-status is left unchanged.
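
As a self-contained toy model of steps 1 and 2 (step 3's status handling is omitted, and all names and data shapes are hypothetical rather than the PR's actual scheduler code):

from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    cpu: float                  # current CPU capacity
    max_cpu: float              # max-cpu from the ray.io/ippr annotation
    used: float = 0.0
    resize_pending: bool = False

def fit(tasks: List[float], nodes: List[Node], use_max: bool) -> List[float]:
    """First-fit task CPU demands onto nodes; in max mode, exceeding the
    current capacity queues an in-place resize instead of a new pod."""
    remaining = []
    for cpu in tasks:
        placed = False
        for node in nodes:
            if use_max and node.resize_pending:
                continue  # step 1 skips nodes with a resize in flight
            cap = node.max_cpu if use_max else node.cpu
            if node.used + cpu <= cap:
                node.used += cpu
                if use_max and node.used > node.cpu:
                    # would send a k8s resize request and record it in the
                    # ray.io/ippr-status pod annotation at end of the pass
                    node.resize_pending = True
                placed = True
                break
        if not placed:
            remaining.append(cpu)
    return remaining

def reconcile(tasks: List[float], nodes: List[Node]) -> int:
    remaining = fit(tasks, nodes, use_max=False)     # current capacities
    remaining = fit(remaining, nodes, use_max=True)  # retry at max capacities
    return len(remaining)  # step 2: leftovers trigger horizontal scale-out

nodes = [Node(cpu=8, max_cpu=14)]
print(reconcile([7.0, 6.0, 20.0], nodes))  # -> 1 (the 20-CPU task scales out)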

Additional notes

  1. IPPR is Kubernetes-specific right now. We may need to revisit the current IPPRSpecs and IPPRStatus structures that are transferred between the scheduler and providers when a similar resizing capability comes to VMs.
  2. The current implementation uses grpcio to connect to the Raylet. We may want to switch to a Cython binding instead.
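
For reference, a per-group spec presumably mirrors the annotation fields; a minimal sketch (field names assumed, not necessarily the PR's exact definition in schema.py):

from dataclasses import dataclass

@dataclass
class IPPRGroupSpec:
    # Mirrors the ray.io/ippr annotation fields for one worker group.
    # Field names here are assumed for illustration.
    max_cpu: str         # K8s quantity, e.g. "14" or "1500m"
    max_memory: str      # K8s quantity, e.g. "32Gi"
    resize_timeout: int  # seconds before a pending resize is considered timed out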

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the autoscaler-ippr branch 3 times, most recently from 07edb0f to cd521c0 Compare August 26, 2025 20:55
@rueian rueian requested a review from Copilot August 27, 2025 00:10
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces initial support for In-Place Pod Resize (IPPR) functionality in the Ray autoscaler for KubeRay clusters. IPPR allows pods to be resized without termination, improving resource utilization and reducing scheduling overhead by dynamically adjusting CPU and memory allocations based on demand.

Key changes:

  • Adds IPPR schema validation and typed data structures for group specifications and pod status tracking
  • Implements IPPR provider for KubeRay to handle resize requests and synchronization with Raylets
  • Integrates IPPR logic into the resource demand scheduler to prefer in-place resizing over launching new nodes

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:

  • python/ray/autoscaler/v2/schema.py: Defines IPPR data structures including IPPRSpecs, IPPRGroupSpec, and IPPRStatus
  • python/ray/autoscaler/v2/scheduler.py: Integrates IPPR into the scheduling logic to consider resizing existing pods before launching new ones
  • python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/ippr_provider.py: New provider implementing IPPR operations including validation, pod resizing, and Raylet synchronization
  • python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py: Integrates the IPPR provider into the KubeRay cloud provider
  • python/ray/autoscaler/v2/instance_manager/reconciler.py: Connects IPPR functionality to the main autoscaler reconciliation loop
  • python/ray/autoscaler/v2/tests/test_ippr_provider.py: Comprehensive test suite for IPPR provider functionality
  • python/ray/autoscaler/v2/tests/test_scheduler.py: Tests for IPPR integration in the scheduler


@rueian rueian changed the title [core][autoscaler][IPPR] Initial impl for resizing Pods in-place to the maximum configured by the user [core][autoscaler][IPPR] Initial impl for resizing pods in-place to the maximum configured by the user Aug 27, 2025
@rueian rueian added the go add ONLY when ready to merge, run all tests label Aug 27, 2025
@rueian rueian force-pushed the autoscaler-ippr branch 4 times, most recently from f661d9b to 388ba37 Compare August 27, 2025 06:15
@rueian rueian changed the title [core][autoscaler][IPPR] Initial impl for resizing pods in-place to the maximum configured by the user [core][autoscaler][IPPR] Initial implementation for resizing pods in-place to the maximum configured by the user Aug 27, 2025
# TODO(scv119) reenable grpcio once https://github.com/grpc/grpc/issues/31885 is fixed.
# TODO(scv119) reenable jsonschema once https://github.com/ray-project/ray/issues/33411 is fixed.
DEPS=(requests protobuf pytest-httpserver==1.1.3)
DEPS=(requests protobuf pytest-httpserver==1.1.3 grpcio==1.74.0 jsonschema==4.23.0)
Contributor Author

The IPPR implementation needs grpcio and jsonschema. The jsonschema issue has been closed, and the grpcio issue was fixed in 1.74, according to grpc/grpc#31885 (comment).

DOC001: Method `__init__` Potential formatting errors in docstring. Error message: No specification for "Args": ""
DOC001: Function/method `__init__`: Potential formatting errors in docstring. Error message: No specification for "Args": "" (Note: DOC001 could trigger other unrelated violations under this function/method too. Please fix the docstring formatting first.)
DOC101: Method `KubeRayProvider.__init__`: Docstring contains fewer arguments than in function signature.
DOC103: Method `KubeRayProvider.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [cluster_name: str, k8s_api_client: Optional[IKubernetesHttpApiClient], provider_config: Dict[str, Any]].
Contributor Author

fix lint.

url,
json.dumps(payload),
headers={**headers, "Content-type": "application/json-patch+json"},
headers={**headers, "Content-type": content_type},
Contributor Author

Make the Content-Type adjustable for different patch strategies.
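
For context, the Kubernetes API expects a different Content-Type per patch strategy; these media types are standard, while the call site below is illustrative:

# Standard Kubernetes patch media types; the call below is illustrative only.
JSON_PATCH = "application/json-patch+json"        # RFC 6902 list of operations
MERGE_PATCH = "application/merge-patch+json"      # RFC 7386 partial object
STRATEGIC_MERGE_PATCH = "application/strategic-merge-patch+json"  # K8s-specific

# e.g., a pod resize issued as a JSON patch:
# patch(url, json.dumps(payload), content_type=JSON_PATCH)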

self._ray_cluster = None
self._cached_instances: Dict[CloudInstanceId, CloudInstance]
self._ippr_provider = KubeRayIPPRProvider(
    gcs_client=gcs_client, k8s_api_client=self._k8s_api_client
)
Contributor Author

The KubeRayIPPRProvider needs a gcs_client to look up the address and port of each Raylet, and a k8s_api_client to patch pods.

@rueian rueian marked this pull request as ready for review August 28, 2025 16:07
@rueian rueian requested a review from a team as a code owner August 28, 2025 16:07
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core kubernetes labels Aug 28, 2025
@jackfrancis
Contributor

I validated this on a 3-node (16 CPU cores each) cluster in Azure:

$ kubectl get pods -o='custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CPU:.spec.containers[0].resources.limits.cpu,CPU:.spec.containers[0].resources.requests.cpu' -w
NAMESPACE   NAME                                STATUS    CPU    CPU
default     kuberay-operator-79947594b8-zbklb   Running   100m   100m
default     tpch-q1-sf-10-mtqx7-head-84n2m      Running   2      250m
default     tpch-q1-sf-10-mtqx7-head-84n2m      Running   2      250m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Running   1      500m
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13

This used a ray-operator image built from this PR's commit. You can see above the three long-lived worker pods, with their CPU requests/limits increasing over time (without any lifecycle event creating a new pod).

tl;dr IPPR confirmed

@edoakes
Collaborator

edoakes commented Sep 5, 2025

Still in my review queue, sorry I haven't gotten to it yet (it's a big one!)

@jjyao can you help review as well? I need to re-read a lot of autoscaler code.

@rueian
Contributor Author

rueian commented Sep 5, 2025

Hi @edoakes @jjyao, the previous feedback I got was to replace grpcio with Cython bindings, which I'm currently working on. I'm also writing a new autoscaler document that should help walk through the autoscaler code. So this PR isn't in a hurry this week, but early feedback would be really appreciated. 😃

@edoakes
Collaborator

edoakes commented Sep 5, 2025

> Hi @edoakes @jjyao, the previous feedback I got was to replace grpcio with Cython bindings, which I'm currently working on. I'm also writing a new autoscaler document that should help walk through the autoscaler code. So this PR isn't in a hurry this week, but early feedback would be really appreciated. 😃

Sounds good. I will do a quick scan then and hold off to dive into the details.

@jackfrancis
Contributor

@edoakes @jjyao @rueian anything I can do to help move this forward?

cc @marosset

@jackfrancis
Contributor

PSA: in-place pod resize is planned to graduate to GA in v1.35.0: kubernetes/enhancements#5562


github-actions bot commented Oct 8, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 8, 2025
@jackfrancis
Contributor

@jjyao @edoakes bumping this to undo stale status

@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Oct 10, 2025
@edoakes
Collaborator

edoakes commented Oct 10, 2025

@jackfrancis we are planning to pick this back up in the next month or so. It requires core changes to work end to end, and @rueian is driving the project but is finishing the last semester of his master's project :)

@rueian
Contributor Author

rueian commented Oct 10, 2025

Hi all, I will continue working on this starting next week: rebasing and resolving conflicts!


@rueian rueian force-pushed the autoscaler-ippr branch 2 times, most recently from 54e5034 to 73b9cef Compare October 19, 2025 05:49

…he maximum configured by the user

Signed-off-by: Rueian <[email protected]>
match = re.search(
r"Node didn't have enough capacity: (cpu|memory), requested: (\d+), ()capacity: (\d+)",
ippr_status.resized_message,
)

Bug: Pod Resize Regex Fails Capacity Capture

The regex pattern for "infeasible" pod resize messages includes an empty capture group, causing match.group(3) to be empty and match.group(4) to incorrectly capture capacity. This impacts the calculation of suggested maximum resources.
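
One way to fix this is to drop the stray empty group so that group(3) captures the capacity; a minimal demonstration (the sample message below is made up):

import re

# Same pattern as above, minus the empty () group before "capacity".
pattern = (
    r"Node didn't have enough capacity: (cpu|memory), "
    r"requested: (\d+), capacity: (\d+)"
)
msg = "Node didn't have enough capacity: cpu, requested: 14000, capacity: 8000"
m = re.search(pattern, msg)
assert m is not None
assert m.group(3) == "8000"  # with the buggy pattern, group(3) was empty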


