
Conversation

rueian
Contributor

@rueian rueian commented Aug 26, 2025

In-place Pod Resizing (IPPR) Integration

Things that have been done:

  1. IPPR JSON configuration and validation, letting users enable the IPPR integration with Autoscaler v2.
  2. Resizing a pod's CPU and memory requests and limits to the configured maximums in a single step.

Configuration and Validation

Users can provide a ray.io/ippr annotation on their RayCluster CR to enable IPPR with Autoscaler v2:

{
  "groups": {
    "<groupName>": {
      "max-cpu":     string|number,  # K8s quantity (e.g. "2", "1500m")
      "max-memory":  string|integer, # K8s quantity (e.g. "8Gi", 2147483648)
      "resize-timeout": integer      # Seconds to wait for a pod resize to
                                     # complete before considering it timed out
    },
    ...
  }
}

Each <groupName> must match the name of a Ray worker group, and within each group, max-cpu, max-memory, and resize-timeout are all mandatory.
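
For example, a worker group named small-group capped at 14 CPUs and 32Gi of memory would carry an annotation like the following (the values here are illustrative):

{
  "groups": {
    "small-group": {
      "max-cpu": "14",
      "max-memory": "32Gi",
      "resize-timeout": 60
    }
  }
}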

Besides the above configuration, we also validate the following (sketched in code after the list):

  1. The corresponding worker groups don't set num-cpus or memory in their rayStartParams, since those would let Ray's logical resources drift out of sync with the pod's actual resources.
  2. Worker groups have cpu and memory resource requests specified in their container specs.
  3. The container's resizePolicy sets restartPolicy to NotRequired, so a resize doesn't restart the container.
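
A rough Python sketch of these checks (illustrative only, not the PR's actual implementation):

# Illustrative sketch of the validation rules above; the helper name and
# data shapes are assumed, not the PR's actual code.
def validate_ippr_group(group_name: str, spec: dict, worker_group: dict) -> None:
    for key in ("max-cpu", "max-memory", "resize-timeout"):
        if key not in spec:
            raise ValueError(f"{group_name}: '{key}' is mandatory")

    ray_start_params = worker_group.get("rayStartParams", {})
    for param in ("num-cpus", "memory"):
        if param in ray_start_params:
            raise ValueError(
                f"{group_name}: remove '{param}' from rayStartParams; it can "
                "desync Ray's logical resources from the pod's resources"
            )

    container = worker_group["template"]["spec"]["containers"][0]
    requests = container.get("resources", {}).get("requests", {})
    if "cpu" not in requests or "memory" not in requests:
        raise ValueError(f"{group_name}: cpu and memory requests are required")

    # K8s resizePolicy is a list of {resourceName, restartPolicy} entries.
    policies = {
        p["resourceName"]: p["restartPolicy"]
        for p in container.get("resizePolicy", [])
    }
    if any(policies.get(r) != "NotRequired" for r in ("cpu", "memory")):
        raise ValueError(
            f"{group_name}: resizePolicy.restartPolicy must be NotRequired"
        )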

Resize Behavior

The current implementation tries to resize existing nodes to the user-specified maximums in one step whenever there are pending tasks that would fit on those nodes after resizing. Gradual resizing and downsizing will follow in later PRs. The detailed behavior (modeled in the sketch after the list) is:

  1. After packing pending tasks onto the existing nodes at their current capacities, the autoscaler tries to fit the remaining pending tasks onto the nodes that have no resize in flight, this time using the maximum capacities specified in the ray.io/ippr annotation. For each node that can take some of the remaining tasks this way, the autoscaler sends a k8s resize request and records the resize status in a pod annotation, ray.io/ippr-status, at the end of the current reconciliation.
  2. If pending tasks still remain, the autoscaler falls back to the original horizontal scale-out, taking the maximum capacity of each worker type into account.
  3. At the beginning of the next reconciliation, the autoscaler determines the next step for each resize sent at the end of the previous reconciliation by inspecting its status. There are two cases:
    a) Finish the resize by adjusting the logical resources on the Raylet and updating its ray.io/ippr-status.
    b) Adjust the resize by queueing a new k8s resize request after a timeout or an error.
    Note that if the RPC that adjusts the logical resources on the Raylet fails, the autoscaler retries in the next reconciliation, since the corresponding ray.io/ippr-status is left unchanged.
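
As a self-contained toy model of steps 1 and 2 (step 3's status handling is omitted, and all names and data shapes are hypothetical rather than the PR's actual scheduler code):

from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    cpu: float                  # current CPU capacity
    max_cpu: float              # max-cpu from the ray.io/ippr annotation
    used: float = 0.0
    resize_pending: bool = False

def fit(tasks: List[float], nodes: List[Node], use_max: bool) -> List[float]:
    """First-fit task CPU demands onto nodes; in max mode, exceeding the
    current capacity queues an in-place resize instead of a new pod."""
    remaining = []
    for cpu in tasks:
        placed = False
        for node in nodes:
            if use_max and node.resize_pending:
                continue  # step 1 skips nodes with a resize in flight
            cap = node.max_cpu if use_max else node.cpu
            if node.used + cpu <= cap:
                node.used += cpu
                if use_max and node.used > node.cpu:
                    # would send a k8s resize request and record it in the
                    # ray.io/ippr-status pod annotation at end of the pass
                    node.resize_pending = True
                placed = True
                break
        if not placed:
            remaining.append(cpu)
    return remaining

def reconcile(tasks: List[float], nodes: List[Node]) -> int:
    remaining = fit(tasks, nodes, use_max=False)     # current capacities
    remaining = fit(remaining, nodes, use_max=True)  # retry at max capacities
    return len(remaining)  # step 2: leftovers trigger horizontal scale-out

nodes = [Node(cpu=8, max_cpu=14)]
print(reconcile([7.0, 6.0, 20.0], nodes))  # -> 1 (the 20-CPU task scales out)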

Additional notes

  1. IPPR is Kubernetes-specific right now. We may need to revisit the current IPPRSpecs and IPPRStatus structures that are transferred between the scheduler and providers when a similar resizing capability comes to VMs.
  2. The current implementation uses grpcio to connect to the Raylet. We may want to switch to a Cython binding instead.
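
For reference, a per-group spec presumably mirrors the annotation fields; a minimal sketch (field names assumed, not necessarily the PR's exact definition in schema.py):

from dataclasses import dataclass

@dataclass
class IPPRGroupSpec:
    # Mirrors the ray.io/ippr annotation fields for one worker group.
    # Field names here are assumed for illustration.
    max_cpu: str         # K8s quantity, e.g. "14" or "1500m"
    max_memory: str      # K8s quantity, e.g. "32Gi"
    resize_timeout: int  # seconds before a pending resize is considered timed out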

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the autoscaler-ippr branch 3 times, most recently from 07edb0f to cd521c0 Compare August 26, 2025 20:55
@rueian rueian requested a review from Copilot August 27, 2025 00:10
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces initial support for In-Place Pod Resize (IPPR) functionality in the Ray autoscaler for KubeRay clusters. IPPR allows pods to be resized without termination, improving resource utilization and reducing scheduling overhead by dynamically adjusting CPU and memory allocations based on demand.

Key changes:

  • Adds IPPR schema validation and typed data structures for group specifications and pod status tracking
  • Implements IPPR provider for KubeRay to handle resize requests and synchronization with Raylets
  • Integrates IPPR logic into the resource demand scheduler to prefer in-place resizing over launching new nodes

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:

  • python/ray/autoscaler/v2/schema.py: Defines IPPR data structures including IPPRSpecs, IPPRGroupSpec, and IPPRStatus
  • python/ray/autoscaler/v2/scheduler.py: Integrates IPPR into the scheduling logic to consider resizing existing pods before launching new ones
  • python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/ippr_provider.py: New provider implementing IPPR operations including validation, pod resizing, and Raylet synchronization
  • python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py: Integrates the IPPR provider into the KubeRay cloud provider
  • python/ray/autoscaler/v2/instance_manager/reconciler.py: Connects IPPR functionality to the main autoscaler reconciliation loop
  • python/ray/autoscaler/v2/tests/test_ippr_provider.py: Comprehensive test suite for IPPR provider functionality
  • python/ray/autoscaler/v2/tests/test_scheduler.py: Tests for IPPR integration in the scheduler


@rueian rueian changed the title [core][autoscaler][IPPR] Initial impl for resizing Pods in-place to the maximum configured by the user [core][autoscaler][IPPR] Initial impl for resizing pods in-place to the maximum configured by the user Aug 27, 2025
@rueian rueian added the go add ONLY when ready to merge, run all tests label Aug 27, 2025
@rueian rueian force-pushed the autoscaler-ippr branch 4 times, most recently from f661d9b to 388ba37 Compare August 27, 2025 06:15
@rueian rueian changed the title [core][autoscaler][IPPR] Initial impl for resizing pods in-place to the maximum configured by the user [core][autoscaler][IPPR] Initial implementation for resizing pods in-place to the maximum configured by the user Aug 27, 2025
# TODO(scv119) reenable grpcio once https://github.com/grpc/grpc/issues/31885 is fixed.
# TODO(scv119) reenable jsonschema once https://github.com/ray-project/ray/issues/33411 is fixed.
DEPS=(requests protobuf pytest-httpserver==1.1.3)
DEPS=(requests protobuf pytest-httpserver==1.1.3 grpcio==1.74.0 jsonschema==4.23.0)
Contributor Author

The IPPR implementation needs grpcio and jsonschema. The jsonschema issue has been closed, and the grpcio issue was fixed in 1.74, according to grpc/grpc#31885 (comment).

DOC001: Method `__init__` Potential formatting errors in docstring. Error message: No specification for "Args": ""
DOC001: Function/method `__init__`: Potential formatting errors in docstring. Error message: No specification for "Args": "" (Note: DOC001 could trigger other unrelated violations under this function/method too. Please fix the docstring formatting first.)
DOC101: Method `KubeRayProvider.__init__`: Docstring contains fewer arguments than in function signature.
DOC103: Method `KubeRayProvider.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [cluster_name: str, k8s_api_client: Optional[IKubernetesHttpApiClient], provider_config: Dict[str, Any]].
Contributor Author

fix lint.

url,
json.dumps(payload),
headers={**headers, "Content-type": "application/json-patch+json"},
headers={**headers, "Content-type": content_type},
Contributor Author

Make the Content-Type adjustable for different patch strategies.
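
For context, the Kubernetes API expects a different Content-Type per patch strategy; these media types are standard, while the call site below is illustrative:

# Standard Kubernetes patch media types; the call below is illustrative only.
JSON_PATCH = "application/json-patch+json"        # RFC 6902 list of operations
MERGE_PATCH = "application/merge-patch+json"      # RFC 7386 partial object
STRATEGIC_MERGE_PATCH = "application/strategic-merge-patch+json"  # K8s-specific

# e.g., a pod resize issued as a JSON patch:
# patch(url, json.dumps(payload), content_type=JSON_PATCH)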

self._ray_cluster = None
self._cached_instances: Dict[CloudInstanceId, CloudInstance]
self._ippr_provider = KubeRayIPPRProvider(
    gcs_client=gcs_client, k8s_api_client=self._k8s_api_client
)
Contributor Author

The KubeRayIPPRProvider needs a gcs_client to look up the address and port of each Raylet, and a k8s_api_client to patch pods.

@rueian rueian marked this pull request as ready for review August 28, 2025 16:07
@rueian rueian requested a review from a team as a code owner August 28, 2025 16:07
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core kubernetes labels Aug 28, 2025
@jackfrancis
Contributor

I validated this on a 3-node (16 CPU cores each) cluster in Azure:

$ kubectl get pods -o='custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CPU:.spec.containers[0].resources.limits.cpu,CPU:.spec.containers[0].resources.requests.cpu' -w
NAMESPACE   NAME                                STATUS    CPU    CPU
default     kuberay-operator-79947594b8-zbklb   Running   100m   100m
default     tpch-q1-sf-10-mtqx7-head-84n2m      Running   2      250m
default     tpch-q1-sf-10-mtqx7-head-84n2m      Running   2      250m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Pending   1      500m
default     tpch-q1-sf-10-jw224                 Running   1      500m
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-m8675   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Pending   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   8      7
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13
default     tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj   Running   14     13

This used a ray-operator image built from this PR's commit. You can see above the three long-lived worker pods, with their CPU requests/limits increasing over time (without any lifecycle event creating a new pod).

tl;dr IPPR confirmed

@edoakes
Collaborator

edoakes commented Sep 5, 2025

Still in my review queue, sorry I haven't gotten to it yet (it's a big one!)

@jjyao can you help review as well? I need to re-read a lot of autoscaler code.

@rueian
Contributor Author

rueian commented Sep 5, 2025

Hi @edoakes @jjyao, the previous feedback I got was to replace grpcio with Cython bindings, which I'm currently working on. I'm also writing a new autoscaler document that should help walk through the autoscaler code. So this PR isn't in a hurry this week, but early feedback would be really appreciated. 😃

@edoakes
Collaborator

edoakes commented Sep 5, 2025

> Hi @edoakes @jjyao, the previous feedback I got was to replace grpcio with Cython bindings, which I'm currently working on. I'm also writing a new autoscaler document that should help walk through the autoscaler code. So this PR isn't in a hurry this week, but early feedback would be really appreciated. 😃

Sounds good. I will do a quick scan then and hold off to dive into the details.

@jackfrancis
Contributor

@edoakes @jjyao @rueian anything I can do to help move this forward?

cc @marosset

@jackfrancis
Contributor

PSA: in-place pod resize is planned to graduate to GA in v1.35.0: kubernetes/enhancements#5562


github-actions bot commented Oct 8, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 8, 2025
@jackfrancis
Contributor

@jjyao @edoakes bumping this to undo stale status

@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Oct 10, 2025
@edoakes
Collaborator

edoakes commented Oct 10, 2025

@jackfrancis we are planning to pick this back up in the next month or so. It requires core changes to work end to end, and @rueian is driving the project but is finishing the last semester of his master's project :)

@rueian
Contributor Author

rueian commented Oct 10, 2025

Hi all, I will continue working on this starting next week: rebasing and resolving conflicts!


@rueian rueian force-pushed the autoscaler-ippr branch 2 times, most recently from 54e5034 to 73b9cef Compare October 19, 2025 05:49

…he maximum configured by the user

Signed-off-by: Rueian <[email protected]>
match = re.search(
r"Node didn't have enough capacity: (cpu|memory), requested: (\d+), ()capacity: (\d+)",
ippr_status.resized_message,
)

Bug: Pod Resize Regex Fails Capacity Capture

The regex pattern for "infeasible" pod resize messages includes an empty capture group, causing match.group(3) to be empty and match.group(4) to incorrectly capture capacity. This impacts the calculation of suggested maximum resources.
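
One way to fix this is to drop the stray empty group so that group(3) captures the capacity; a minimal demonstration (the sample message below is made up):

import re

# Same pattern as above, minus the empty () group before "capacity".
pattern = (
    r"Node didn't have enough capacity: (cpu|memory), "
    r"requested: (\d+), capacity: (\d+)"
)
msg = "Node didn't have enough capacity: cpu, requested: 14000, capacity: 8000"
m = re.search(pattern, msg)
assert m is not None
assert m.group(3) == "8000"  # with the buggy pattern, group(3) was empty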


