[core][autoscaler][IPPR] Initial implementation for resizing pods in-place to the maximum configured by the user #55961
base: master
Conversation
Force-pushed from 07edb0f to cd521c0.
Pull Request Overview
This PR introduces initial support for In-Place Pod Resize (IPPR) functionality in the Ray autoscaler for KubeRay clusters. IPPR allows pods to be resized without termination, improving resource utilization and reducing scheduling overhead by dynamically adjusting CPU and memory allocations based on demand.
Key changes:
- Adds IPPR schema validation and typed data structures for group specifications and pod status tracking
- Implements IPPR provider for KubeRay to handle resize requests and synchronization with Raylets
- Integrates IPPR logic into the resource demand scheduler to prefer in-place resizing over launching new nodes
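As a rough illustration of the scheduling preference described in the last bullet (a toy node model with illustrative names, not the PR's actual scheduler types):

```python
from typing import Dict, List, Tuple

def plan(pending_cpu: List[float], nodes: List[dict], max_cpu: Dict[str, float]) -> Tuple[Dict[str, float], List[float]]:
    """Toy planner: prefer resizing an existing pod up to its group's max-cpu
    over launching a new node (illustrative only, not the PR's code)."""
    resize: Dict[str, float] = {}  # pod name -> cpu to resize to
    launch: List[float] = []       # cpu shapes of new nodes to launch
    for needed in pending_cpu:
        candidate = next(
            (n for n in nodes
             if n["group"] in max_cpu
             and max_cpu[n["group"]] - n["cpu_used"] >= needed),
            None,
        )
        if candidate is None:
            launch.append(needed)
            continue
        # The initial implementation resizes to the configured maximum in one step.
        resize[candidate["name"]] = max_cpu[candidate["group"]]
        candidate["cpu_used"] += needed
    return resize, launch

# Example: an 8-CPU worker already using 7 CPUs, group max-cpu is 14 -> resize, not launch.
nodes = [{"name": "small-group-worker-m8675", "group": "small-group", "cpu_used": 7.0}]
print(plan([6.0], nodes, {"small-group": 14.0}))  # ({'small-group-worker-m8675': 14.0}, [])
```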
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/ray/autoscaler/v2/schema.py | Defines IPPR data structures including IPPRSpecs, IPPRGroupSpec, and IPPRStatus |
| python/ray/autoscaler/v2/scheduler.py | Integrates IPPR into scheduling logic to consider resizing existing pods before launching new ones |
| python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/ippr_provider.py | New provider implementing IPPR operations including validation, pod resizing, and Raylet synchronization |
| python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py | Integrates IPPR provider into KubeRay cloud provider |
| python/ray/autoscaler/v2/instance_manager/reconciler.py | Connects IPPR functionality to the main autoscaler reconciliation loop |
| python/ray/autoscaler/v2/tests/test_ippr_provider.py | Comprehensive test suite for IPPR provider functionality |
| python/ray/autoscaler/v2/tests/test_scheduler.py | Tests for IPPR integration in the scheduler |
Force-pushed from cd521c0 to 05b44b9.
Force-pushed from f661d9b to 388ba37.
Force-pushed from f36b114 to c11febf.
# TODO(scv119) reenable grpcio once https://github.com/grpc/grpc/issues/31885 is fixed.
# TODO(scv119) reenable jsonschema once https://github.com/ray-project/ray/issues/33411 is fixed.
DEPS=(requests protobuf pytest-httpserver==1.1.3)
DEPS=(requests protobuf pytest-httpserver==1.1.3 grpcio==1.74.0 jsonschema==4.23.0)
The IPPR implementation needs grpcio and jsonschema. The jsonschema issue has been closed, and the grpcio issue has been fixed as of 1.74, according to grpc/grpc#31885 (comment).
DOC001: Method `__init__` Potential formatting errors in docstring. Error message: No specification for "Args": ""
DOC001: Function/method `__init__`: Potential formatting errors in docstring. Error message: No specification for "Args": "" (Note: DOC001 could trigger other unrelated violations under this function/method too. Please fix the docstring formatting first.)
DOC101: Method `KubeRayProvider.__init__`: Docstring contains fewer arguments than in function signature.
DOC103: Method `KubeRayProvider.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [cluster_name: str, k8s_api_client: Optional[IKubernetesHttpApiClient], provider_config: Dict[str, Any]].
fix lint.
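A hedged sketch of a Google-style docstring that would satisfy the DOC101/DOC103 complaints, assuming the argument names reported by the linter (the parameter ordering, defaults, and the stand-in interface here are guesses, not the PR's actual code):

```python
from typing import Any, Dict, Optional

class IKubernetesHttpApiClient:  # stand-in for the interface named in the lint output
    ...

class KubeRayProvider:
    def __init__(
        self,
        cluster_name: str,
        provider_config: Dict[str, Any],
        k8s_api_client: Optional[IKubernetesHttpApiClient] = None,
    ):
        """Initialize the KubeRay cloud provider.

        Args:
            cluster_name: Name of the RayCluster custom resource to manage.
            provider_config: The provider section of the autoscaler config.
            k8s_api_client: Optional Kubernetes HTTP API client; a default
                client is constructed when this is not provided.
        """
        self._cluster_name = cluster_name
        self._provider_config = provider_config
        self._k8s_api_client = k8s_api_client
```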
url,
json.dumps(payload),
headers={**headers, "Content-type": "application/json-patch+json"},
headers={**headers, "Content-type": content_type},
Make content-type adjustable for different patch strategies.
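For context, these are the standard Kubernetes patch media types, with a minimal sketch of a request builder that takes the content type as a parameter (the helper itself is hypothetical, not this PR's API):

```python
import json
from typing import Any, Dict, List, Optional, Tuple, Union

# Standard Kubernetes patch content types; which one to use depends on the patch strategy.
JSON_PATCH = "application/json-patch+json"                    # RFC 6902 list of operations
MERGE_PATCH = "application/merge-patch+json"                  # RFC 7386 partial object
STRATEGIC_MERGE_PATCH = "application/strategic-merge-patch+json"

def build_patch(
    url: str,
    payload: Union[List[Dict[str, Any]], Dict[str, Any]],
    content_type: str = JSON_PATCH,
    headers: Optional[Dict[str, str]] = None,
) -> Tuple[str, str, Dict[str, str]]:
    """Return (url, body, headers) for a PATCH call using the given strategy."""
    merged = {**(headers or {}), "Content-type": content_type}
    return url, json.dumps(payload), merged

# Example: a JSON-patch style resize of the first container's CPU request/limit.
ops = [
    {"op": "replace", "path": "/spec/containers/0/resources/requests/cpu", "value": "14"},
    {"op": "replace", "path": "/spec/containers/0/resources/limits/cpu", "value": "14"},
]
print(build_patch("/api/v1/namespaces/default/pods/worker-m8675", ops))
```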
self._ray_cluster = None
self._cached_instances: Dict[CloudInstanceId, CloudInstance]
self._ippr_provider = KubeRayIPPRProvider(
    gcs_client=gcs_client, k8s_api_client=self._k8s_api_client
)
The KubeRayIPPRProvider needs a gcs_client to query the port and the address of a Raylet, and it also needs a k8s_api_client to patch pods.
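A rough skeleton of how those two dependencies could be wired, per the comment above; the method names on both clients are placeholders rather than the real Ray GCS or KubeRay client APIs:

```python
from typing import Any, Dict, List

class KubeRayIPPRProviderSketch:
    """Illustrative only: resize Ray pods in place and sync Raylet logical resources."""

    def __init__(self, gcs_client: Any, k8s_api_client: Any):
        # gcs_client: used to look up the address/port of the Raylet on a resized pod.
        # k8s_api_client: used to PATCH pods (resize requests and status annotations).
        self._gcs = gcs_client
        self._k8s = k8s_api_client

    def request_resize(self, pod_name: str, patch_ops: List[Dict[str, Any]]) -> None:
        # Placeholder call shape; the real client patches via the k8s HTTP API.
        self._k8s.patch(f"pods/{pod_name}", patch_ops)

    def sync_with_raylet(self, node_id: str, new_resources: Dict[str, float]) -> None:
        # Placeholder flow: resolve the Raylet address through GCS, then issue the
        # RPC that adjusts the node's logical resources to match the resized pod.
        address = self._gcs.get_node_address(node_id)   # hypothetical lookup
        self._send_resize_rpc(address, new_resources)   # hypothetical RPC wrapper

    def _send_resize_rpc(self, address: str, new_resources: Dict[str, float]) -> None:
        ...
```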
I validated this on a 3 node (16 CPU cores each) cluster in Azure:

$ kubectl get pods -o='custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CPU:.spec.containers[0].resources.limits.cpu,CPU:.spec.containers[0].resources.requests.cpu' -w
NAMESPACE NAME STATUS CPU CPU
default kuberay-operator-79947594b8-zbklb Running 100m 100m
default tpch-q1-sf-10-mtqx7-head-84n2m Running 2 250m
default tpch-q1-sf-10-mtqx7-head-84n2m Running 2 250m
default tpch-q1-sf-10-jw224 Pending 1 500m
default tpch-q1-sf-10-jw224 Pending 1 500m
default tpch-q1-sf-10-jw224 Pending 1 500m
default tpch-q1-sf-10-jw224 Pending 1 500m
default tpch-q1-sf-10-jw224 Running 1 500m
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-m8675 Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Pending 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 8 7
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-6s6qf Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13
default tpch-q1-sf-10-mtqx7-small-group-worker-kfzvj Running 14 13

This was run using a ray-operator image built from this PR commit. You can see above the 3 long-lived pods and their CPU requests/limits values increasing over time (without a lifecycle event creating a new pod). tl;dr: IPPR confirmed.
Still in my review queue, sorry I haven't gotten to it yet (it's a big one!). @jjyao can you help review as well? I need to re-read a lot of autoscaler code.

Hi @edoakes @jjyao, the previous feedback I got is to replace …

Sounds good. I will do a quick scan then and hold off on diving into the details.

PSA: in-place pod resize is planned for graduation to GA in Kubernetes v1.35.0: kubernetes/enhancements#5562

This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@jackfrancis we are planning to pick this back up in the next month or so. It requires core changes to work end to end, and @rueian is driving the project but is finishing the last semester of his master's project :)

Hi all, I will continue to work on this starting next week, rebasing and resolving conflicts!
Force-pushed from d4c46f4 to def88f1.
Force-pushed from 54e5034 to 73b9cef.
Force-pushed from 73b9cef to 6aa0506.
Commit: "…he maximum configured by the user" (Signed-off-by: Rueian <[email protected]>)
Force-pushed from 6aa0506 to be300eb.
match = re.search(
    r"Node didn't have enough capacity: (cpu|memory), requested: (\d+), ()capacity: (\d+)",
    ippr_status.resized_message,
)
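For reference, a small check of what this pattern extracts from a resize-failure message; the sample message below is shaped to match the pattern, not verbatim kubelet output:

```python
import re

# Pattern copied from the snippet above; the empty group () matches nothing and
# only shifts the capture-group indices.
pattern = re.compile(
    r"Node didn't have enough capacity: (cpu|memory), requested: (\d+), ()capacity: (\d+)"
)

msg = "Node didn't have enough capacity: cpu, requested: 16000, capacity: 15000"
m = pattern.search(msg)
if m:
    resource, requested, _, capacity = m.groups()
    print(resource, int(requested), int(capacity))  # cpu 16000 15000
```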
In-place Pod Resizing (IPPR) Integration

Things that have been done:

Configuration and Validation

Users can provide a `ray.io/ippr` annotation on their RayCluster CR to enable IPPR with Autoscaler v2 (a sketch of a possible annotation shape follows the list below). `groupName` should match the names of Ray worker groups. In each group, `max-cpu`, `max-memory`, and `resize-timeout` are mandatory.

Besides the above configuration, we also validate that IPPR-enabled worker groups:
- do not set `num-cpus` or `memory` in their `rayStartParams`, because they can cause Ray logical resources to mismatch the pod resources;
- have `cpu` and `memory` resource requests specified in their container specs;
- have `resizePolicy.restartPolicy` set to `NotRequired`.
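As referenced above, a rough sketch of what the annotation value could look like given the required fields; the exact key names, nesting, and units are assumptions rather than the schema from this PR:

```python
import json

# Hypothetical value for the `ray.io/ippr` annotation: one entry per worker group,
# each carrying the mandatory max-cpu, max-memory, and resize-timeout fields.
ippr_specs = {
    "small-group": {
        "max-cpu": 14,           # cores the pod may grow to in place
        "max-memory": "32Gi",    # memory the pod may grow to in place
        "resize-timeout": 120,   # seconds to wait before adjusting/retrying a resize
    },
}

# On the RayCluster CR, the annotation value is stored as a string:
metadata = {"annotations": {"ray.io/ippr": json.dumps(ippr_specs)}}
print(metadata["annotations"]["ray.io/ippr"])
```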
Resize Behavior

The current implementation will try to resize the existing nodes to the maximum specified by the user in one step if there are pending tasks that can fit on those nodes after resizing. We will implement gradual resizing and downsizing in later PRs. The detailed behavior is:

1. The autoscaler reads the per-group limits from the `ray.io/ippr` annotation. If there are remaining pending tasks that can be fit on a node, the autoscaler will send its k8s resize request and record the resize status in a pod annotation, `ray.io/ippr-status`, at the end of the current reconciliation.
2. In a subsequent reconciliation, the autoscaler will either:
   a) finish the resize by adjusting the logical resources on the Raylet and update its `ray.io/ippr-status`, or
   b) adjust the resize by queueing a new k8s resize request due to a timeout or an error.

Note that if the RPC to adjust the logical resources on the Raylet fails, the autoscaler will retry again in the next reconciliation because it doesn't update the corresponding `ray.io/ippr-status`.
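A condensed, self-contained sketch of that two-phase flow; the dict-based pod, the status fields, and the return values here are stand-ins, not the PR's actual data structures:

```python
import time

def reconcile_ippr(pod: dict, raylet_rpc_ok: bool, now: float, timeout_s: float) -> str:
    """One reconciliation pass for a single pod (illustrative only)."""
    status = pod["annotations"].get("ray.io/ippr-status")

    if status is None:
        # Phase 1: a resize was decided this round. The k8s resize PATCH would be
        # sent here, and the decision is recorded so later passes can track it.
        pod["annotations"]["ray.io/ippr-status"] = {"state": "resizing", "since": now}
        return "resize-requested"

    if status["state"] == "resizing":
        if pod["resources_applied"]:
            # Phase 2a: k8s applied the resize; adjust the Raylet's logical resources.
            if raylet_rpc_ok:
                status["state"] = "done"
                return "raylet-updated"
            # RPC failed: leave the annotation untouched so the next pass retries.
            return "retry-next-reconciliation"
        if now - status["since"] > timeout_s:
            # Phase 2b: the resize timed out; queue a new (adjusted) k8s resize request.
            status["since"] = now
            return "resize-adjusted"
    return "waiting"

pod = {"annotations": {}, "resources_applied": False}
print(reconcile_ippr(pod, True, time.time(), 120))  # resize-requested
```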
Additional notes
Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.