
Conversation

rueian
Contributor

@rueian rueian commented Jul 28, 2025

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @rueian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a Proof of Concept (PoC) for in-place resource provisioning and resizing within the Ray cluster, aiming to enable dynamic adjustment of resources on existing nodes. This capability allows for more flexible and efficient resource management without requiring node restarts or additions/removals. The changes involve adding new RPC capabilities to Raylets, enhancing the Global Control Store (GCS) to propagate these resource changes, and improving the consistency of GCS client subscriptions.

Highlights

  • Dynamic Resource Resizing RPC: A new RPC endpoint, ResizeLocalResourceInstances, has been added to the Raylet's NodeManagerService. This enables external components, such as an autoscaler, to dynamically request changes to a node's total CPU and memory resources.
  • Raylet-side Resource Management: The Raylet's NodeManager now implements the logic to handle these resource resize requests. It includes validation to prevent resizing of unit-instance resources (like GPUs) and to ensure that downsizing does not result in negative available resources. Upon successful resizing, it updates its internal resource scheduler and triggers task rescheduling.
  • GCS Notification of Resource Changes: The Global Control Store (GCS) NodeManager has been enhanced to detect and publish updates to a node's total resources. When a Raylet's resources are resized, this change is immediately broadcasted to the GCS, making the updated resource view available to subscribers.
  • GCS Client Subscription Consistency: A potential race condition in the GCS client's node information subscription mechanism has been addressed. This ensures that initial node data population and subsequent real-time updates are processed consistently, preventing data inconsistencies by queuing updates received before the initial data is fully processed.
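The downsize validation described in the Raylet-side highlight can be sketched roughly as follows. This is a minimal, stdlib-only illustration; `CanResize` and the map-based signature are hypothetical and not Ray's actual API, but the rule is the one stated above: a shrink request must be rejected if it would leave a resource with negative availability.

```cpp
#include <map>
#include <string>

// Sketch of the downsize check: for each resource in the requested new
// totals, compute how much is currently in use (total - available) and
// reject the resize if the new total would fall below that usage.
bool CanResize(const std::map<std::string, double> &current_total,
               const std::map<std::string, double> &current_available,
               const std::map<std::string, double> &new_total) {
  for (const auto &[name, new_value] : new_total) {
    auto total_it = current_total.find(name);
    if (total_it == current_total.end()) {
      continue;  // Brand-new resource: nothing can be in use yet.
    }
    double used = total_it->second - current_available.at(name);
    if (new_value < used) {
      return false;  // Downsizing below current usage would go negative.
    }
  }
  return true;
}
```

On success, the real implementation additionally updates the local resource scheduler and triggers task rescheduling, which this sketch omits.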
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a proof-of-concept for in-place resource resizing (IPPR) in the autoscaler. It adds a new RPC ResizeLocalResourceInstances to the NodeManagerService to dynamically adjust a node's resources. The changes include protobuf definitions for the new RPC, implementation in raylet/node_manager.cc to handle resource resizing, updates to gcs_node_manager.cc to propagate resource changes, and a fix for a race condition in gcs_client/accessor.cc during node info subscription. The new functionality is also well-tested.

The overall approach is sound. My feedback focuses on improving maintainability by reducing code duplication and enhancing logging for better observability.

Comment on lines 534 to 563
auto resources_total = maybe_node_info.value()->mutable_resources_total();
auto &new_resources = resource_view_sync_message.resources_total();
// Check if resources_total has changed
if (!MapEqual(*resources_total, new_resources)) {
  // Update total resources
  *resources_total = new_resources;
  // Publish the updated node info to notify subscribers
  auto status =
      gcs_publisher_->PublishNodeInfo(node_id, *maybe_node_info.value(), nullptr);
  if (!status.ok()) {
    RAY_LOG(WARNING).WithField(node_id)
        << "Failed to publish node info update: " << status.ToString();
  }
}

medium

To improve readability and avoid multiple calls to maybe_node_info.value(), it's better to get the GcsNodeInfo object once and reuse it.

  auto &node_info = *maybe_node_info.value();
  auto resources_total = node_info.mutable_resources_total();
  auto &new_resources = resource_view_sync_message.resources_total();
  // Check if resources_total has changed
  if (!MapEqual(*resources_total, new_resources)) {
    // Update total resources
    *resources_total = new_resources;
    // Publish the updated node info to notify subscribers
    auto status =
        gcs_publisher_->PublishNodeInfo(node_id, node_info, nullptr);
    if (!status.ok()) {
      RAY_LOG(WARNING).WithField(node_id)
          << "Failed to publish node info update: " << status.ToString();
    }
  }

Comment on lines 1972 to 1946
// Log the updated resources
RAY_LOG(INFO) << "Successfully resized local resources. Current Total resources:";
for (const auto &[resource_name, updated_value] : updated_total_map) {
  RAY_LOG(INFO) << "  " << resource_name << ": " << updated_value;
}
RAY_LOG(INFO) << "Available resources:";
for (const auto &[resource_name, updated_value] : updated_available_map) {
  RAY_LOG(INFO) << "  " << resource_name << ": " << updated_value;
}

medium

The current logging for updated resources iterates through each resource and logs it on a new line. This can be very verbose and spammy in logs, especially when a node has many resource types. Consider formatting the resource maps into a single, more compact log line for better log readability and parsing.

For example:

auto format_map = [](const auto& map) {
    std::stringstream ss;
    ss << "{";
    for (auto it = map.begin(); it != map.end(); ++it) {
        ss << it->first << ": " << it->second;
        if (std::next(it) != map.end()) {
            ss << ", ";
        }
    }
    ss << "}";
    return ss.str();
};
RAY_LOG(INFO) << "Successfully resized local resources. "
              << "Current Total: " << format_map(updated_total_map)
              << ", Available: " << format_map(updated_available_map);

@rueian rueian force-pushed the ippr-poc branch 10 times, most recently from 37b8077 to faa0ad1 Compare August 4, 2025 22:14
rueian and others added 11 commits August 12, 2025 13:52
Signed-off-by: Rueian <[email protected]>

[core][autoscaler] add HandleResizeLocalResourceInstances

Signed-off-by: Rueian <[email protected]>

[core][autoscaler] add HandleResizeLocalResourceInstances

Signed-off-by: Rueian <[email protected]>

[core][autoscaler] add HandleResizeLocalResourceInstances

Signed-off-by: Rueian <[email protected]>
@rueian rueian force-pushed the ippr-poc branch 3 times, most recently from b45d641 to 20205c4 Compare August 13, 2025 18:30
@rueian rueian force-pushed the ippr-poc branch 2 times, most recently from 1500719 to 963c18c Compare August 20, 2025 04:54
Signed-off-by: Rueian <[email protected]>
rueian added 12 commits August 20, 2025 19:20
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Copy link

github-actions bot commented Sep 9, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 9, 2025