
[Ray scheduling] The memory already used on the Worker Node needs to be taken into account when scheduling Ray tasks #45196

Open
yx367563 opened this issue May 8, 2024 · 0 comments
Assignees
Labels: core (Issues that should be addressed in Ray Core), core-scheduler, enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical)

Comments


yx367563 commented May 8, 2024

Description

Currently, when Ray schedules a task, it only considers the memory resources the user requested in the task's options.
If tasks do not specify a memory parameter, multiple memory-hungry tasks can be scheduled onto a single worker node, triggering an OOM. Even with the retry mechanism, there is no guarantee that the next scheduling attempt will pick a different worker node.
The scheduler therefore needs to account for the memory actually in use on each worker node.
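To illustrate the problem, here is a simplified toy model (not Ray's actual scheduler code; the `Node` shape and function names are illustrative assumptions): a feasibility check that looks only at declared requests will admit an unsized task onto a node whose real memory is nearly exhausted.

```python
# Toy model of request-only feasibility checking (NOT Ray's real
# scheduler; the Node fields here are illustrative assumptions).
from dataclasses import dataclass

@dataclass
class Node:
    capacity: int        # total memory (bytes)
    requested: int = 0   # memory reserved by tasks' declared requests
    used: int = 0        # memory actually in use by running tasks

def feasible_by_request(node: Node, task_request: int) -> bool:
    # The behavior described in the issue: only the declared request
    # counts, so a task that requests 0 bytes always "fits".
    return node.capacity - node.requested >= task_request

node = Node(capacity=8 * 2**30)
node.used = 7 * 2**30                 # processes really use 7 GiB...
print(feasible_by_request(node, 0))   # ...yet an unsized task still fits: True
```

Because `used` never enters the check, any number of such unsized, memory-hungry tasks can be packed onto this node until it OOMs.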

Use case

Users may not set the memory parameter when submitting a Ray task, or may not know in advance how much memory the task will consume at runtime. If scheduling relies only on the memory requested by the user, multiple tasks with high memory consumption are likely to land on a single worker node. After an OOM retry is triggered, the task may be scheduled back onto the original worker node, because its requested memory parameters have not changed.

Possible solution: take the memory already in use on the worker node into account when scheduling a task, and, on an OOM retry, inflate the task's memory request based on how much memory it was actually using before it was killed.
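A minimal sketch of that proposal (hypothetical helper names, not an actual Ray API): the feasibility check treats the larger of reserved and actually-used memory as unavailable, and the retry path grows the request from the observed peak usage.

```python
# Sketch of the proposed behavior (hypothetical helpers; not Ray APIs).
def feasible_by_usage(capacity: int, requested: int, used: int,
                      task_request: int) -> bool:
    # Treat the larger of reserved vs. actually-used memory as committed,
    # so a node that is physically full is never considered feasible.
    committed = max(requested, used)
    return capacity - committed >= task_request

def inflated_request(original_request: int, peak_usage: int,
                     headroom: float = 1.2) -> int:
    # On an OOM retry, base the new request on the task's observed peak
    # usage plus headroom (1.2x is an arbitrary illustrative factor),
    # so the retry lands on a node with enough free memory.
    return max(original_request, int(peak_usage * headroom))

cap = 8 * 2**30
# Node has 7 GiB really in use, so a 2 GiB task no longer "fits":
print(feasible_by_usage(cap, requested=0, used=7 * 2**30,
                        task_request=2 * 2**30))   # False
# A task that OOMed at 5 GiB retries with an inflated request:
print(inflated_request(0, peak_usage=5 * 2**30))   # 6 GiB
```

The `headroom` factor is a design knob: too small and the task may OOM again; too large and nodes are underutilized.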

@yx367563 yx367563 added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 8, 2024
@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label May 13, 2024
@jjyao jjyao added core-scheduler P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 13, 2024