Remove Placement Group on Train Run Abort #56011
Conversation
Signed-off-by: JasonLi1909 <[email protected]>
Code Review

This pull request correctly addresses a resource leak by ensuring placement groups are removed when a train run is aborted. The change adds a call to `_worker_group_state.shutdown()` in the `abort` method, which is the right approach.

My review includes two main points for improvement:

- After aborting, the `WorkerGroup` object is left in an inconsistent state. I've suggested adding a call to `_clear_state()` to resolve this.
- A `TODO` comment has become outdated due to the change and should be updated for clarity and maintainability.

Overall, the change is in the right direction to fix the bug, and with these adjustments, it will be more robust.
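To make the suggestion concrete, here is a minimal sketch of the abort path with both calls. It is based only on the names mentioned in this review (`abort`, `_worker_group_state.shutdown()`, `_clear_state()`) and is not the exact Ray source:

```python
# Sketch only: names beyond those cited in the review are assumptions.
def abort(self):
    """Abort the worker group."""
    if self._worker_group_state is not None:
        # Shutting down the worker group state removes the placement group,
        # freeing the resources reserved by the aborted run.
        self._worker_group_state.shutdown()
    # Reset internal bookkeeping so the WorkerGroup is not left in a
    # partially-shut-down state, as suggested above.
    self._clear_state()
```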
…oup.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Jason Li <[email protected]>
Thanks, can you add to the ABORTED test here: https://github.com/anyscale/rayturbo/blob/master/python/ray/train/v2/tests/test_worker_group.py#L512
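For reference, a hedged sketch of what such an addition might check; the `worker_group` fixture is hypothetical and not the actual setup helper in `test_worker_group.py`:

```python
from ray.util import placement_group_table


# Hypothetical extension of the ABORTED test linked above.
def test_abort_removes_placement_group(worker_group):
    worker_group.abort()
    # After abort, no placement group reserved by the run should remain active.
    assert all(
        info["state"] == "REMOVED"
        for info in placement_group_table().values()
    )
```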
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
- # `destroy_process_group` on an active group.
+ # `destroy_process_group` on an active group. A solution is to use a timeout
+ # in TorchConfig.on_shutdown in case of a hang.
"TODO: add shutdown callback hooks"
@@ -476,11 +476,15 @@ def abort(self):
  """Abort the worker group."""
  # TODO: consider shutting down the workers in the future.
  # We don't do this for now due to this risk of hanging e.g. when calling
- # `destroy_process_group` on an active group.
+ # `destroy_process_group` on an active group. A solution is to use a timeout
Wait, I think `consider shutting down the workers in the future` is no longer applicable, because `worker_group_state.shutdown` does do that, right? Do we need to fix the `destroy_process_group` on an active group issue in this PR too?
Ah, thanks for the catch! I will remove the comment regarding the worker shutdown. As for the `destroy_process_group` on an active group issue, it is not triggered unless we also perform the `before_worker_group_abort` callbacks, so it will not be included in this PR.
Signed-off-by: JasonLi1909 <[email protected]>
Thanks!
# TODO: Add a timeout in the case of a hang, particularly
# relevant when func is TorchConfig.on_shutdown
We can remove this; Tim added the timeout at a different layer.
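For illustration only, a timeout guard around a potentially hanging shutdown hook (e.g. `destroy_process_group` on an active group) could look roughly like this. `run_with_timeout` is a hypothetical helper sketching the general idea, not the code that was actually added:

```python
import concurrent.futures


def run_with_timeout(func, timeout_s: float = 30.0) -> bool:
    """Run `func` in a worker thread and give up after `timeout_s` seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        # Abandon the graceful shutdown instead of blocking the abort path;
        # the hung call is left to finish in the background.
        return False
    finally:
        executor.shutdown(wait=False)
```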
This PR addresses a bug that occurs when users abort a Train run from within a Python notebook. When a train run is aborted by stopping a cell execution, the associated placement group is not removed. Because the train job persists while the notebook kernel is still running, the placement group is never cleaned up, which prevents a subsequent train run from progressing. This fix explicitly shuts down the worker group state, which includes the placement group, upon abort, allowing the user to immediately kick off another train run without restarting the notebook.
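For context, the user-facing scenario looks roughly like this; the trainer configuration below is illustrative only:

```python
# Sketch of the notebook scenario described above (trainer config is illustrative).
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    ...  # per-worker training code


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)

# Stopping this cell mid-run aborts the train run. With this fix, the abort
# also removes the run's placement group, so a new trainer started afterwards
# can acquire resources without restarting the notebook kernel.
result = trainer.fit()
```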