Conversation

JasonLi1909
Contributor

This PR addresses a bug that occurs when users abort a Train Run from within a Python notebook. When a train run is aborted by stopping a cell execution, the associated placement groups are not removed. Because the train job persists while the notebook kernel is still running, the placement group is never cleaned up, which prevents the subsequent train run from progressing. This fix manually shuts down the worker group state, which includes the placement group, upon abort, allowing the user to immediately kick off another train run without having to restart the notebook.
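For context, a minimal sketch of the leak mechanism using Ray's public placement group API (this is illustrative only, not code from this PR): a placement group reserves resources for as long as the driver job is alive, so interrupting a notebook cell does not release it; something has to remove it explicitly.

import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()

# A placement group reserves cluster resources for the lifetime of the Ray job
# (here, the notebook kernel), not just the lifetime of one cell execution.
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

# If the training cell is interrupted at this point, the kernel (and the job)
# keeps running, so the reservation is never reclaimed automatically; it has
# to be removed explicitly, which is what the shutdown path in this fix does
# for Ray Train's internal placement group.
remove_placement_group(pg)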

@JasonLi1909 JasonLi1909 requested a review from a team as a code owner August 27, 2025 19:56
@JasonLi1909 JasonLi1909 requested a review from justinvyu August 27, 2025 19:57
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a resource leak by ensuring placement groups are removed when a train run is aborted. The change adds a call to _worker_group_state.shutdown() in the abort method, which is the right approach.

My review includes two main points for improvement:

  1. After aborting, the WorkerGroup object is left in an inconsistent state. I've suggested adding a call to _clear_state() to resolve this.
  2. A TODO comment has become outdated due to the change and should be updated for clarity and maintainability.

Overall, the change is in the right direction to fix the bug, and with these adjustments, it will be more robust.
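As a rough illustration only, the abort path with both the shutdown call and the suggested _clear_state() call might look like the sketch below. The method names are the ones mentioned in this review; the exact implementation merged in the PR may differ.

def abort(self):
    """Abort the worker group."""
    # Tear down the worker group state, including the placement group, so a
    # subsequent train run can reserve resources immediately.
    self._worker_group_state.shutdown()
    # Reset internal bookkeeping so the WorkerGroup is not left in an
    # inconsistent, half-shutdown state.
    self._clear_state()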

…oup.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jason Li <[email protected]>
Contributor

@justinvyu justinvyu left a comment


Comment on lines 477 to 480
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
# `destroy_process_group` on an active group.
# `destroy_process_group` on an active group. A solution is to use a timeout
# in TorchConfig.on_shutdown in case of a hang.
Contributor


"TODO: add shutdown callback hooks"

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Aug 28, 2025
@@ -476,11 +476,15 @@ def abort(self):
"""Abort the worker group."""
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
# `destroy_process_group` on an active group.
# `destroy_process_group` on an active group. A solution is to use a timeout
Contributor

@TimothySeah TimothySeah Aug 28, 2025


Wait, I think "consider shutting down the workers in the future" is no longer applicable, because worker_group_state.shutdown does do that, right? Do we need to fix the destroy_process_group-on-an-active-group issue in this PR too?

Contributor Author


Ah, thanks for the catch! I will remove the comment regarding the worker shutdown. As for destroy_process_group on an active group, that is not triggered unless we also perform the before_worker_group_abort callbacks, so it will not be included in this PR.

@JasonLi1909 JasonLi1909 added the go add ONLY when ready to merge, run all tests label Aug 29, 2025
Contributor

@justinvyu justinvyu left a comment


Thanks!

Comment on lines +272 to +273
# TODO: Add a timeout in the case of a hang, particularly
# relevant when func is TorchConfig.on_shutdown
Contributor


We can remove this; Tim added the timeout at a different layer:

#56182
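PR #56182 itself is not shown on this page. Purely as an illustration of the timeout-at-another-layer approach, a generic sketch of guarding a possibly-hanging shutdown callable with a deadline might look like the following; run_with_timeout and the 30-second default are invented for the example and are not the actual implementation.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeout(func, timeout_s: float = 30.0):
    """Run a shutdown callable, giving up if it exceeds the deadline."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # The call is stuck (e.g. destroy_process_group on an active group);
        # give up so the abort path can continue.
        return None
    finally:
        # Don't block waiting for a thread that may still be hung.
        executor.shutdown(wait=False)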

@justinvyu justinvyu enabled auto-merge (squash) September 3, 2025 18:41
@justinvyu justinvyu merged commit a040e6a into ray-project:master Sep 3, 2025
6 checks passed