@@ -476,11 +476,14 @@ def abort(self):
         """Abort the worker group."""
         # TODO: consider shutting down the workers in the future.
         # We don't do this for now due to this risk of hanging e.g. when calling
-        # `destroy_process_group` on an active group.
+        # `destroy_process_group` on an active group. A solution is to use a timeout
Contributor @TimothySeah, Aug 28, 2025

Wait, I think `consider shutting down the workers in the future` is no longer applicable, because `worker_group_state.shutdown` already does that, right? Do we also need to fix the `destroy_process_group` on an active group issue in this PR?

Contributor Author

Ah, thanks for the catch! I will remove the comment about the worker shutdown. As for the `destroy_process_group` on an active group issue, it is not triggered unless we also run the `before_worker_group_abort` callbacks, so the fix will not be included in this PR.
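
For reference, the timeout idea mentioned in the code comment under review could look roughly like the sketch below. It is only an illustration of bounding a call that may hang, not the actual `TorchConfig.on_shutdown` implementation. The names `run_with_timeout` and `fake_destroy_process_group` are hypothetical and are not part of Ray or PyTorch.

```python
import threading
import time


def run_with_timeout(fn, timeout_s):
    """Run fn in a daemon thread; return True if it finished within timeout_s.

    Hypothetical helper for illustration only. If fn hangs, the daemon thread
    is abandoned rather than cancelled, so the caller can continue shutting down.
    """
    done = threading.Event()

    def _target():
        try:
            fn()
        finally:
            # Signal completion whether fn returned or raised.
            done.set()

    threading.Thread(target=_target, daemon=True).start()
    # Event.wait returns False if the timeout elapsed before the event was set.
    return done.wait(timeout_s)


def fake_destroy_process_group():
    # Hypothetical stand-in for a shutdown call that blocks on an active group.
    time.sleep(2)


finished = run_with_timeout(fake_destroy_process_group, timeout_s=0.1)
quick = run_with_timeout(lambda: None, timeout_s=1.0)
```

Here `finished` reports whether the slow call completed in time, so the caller can log a warning and move on instead of blocking the abort path. A thread cannot be forcibly stopped in Python, which is why the sketch uses a daemon thread and simply stops waiting for it.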

+        # in TorchConfig.on_shutdown in case of a hang.
Contributor

"TODO: add shutdown callback hooks"

         self._assert_active()
+        for callback in self._callbacks:
+            callback.before_worker_group_abort(self._worker_group_context)

         self._worker_group_state.shutdown()

#####################################################################################
# Polling Worker Group
#####################################################################################