gc collect from a gc_thread #55838
Conversation
Signed-off-by: Mao Yancan <[email protected]>
Code Review
This pull request introduces a dedicated garbage collection thread to prevent gc.collect() from blocking user code. The implementation is sound, but I've found an issue in a logging statement that would produce a misleading log message at runtime. I've also identified an opportunity to simplify the logic in the new GCManagerThread by removing some redundant code. The new test case correctly validates the intended behavior of both the new asynchronous GC and the original synchronous GC.
python/ray/_raylet.pyx (outdated)
logger.debug("GC triggered in {} seconds".format(
    triggered_by_global_gc, end - start))
The format string for this log message has one placeholder {} but two arguments (triggered_by_global_gc and end - start) are provided. str.format() silently ignores the surplus positional argument, so the message would report triggered_by_global_gc as the duration and drop end - start entirely. Please correct the format string to include both arguments.
logger.debug(f"GC triggered in {end - start:.4f} seconds (triggered_by_global_gc: {triggered_by_global_gc})")
Signed-off-by: Mao Yancan <[email protected]>
Thanks for the detailed writeup @yancanmao. The problem statement makes sense to me. I need to noodle a bit on the solution -- I'll review what you have and get back soon.
self.gc_in_progress = True
try:
    start = time.perf_counter()
    num_freed = gc.collect()
When I search online, it seems gc.collect() holds the GIL during the entire call anyway. Is that true? If so, moving it to a Python thread won't help.
This is a good point: gc.collect() holds the GIL, so it will always block user code, and our proposal doesn't change that. But we think it's still worth reconsidering how Ray triggers gc.collect():

- Periodic GC amplifies blocking: if gc.collect() exceeds min_gc_interval, the next tick may fire right after it finishes, creating continuous stalls.
- User destructors run under the GIL: during gc.collect(), Python runs __del__/__dealloc__ while holding the GIL. If a destructor only does Python work, it keeps the GIL the whole time and other threads are blocked. If a destructor calls into C code that releases the GIL (for example time.sleep() or I/O), other threads can run in those gaps. But Ray's current gc.collect() is wrapped with the GIL, so this benefit is lost.
- Thread scheduling: when gc.collect() runs in a Python thread, the interpreter can context-switch between the worker and GC threads (default switch interval ~5 ms), giving user code a chance to make progress.

Therefore, moving periodic GC to a dedicated Python thread doesn't remove the blocking nature of gc.collect(), but it avoids Ray introducing extra pauses and makes long stalls less disruptive to user code. The sketch below illustrates the destructor point.
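A minimal standalone sketch of the second point (not from the PR; Node and worker are hypothetical): a __del__ that calls time.sleep() releases the GIL, so when gc.collect() runs in a background thread, user code can make progress in those gaps.

    import gc
    import threading
    import time

    class Node:
        def __init__(self):
            self.ref = self  # cyclic reference: only gc.collect() can free it

        def __del__(self):
            time.sleep(0.001)  # C-level sleep releases the GIL

    def worker(stop_event):
        # Stand-in for user code: counts whenever it can grab the GIL.
        count = 0
        while not stop_event.is_set():
            count += 1
        print("worker iterations during GC:", count)

    gc.disable()  # avoid an automatic collection before the demo
    for _ in range(1000):
        Node()  # becomes unreachable cyclic garbage immediately

    stop_event = threading.Event()
    user_thread = threading.Thread(target=worker, args=(stop_event,))
    user_thread.start()

    # Collecting from a background thread: each __del__ sleep opens a GIL gap
    # in which the user thread runs.
    gc_thread = threading.Thread(target=gc.collect)
    gc_thread.start()
    gc_thread.join()
    stop_event.set()
    user_thread.join()
    gc.enable()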
Good points, let's go ahead with the proposed change
def init_gc_collect_manager(min_interval: int = 5) -> None:
    global _gc_thread
    with _gc_lock:
        if _gc_thread is not None and _gc_thread.is_alive():
            logger.warning("GC thread already initialized")
            return

        _gc_thread = GCManagerThread(min_interval)
        _gc_thread.start()
        logger.debug(f"GC thread initialized with min interval {min_interval}s")


def stop_gc_collect_manager_if_needed() -> None:
    global _gc_thread
    with _gc_lock:
        if _gc_thread is not None:
            _gc_thread.stop()
            _gc_thread = None
            logger.debug("GC thread stopped")
        else:
            logger.debug("No GC thread to stop")
I think these can be simplified to be started in the CoreWorker constructor in _raylet.pyx. No need for concurrency control or lazy initialization.
def trigger_gc() -> None:
    _gc_event.set()
We can make this a method on the GCManagerThread and have the core worker call it directly (if it keeps the GCManagerThread as an attribute).
except Exception as e:
    logger.error(f"Error during GC: {e}")
we should update the last_gc_time in this case to avoid hotlooping
if not self.gc_in_progress:
    self.gc_in_progress = True
What's the purpose of gc_in_progress?
python/ray/_raylet.pyx (outdated)
if RayConfig.instance().start_python_gc_manager_thread():
    init_gc_collect_manager(ray_constants.RAY_GC_MIN_COLLECT_INTERVAL)
Let's make the thread an attribute of the core worker (see my other comment). We also need to make sure to stop & join this thread when the core worker is shut down (for example, if the user calls ray.shutdown explicitly, the thread should be stopped & joined).
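A hedged sketch of the suggested lifecycle (CoreWorker and StoppableThread are illustrative stand-ins, not Ray's actual classes): start the thread eagerly in the constructor with no locking, and stop & join it on shutdown.

    import threading

    class StoppableThread(threading.Thread):
        """Minimal stand-in for the PR's GCManagerThread."""

        def __init__(self):
            super().__init__(daemon=True)
            self._stop_event = threading.Event()

        def stop(self):
            self._stop_event.set()

        def run(self):
            self._stop_event.wait()  # the real thread would loop doing GC work

    class CoreWorker:
        def __init__(self):
            # Eager initialization: no lock or lazy init needed.
            self.gc_manager_thread = StoppableThread()
            self.gc_manager_thread.start()

        def shutdown(self):
            # Called on ray.shutdown() so the thread doesn't outlive the worker.
            self.gc_manager_thread.stop()
            self.gc_manager_thread.join()

    worker = CoreWorker()
    worker.shutdown()  # thread exits promptly and is joined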
Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Mao Yancan <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
Thanks for your insightful comments! I have revised the code accordingly.
Signed-off-by: Mao Yancan <[email protected]>
if not self._running:
    break

time_since_last_gc = time.perf_counter() - self._last_gc_time
No need to use perf_counter here -- it's more expensive since it's high resolution.
python/ray/tests/test_global_gc.py (outdated)
assert gc_thread.is_alive()

gc_thread.trigger_gc()
time.sleep(0.1)
We shouldn't depend on timing conditions in tests -- please inject a fake clock instead of using real time.sleep and make the tests deterministic.
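One possible shape for such a deterministic test (GCManager and FakeClock are hypothetical simplifications, not the PR's actual API); it also covers the once-per-interval behavior suggested below:

    class FakeClock:
        """Deterministic stand-in for time.monotonic()."""

        def __init__(self):
            self.now = 0.0

        def monotonic(self):
            return self.now

        def advance(self, seconds):
            self.now += seconds

    class GCManager:
        """Simplified, thread-free model of the min-interval throttle."""

        def __init__(self, min_interval, clock):
            self.min_interval = min_interval
            self.clock = clock
            self.last_gc_time = -float("inf")
            self.collections = 0

        def maybe_collect(self):
            if self.clock() - self.last_gc_time >= self.min_interval:
                self.collections += 1  # stands in for gc.collect()
                self.last_gc_time = self.clock()

    def test_gc_only_collects_once_per_interval():
        clock = FakeClock()
        manager = GCManager(min_interval=5, clock=clock.monotonic)
        manager.maybe_collect()  # first trigger collects
        manager.maybe_collect()  # same instant: throttled
        assert manager.collections == 1
        clock.advance(5)
        manager.maybe_collect()  # interval elapsed: collects again
        assert manager.collections == 2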
python/ray/tests/test_global_gc.py (outdated)
@@ -216,5 +219,140 @@ def f(self):
    gc.enable()


@pytest.mark.timeout(30)
@pytest.mark.parametrize("support_fork", [True, False])
def test_long_local_gc(shutdown_only, support_fork):
This test is too expensive/slow to run in CI and is highly likely to be flaky. Instead, let's add a basic test to check that the garbage is only collected once per interval, and set the interval to be something modest like 5s.
Mostly nits at this point, thanks for addressing the comments!
Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Mao Yancan <[email protected]>
Force-pushed from 15c688d to 92a2069.
Signed-off-by: Mao Yancan <[email protected]>
Thanks for the comments! I have updated the code accordingly.
I've triggered CI. Please tag me when tests are passing for merge!
All tests passed! Thanks for the review!
@jjyao final pass before merging?
Why are these changes needed?
Ray proactively triggers gc.collect() on idle workers to release Python objects that may still hold Plasma shared memory (shm) references.
In the current implementation (gc_collect() in _raylet.pyx), Ray periodically calls gc.collect() from Cython inside a with gil block.
If the Python object graph is complex (e.g., cyclic references with finalizers), gc.collect() may take a long time. Since the GIL is held for the entire collection, user code is completely frozen, and if a collection takes longer than the periodic interval (e.g., 10s), the stalls become back-to-back.
We propose decoupling GC execution from the RPC call:
- gc_collect in Cython should not directly run gc.collect().
- Instead, it should just signal an event (e.g., using a threading.Event or similar), which takes minimal execution time.
- A dedicated Python GC thread consumes this event and executes gc.collect() asynchronously, with a configurable minimum GC interval.
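A minimal sketch of the proposed design (names and details are illustrative assumptions, not the PR's exact implementation):

    import gc
    import threading
    import time

    class GCManagerThread(threading.Thread):
        def __init__(self, min_interval: float = 5):
            super().__init__(daemon=True)
            self._min_interval = min_interval
            self._event = threading.Event()
            self._running = True
            self._last_gc_time = -float("inf")

        def trigger_gc(self):
            # Called from the Cython gc_collect() handler; returns immediately.
            self._event.set()

        def stop(self):
            self._running = False
            self._event.set()

        def run(self):
            while True:
                self._event.wait()
                self._event.clear()
                if not self._running:
                    break
                if time.monotonic() - self._last_gc_time >= self._min_interval:
                    try:
                        gc.collect()
                    finally:
                        # Update even on error, per review, to avoid hot-looping.
                        self._last_gc_time = time.monotonic()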
Related issue number
Closes #55837
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've introduced a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.