
Conversation

yancanmao
Contributor

@yancanmao yancanmao commented Aug 22, 2025

Why are these changes needed?

Ray proactively triggers gc.collect() on idle workers to release Python objects that may still hold Plasma shared memory (shm) references.
In the current implementation (gc_collect() in _raylet.pyx), Ray periodically calls gc.collect() from Cython inside a with gil block.
If the Python object graph is complex (e.g., cyclic references with finalizers), gc.collect() may take a long time. Since the GIL is held for the entire collection, user code is completely frozen, and if a single collection takes longer than the periodic interval (e.g., 10s), the stalls become back-to-back.

We propose decoupling GC execution from the RPC call:

  • gc_collect in Cython should not directly run gc.collect().
  • Instead, it should signal an event (e.g., using a threading.Event or similar), which is cheap and returns immediately.
  • A dedicated Python GC thread consumes this event and executes gc.collect() asynchronously, with a configurable minimum interval between collections.
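The proposal above can be sketched roughly as follows. The class and method names (PythonGCThread, trigger_gc) mirror ones discussed in this PR, but the implementation details here are illustrative, not the merged code:

```python
import gc
import threading
import time


class PythonGCThread(threading.Thread):
    """Illustrative sketch: runs gc.collect() off the RPC path,
    rate-limited to at most one collection per min_interval seconds."""

    def __init__(self, min_interval: float = 10.0):
        super().__init__(daemon=True, name="python-gc-thread")
        self._min_interval = min_interval
        self._gc_event = threading.Event()
        self._running = True
        self._last_gc_time = 0.0

    def trigger_gc(self) -> None:
        # Called from Cython instead of running gc.collect() under the GIL;
        # setting an event is cheap and returns immediately.
        self._gc_event.set()

    def run(self) -> None:
        while self._running:
            self._gc_event.wait()
            self._gc_event.clear()
            if not self._running:
                break
            # Enforce the minimum interval between collections.
            remaining = self._min_interval - (time.monotonic() - self._last_gc_time)
            if remaining > 0:
                time.sleep(remaining)
            try:
                gc.collect()
            finally:
                self._last_gc_time = time.monotonic()

    def stop(self) -> None:
        self._running = False
        self._gc_event.set()  # wake the thread so it can observe _running
        self.join()
```

The key property is that the caller of trigger_gc() never blocks on the collection itself; the GC thread absorbs both the work and the rate limiting.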

Related issue number

Closes #55837

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Mao Yancan <[email protected]>
@yancanmao yancanmao requested a review from a team as a code owner August 22, 2025 09:43
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a dedicated garbage collection thread to prevent gc.collect() from blocking user code. The implementation is sound, but I've found a critical issue in a logging statement that would cause a runtime error. I've also identified an opportunity to simplify the logic in the new GCManagerThread by removing some redundant code. The new test case correctly validates the intended behavior of both the new asynchronous GC and the original synchronous GC.

Comment on lines 2498 to 2499
logger.debug("GC triggered in {} seconds".format(
    triggered_by_global_gc, end - start))
Contributor


critical

The format string for this log message has one placeholder {} but two arguments (triggered_by_global_gc and end - start) are provided. This will raise a TypeError at runtime. Please correct the format string to include both arguments.

            logger.debug(f"GC triggered in {end - start:.4f} seconds (triggered_by_global_gc: {triggered_by_global_gc})")

Signed-off-by: Mao Yancan <[email protected]>
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Aug 22, 2025
@edoakes edoakes self-assigned this Aug 22, 2025
@edoakes
Collaborator

edoakes commented Aug 22, 2025

Thanks for the detailed writeup @yancanmao. The problem statement makes sense to me. I need to noodle a bit on the solution -- I'll review what you have and get back soon.

self.gc_in_progress = True
try:
    start = time.perf_counter()
    num_freed = gc.collect()
Collaborator


When I search online, it seems gc.collect() holds the GIL during the entire time of the call anyway. Is that true? If so, then moving it to a Python thread won't help.

Contributor Author


This is a good point: gc.collect() holds the GIL, so it will always block user code, and our proposal doesn't change that. But we think it's still worth reconsidering how Ray triggers gc.collect():

  • Periodic GC amplifies blocking: if gc.collect() exceeds min_gc_interval, the next tick may fire right after the previous collection finishes, creating continuous stalls.

  • User destructors under the GIL: during gc.collect(), Python runs __del__/__dealloc__ while holding the GIL.
    If a destructor only does Python work, it keeps the GIL the whole time, and other threads are blocked.
    If a destructor calls into C code that releases the GIL (for example time.sleep() or I/O), other threads can run in those gaps. But Ray's current gc.collect() is wrapped in a with gil block, so this benefit is lost.

  • Thread scheduling: when gc.collect() runs in a Python thread, the interpreter can context-switch between the worker and GC threads (default switch interval ~5 ms), giving user code a chance to make progress.

Therefore, moving periodic GC to a dedicated Python thread doesn't remove the blocking nature of gc.collect(), but it avoids Ray introducing extra pauses and makes long stalls less disruptive to user code.
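The second point can be demonstrated with a small self-contained example. The Noisy class and timings below are made up purely for illustration: its destructor calls time.sleep(), a C-level call that releases the GIL, so a second thread can make progress while the cyclic GC is tearing the object down:

```python
import gc
import threading
import time

deleted = []
progress = []


class Noisy:
    """Hypothetical demo object: its destructor does a blocking C-level
    call (time.sleep), which releases the GIL while it waits."""

    def __del__(self):
        time.sleep(0.05)  # GIL is released for the duration of the sleep
        deleted.append(True)


def worker():
    # Simulated user code running in another thread.
    for i in range(5):
        progress.append(i)
        time.sleep(0.01)


# Build a reference cycle so the object can only be freed by the cyclic GC.
a = Noisy()
a.self_ref = a
del a

t = threading.Thread(target=worker)
t.start()
gc.collect()  # runs Noisy.__del__; the worker progresses during its sleep
t.join()
```

If the destructor instead did pure-Python work, the worker thread would only run in the gaps the interpreter's switch interval allows, and a with gil wrapper around the whole collection would suppress even those gaps.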

Collaborator


Good points, let's go ahead with the proposed change


Comment on lines 62 to 82
def init_gc_collect_manager(min_interval: int = 5) -> None:
    global _gc_thread
    with _gc_lock:
        if _gc_thread is not None and _gc_thread.is_alive():
            logger.warning("GC thread already initialized")
            return

        _gc_thread = GCManagerThread(min_interval)
        _gc_thread.start()
        logger.debug(f"GC thread initialized with min interval {min_interval}s")


def stop_gc_collect_manager_if_needed() -> None:
    global _gc_thread
    with _gc_lock:
        if _gc_thread is not None:
            _gc_thread.stop()
            _gc_thread = None
            logger.debug("GC thread stopped")
        else:
            logger.debug("No GC thread to stop")
Collaborator


I think these can be simplified to be started in the CoreWorker constructor in _raylet.pyx. no need for concurrency control or lazy initialization

Comment on lines 85 to 86
def trigger_gc() -> None:
    _gc_event.set()
Collaborator


we can make this a method on the GCManagerThread and have the core worker call it directly (if it keeps the GCManagerThread as an attribute)

Comment on lines 49 to 50
except Exception as e:
    logger.error(f"Error during GC: {e}")
Collaborator


we should update the last_gc_time in this case to avoid hotlooping
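The fix the reviewer is asking for is the standard finally pattern: record the attempt time whether or not gc.collect() raised, so a persistent failure is still throttled by the interval check. A minimal sketch (the state dict and print calls stand in for the thread's attributes and logger):

```python
import gc
import time


def run_gc_once(state: dict) -> None:
    """Update last_gc_time even when gc.collect() raises, so an
    exception cannot cause the GC loop to retry in a hot loop."""
    try:
        start = time.monotonic()
        num_freed = gc.collect()
        print(f"gc.collect() freed {num_freed} objects "
              f"in {time.monotonic() - start:.4f}s")
    except Exception as e:
        print(f"Error during GC: {e}")
    finally:
        # Unconditionally record the time of this attempt.
        state["last_gc_time"] = time.monotonic()


state = {"last_gc_time": 0.0}
run_gc_once(state)
```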

Comment on lines 37 to 38
if not self.gc_in_progress:
    self.gc_in_progress = True
Collaborator


what's the purpose of gc_in_progress?

Comment on lines 3057 to 3058
if RayConfig.instance().start_python_gc_manager_thread():
    init_gc_collect_manager(ray_constants.RAY_GC_MIN_COLLECT_INTERVAL)
Collaborator


let's make the thread an attribute of the core worker (see my other comment)

we also need to make sure to stop & join this thread when the core worker is shut down (for example if the user calls ray.shutdown explicitly, the thread should be stopped & joined)
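The stop-and-join pattern being requested looks roughly like this. The class below is a generic illustration (names are not from the PR): stop() flips the running flag, wakes the thread in case it is blocked on the event, and joins it so shutdown does not leak a live thread:

```python
import threading


class StoppableGCThread(threading.Thread):
    """Generic sketch of a cleanly stoppable event-driven thread."""

    def __init__(self):
        super().__init__(daemon=True, name="gc-thread")
        self._event = threading.Event()
        self._running = True

    def run(self):
        while self._running:
            self._event.wait()
            self._event.clear()
            # ... gc.collect() would run here when triggered ...

    def stop(self):
        self._running = False
        self._event.set()  # wake the thread so it observes _running
        self.join()
```

With the thread held as a core worker attribute, an explicit ray.shutdown() path would simply call stop() on it.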

yancanmao and others added 2 commits August 29, 2025 09:28
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
@yancanmao
Contributor Author

Thanks for your insightful comments! I have revised the code as follows:

  1. Renamed GCManagerThread to PythonGCThread.
  2. Made trigger_gc() a method inside PythonGCThread.
  3. Removed locks/variables for concurrency control.
  4. Added UT for gc thread tests.
  5. Added gc_thread.stop() inside shutdown_driver()

Mao Yancan added 2 commits August 29, 2025 13:57
Signed-off-by: Mao Yancan <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
if not self._running:
    break

time_since_last_gc = time.perf_counter() - self._last_gc_time
Collaborator


no need to use perf_counter here -- it's more expensive since it's high resolution

assert gc_thread.is_alive()

gc_thread.trigger_gc()
time.sleep(0.1)
Collaborator


we shouldn't depend on timing conditions in tests -- please inject a fake clock instead of using real time.sleep and make the tests deterministic
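A common way to make such tests deterministic is to inject a clock object instead of calling time functions directly. The FakeClock and should_collect helper below are an illustrative sketch of the technique, not the PR's test code:

```python
class FakeClock:
    """Deterministic clock for tests: advance() replaces real sleeping."""

    def __init__(self):
        self.now = 0.0

    def monotonic(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def should_collect(clock, last_gc_time: float, min_interval: float) -> bool:
    # Rate-limit check driven by the injected clock rather than time.monotonic.
    return clock.monotonic() - last_gc_time >= min_interval


clock = FakeClock()
print(should_collect(clock, last_gc_time=0.0, min_interval=5.0))  # False
clock.advance(5.0)
print(should_collect(clock, last_gc_time=0.0, min_interval=5.0))  # True
```

The production code would default the clock to the real time module, while tests pass a FakeClock and advance it explicitly, so no assertion ever depends on wall-clock timing.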

@@ -216,5 +219,140 @@ def f(self):
gc.enable()


@pytest.mark.timeout(30)
@pytest.mark.parametrize("support_fork", [True, False])
def test_long_local_gc(shutdown_only, support_fork):
Collaborator


This test is too expensive/slow to run in CI and is highly likely to be flaky. Instead, let's add a basic test to check that the garbage is only collected once per interval, and set the interval to be something modest like 5s.

@edoakes
Collaborator

edoakes commented Sep 2, 2025

Mostly nits at this point, thanks for addressing the comments!

yancanmao and others added 3 commits September 3, 2025 10:09
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
@yancanmao yancanmao force-pushed the new_gc_collect_thread branch from 15c688d to 92a2069 Compare September 3, 2025 05:16
Mao Yancan and others added 6 commits September 3, 2025 13:17
@yancanmao
Contributor Author

Mostly nits at this point, thanks for addressing the comments!

Thanks for the comments! I have updated the code accordingly.

@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Sep 3, 2025
@edoakes
Collaborator

edoakes commented Sep 3, 2025

I've triggered CI. Please tag me when tests are passing for merge!

@yancanmao
Contributor Author

I've triggered CI. Please tag me when tests are passing for merge!

All tests passed! Thanks for the review!

@edoakes
Collaborator

edoakes commented Sep 4, 2025

@jjyao final pass before merging?

@edoakes edoakes merged commit 3303031 into ray-project:master Sep 5, 2025
5 checks passed
Successfully merging this pull request may close these issues.

[Core] gc_collect() in _raylet.pyx blocks user program due to GIL hold