
Conversation

yancanmao
Contributor

@yancanmao yancanmao commented Aug 22, 2025

Why are these changes needed?

Ray proactively triggers gc.collect() on idle workers to release Python objects that may still hold Plasma shared memory (shm) references.
In the current implementation (gc_collect() in _raylet.pyx), Ray periodically calls gc.collect() from Cython inside a with gil block.
If the Python object graph is complex (e.g., cyclic references with finalizers), gc.collect() may take a long time. Since the GIL is held for the entire collection, user code is completely frozen, and if a single collection takes longer than the periodic interval (e.g., 10s), the stalls become back-to-back.

We propose decoupling GC execution from the RPC call:

  • gc_collect in Cython should not directly run gc.collect().
  • Instead, it should signal an event (e.g., using a threading.Event or similar), which is cheap and returns immediately.
  • A dedicated Python GC thread consumes this event and executes gc.collect() asynchronously, with a configurable minimum interval between collections.
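The proposal above can be sketched roughly as follows. The class and method names (PythonGCThread, trigger_gc) mirror ones discussed in this PR, but the implementation details here are illustrative, not the merged code:

```python
import gc
import threading
import time


class PythonGCThread(threading.Thread):
    """Illustrative sketch: runs gc.collect() off the RPC path,
    rate-limited to at most one collection per min_interval seconds."""

    def __init__(self, min_interval: float = 10.0):
        super().__init__(daemon=True, name="python-gc-thread")
        self._min_interval = min_interval
        self._gc_event = threading.Event()
        self._running = True
        self._last_gc_time = 0.0

    def trigger_gc(self) -> None:
        # Called from Cython instead of running gc.collect() under the GIL;
        # setting an event is cheap and returns immediately.
        self._gc_event.set()

    def run(self) -> None:
        while self._running:
            self._gc_event.wait()
            self._gc_event.clear()
            if not self._running:
                break
            # Enforce the minimum interval between collections.
            remaining = self._min_interval - (time.monotonic() - self._last_gc_time)
            if remaining > 0:
                time.sleep(remaining)
            try:
                gc.collect()
            finally:
                self._last_gc_time = time.monotonic()

    def stop(self) -> None:
        self._running = False
        self._gc_event.set()  # wake the thread so it can observe _running
        self.join()
```

The key property is that the caller of trigger_gc() never blocks on the collection itself; the GC thread absorbs both the work and the rate limiting.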

Related issue number

Closes #55837

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Mao Yancan <[email protected]>
@yancanmao yancanmao requested a review from a team as a code owner August 22, 2025 09:43
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a dedicated garbage collection thread to prevent gc.collect() from blocking user code. The implementation is sound, but I've found a critical issue in a logging statement that would cause a runtime error. I've also identified an opportunity to simplify the logic in the new GCManagerThread by removing some redundant code. The new test case correctly validates the intended behavior of both the new asynchronous GC and the original synchronous GC.

Comment on lines 2498 to 2499
logger.debug("GC triggered in {} seconds".format(
    triggered_by_global_gc, end - start))
Contributor


critical

The format string for this log message has one placeholder {} but two arguments (triggered_by_global_gc and end - start) are provided. This will raise a TypeError at runtime. Please correct the format string to include both arguments.

            logger.debug(f"GC triggered in {end - start:.4f} seconds (triggered_by_global_gc: {triggered_by_global_gc})")

Signed-off-by: Mao Yancan <[email protected]>
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Aug 22, 2025
@edoakes edoakes self-assigned this Aug 22, 2025
@edoakes
Collaborator

edoakes commented Aug 22, 2025

Thanks for the detailed writeup @yancanmao. The problem statement makes sense to me. I need to noodle a bit on the solution -- I'll review what you have and get back soon.

self.gc_in_progress = True
try:
    start = time.perf_counter()
    num_freed = gc.collect()
Collaborator


When I search online, it seems gc.collect() holds the GIL during the entire time of the call anyway. Is that true? If so, then moving it to a Python thread won't help.

Contributor Author


This is a good point: gc.collect() holds the GIL, so it will always block user code, and our proposal doesn't change that. But we think it's still worth reconsidering how Ray triggers gc.collect():

  • Periodic GC amplifies blocking: if gc.collect() exceeds min_gc_interval, the next tick may fire right after the previous collection finishes, creating continuous stalls.

  • User destructors under the GIL: during gc.collect(), Python runs __del__/__dealloc__ while holding the GIL.
    If a destructor only does Python work, it keeps the GIL the whole time, and other threads are blocked.
    If a destructor calls into C code that releases the GIL (for example time.sleep() or I/O), other threads can run in those gaps. But Ray's current gc.collect() is wrapped in a with gil block, so this benefit is lost.

  • Thread scheduling: when gc.collect() runs in a Python thread, the interpreter can context-switch between the worker and GC threads (default switch interval ~5 ms), giving user code a chance to make progress.

Therefore, moving periodic GC to a dedicated Python thread doesn't remove the blocking nature of gc.collect(), but it avoids Ray introducing extra pauses and makes long stalls less disruptive to user code.
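The second point can be demonstrated with a small self-contained example. The Noisy class and timings below are made up purely for illustration: its destructor calls time.sleep(), a C-level call that releases the GIL, so a second thread can make progress while the cyclic GC is tearing the object down:

```python
import gc
import threading
import time

deleted = []
progress = []


class Noisy:
    """Hypothetical demo object: its destructor does a blocking C-level
    call (time.sleep), which releases the GIL while it waits."""

    def __del__(self):
        time.sleep(0.05)  # GIL is released for the duration of the sleep
        deleted.append(True)


def worker():
    # Simulated user code running in another thread.
    for i in range(5):
        progress.append(i)
        time.sleep(0.01)


# Build a reference cycle so the object can only be freed by the cyclic GC.
a = Noisy()
a.self_ref = a
del a

t = threading.Thread(target=worker)
t.start()
gc.collect()  # runs Noisy.__del__; the worker progresses during its sleep
t.join()
```

If the destructor instead did pure-Python work, the worker thread would only run in the gaps the interpreter's switch interval allows, and a with gil wrapper around the whole collection would suppress even those gaps.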

Collaborator


Good points, let's go ahead with the proposed change


Comment on lines 62 to 82
def init_gc_collect_manager(min_interval: int = 5) -> None:
    global _gc_thread
    with _gc_lock:
        if _gc_thread is not None and _gc_thread.is_alive():
            logger.warning("GC thread already initialized")
            return

        _gc_thread = GCManagerThread(min_interval)
        _gc_thread.start()
        logger.debug(f"GC thread initialized with min interval {min_interval}s")


def stop_gc_collect_manager_if_needed() -> None:
    global _gc_thread
    with _gc_lock:
        if _gc_thread is not None:
            _gc_thread.stop()
            _gc_thread = None
            logger.debug("GC thread stopped")
        else:
            logger.debug("No GC thread to stop")
Collaborator


I think these can be simplified to be started in the CoreWorker constructor in _raylet.pyx. no need for concurrency control or lazy initialization

Comment on lines 85 to 86
def trigger_gc() -> None:
    _gc_event.set()
Collaborator


we can make this a method on the GCManagerThread and have the core worker call it directly (if it keeps the GCManagerThread as an attribute)

Comment on lines 49 to 50
except Exception as e:
    logger.error(f"Error during GC: {e}")
Collaborator


we should update the last_gc_time in this case to avoid hotlooping
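The fix the reviewer is asking for is the standard finally pattern: record the attempt time whether or not gc.collect() raised, so a persistent failure is still throttled by the interval check. A minimal sketch (the state dict and print calls stand in for the thread's attributes and logger):

```python
import gc
import time


def run_gc_once(state: dict) -> None:
    """Update last_gc_time even when gc.collect() raises, so an
    exception cannot cause the GC loop to retry in a hot loop."""
    try:
        start = time.monotonic()
        num_freed = gc.collect()
        print(f"gc.collect() freed {num_freed} objects "
              f"in {time.monotonic() - start:.4f}s")
    except Exception as e:
        print(f"Error during GC: {e}")
    finally:
        # Unconditionally record the time of this attempt.
        state["last_gc_time"] = time.monotonic()


state = {"last_gc_time": 0.0}
run_gc_once(state)
```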

Comment on lines 37 to 38
if not self.gc_in_progress:
    self.gc_in_progress = True
Collaborator


what's the purpose of gc_in_progress?

Comment on lines 3057 to 3058
if RayConfig.instance().start_python_gc_manager_thread():
    init_gc_collect_manager(ray_constants.RAY_GC_MIN_COLLECT_INTERVAL)
Collaborator


let's make the thread an attribute of the core worker (see my other comment)

we also need to make sure to stop & join this thread when the core worker is shut down (for example if the user calls ray.shutdown explicitly, the thread should be stopped & joined)
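The stop-and-join pattern being requested looks roughly like this. The class below is a generic illustration (names are not from the PR): stop() flips the running flag, wakes the thread in case it is blocked on the event, and joins it so shutdown does not leak a live thread:

```python
import threading


class StoppableGCThread(threading.Thread):
    """Generic sketch of a cleanly stoppable event-driven thread."""

    def __init__(self):
        super().__init__(daemon=True, name="gc-thread")
        self._event = threading.Event()
        self._running = True

    def run(self):
        while self._running:
            self._event.wait()
            self._event.clear()
            # ... gc.collect() would run here when triggered ...

    def stop(self):
        self._running = False
        self._event.set()  # wake the thread so it observes _running
        self.join()
```

With the thread held as a core worker attribute, an explicit ray.shutdown() path would simply call stop() on it.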

yancanmao and others added 2 commits August 29, 2025 09:28
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
@yancanmao
Contributor Author

Thanks for your insightful comments! I have revised the code as follows:

  1. Renamed GCManagerThread to PythonGCThread.
  2. Made trigger_gc() a method inside PythonGCThread.
  3. Removed locks/variables for concurrency control.
  4. Added UT for gc thread tests.
  5. Added gc_thread.stop() inside shutdown_driver()

Mao Yancan added 2 commits August 29, 2025 13:57
Signed-off-by: Mao Yancan <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
if not self._running:
    break

time_since_last_gc = time.perf_counter() - self._last_gc_time
Collaborator


no need to use perf_counter here -- it's more expensive since it's high resolution

assert gc_thread.is_alive()

gc_thread.trigger_gc()
time.sleep(0.1)
Collaborator


we shouldn't depend on timing conditions in tests -- please inject a fake clock instead of using real time.sleep and make the tests deterministic
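A common way to make such tests deterministic is to inject a clock object instead of calling time functions directly. The FakeClock and should_collect helper below are an illustrative sketch of the technique, not the PR's test code:

```python
class FakeClock:
    """Deterministic clock for tests: advance() replaces real sleeping."""

    def __init__(self):
        self.now = 0.0

    def monotonic(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def should_collect(clock, last_gc_time: float, min_interval: float) -> bool:
    # Rate-limit check driven by the injected clock rather than time.monotonic.
    return clock.monotonic() - last_gc_time >= min_interval


clock = FakeClock()
print(should_collect(clock, last_gc_time=0.0, min_interval=5.0))  # False
clock.advance(5.0)
print(should_collect(clock, last_gc_time=0.0, min_interval=5.0))  # True
```

The production code would default the clock to the real time module, while tests pass a FakeClock and advance it explicitly, so no assertion ever depends on wall-clock timing.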

@@ -216,5 +219,140 @@ def f(self):
gc.enable()


@pytest.mark.timeout(30)
@pytest.mark.parametrize("support_fork", [True, False])
def test_long_local_gc(shutdown_only, support_fork):
Collaborator


This test is too expensive/slow to run in CI and is highly likely to be flaky. Instead, let's add a basic test to check that the garbage is only collected once per interval, and set the interval to be something modest like 5s.

@edoakes
Collaborator

edoakes commented Sep 2, 2025

Mostly nits at this point, thanks for addressing the comments!

yancanmao and others added 3 commits September 3, 2025 10:09
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Mao Yancan <[email protected]>
@yancanmao yancanmao force-pushed the new_gc_collect_thread branch from 15c688d to 92a2069 Compare September 3, 2025 05:16
Mao Yancan and others added 6 commits September 3, 2025 13:17
@yancanmao
Contributor Author

Mostly nits at this point, thanks for addressing the comments!

Thanks for the comments! I have updated the code accordingly.

@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Sep 3, 2025
@edoakes
Collaborator

edoakes commented Sep 3, 2025

I've triggered CI. Please tag me when tests are passing for merge!

@yancanmao
Contributor Author

I've triggered CI. Please tag me when tests are passing for merge!

All tests passed! Thanks for the review!

@edoakes
Collaborator

edoakes commented Sep 4, 2025

@jjyao final pass before merging?

@edoakes edoakes merged commit 3303031 into ray-project:master Sep 5, 2025
5 checks passed
Successfully merging this pull request may close these issues.

[Core] gc_collect() in _raylet.pyx blocks user program due to GIL hold