Conversation

@lee1258561


Description

memory_full_info is very expensive and degrades job performance.

Detailed GitHub issue: #55117
Pinterest blog post outlining the debugging details: https://medium.com/@Pinterest_Engineering/tracking-down-mysterious-ml-training-stalls-5290bb19be6d

Related issues

#55117

Additional information


@lee1258561 lee1258561 requested a review from a team as a code owner January 9, 2026 08:41
@lee1258561 lee1258561 added the go add ONLY when ready to merge, run all tests label Jan 9, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a performance degradation issue by disabling the collection of memory_full_info. The change is implemented by commenting out the field from PSUTIL_PROCESS_ATTRS. While this is effective, for better long-term maintainability, I suggest removing the line completely instead of commenting it out, as version control can track the history of this change.
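For context, the change under discussion is roughly the following (an illustrative sketch of the attribute list; the exact contents in Ray's reporter agent may differ):

```python
# Illustrative sketch only; the exact attribute list in Ray's reporter agent may differ.
PSUTIL_PROCESS_ATTRS = [
    "pid",
    "create_time",
    "cpu_percent",
    "cpu_times",
    "cmdline",
    "memory_info",
    # "memory_full_info",  # what the PR does: comment the entry out (ray-project/ray#55117)
    "num_fds",
]
# The review's suggestion is to delete the entry entirely instead of commenting it
# out, and let version control preserve the history of the change.
```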

@ray-gardener ray-gardener bot added performance core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Jan 9, 2026
      if hasattr(mem, "shared"):
          total_shm += float(mem.shared)
-     mem_full_info = stat.get("memory_full_info")
+     mem_full_info = stat.get(
+         "memory_full_info"
+     )  # psutil.memory_full_info is too expensive. We currently disable collecting this metrics. https://github.com/ray-project/ray/issues/55117

USS metric still emitted during worker reset despite collection disabled

Medium Severity

The PR removes memory_full_info from PSUTIL_PROCESS_ATTRS to disable USS collection for performance reasons, and tests were updated to not expect the ray_component_uss_mb metric. However, the _generate_reseted_stats_record function still unconditionally emits component_uss_mb with value 0 when worker processes terminate. This creates inconsistent behavior where USS metrics never appear during normal operation but still appear (with only 0 values) during worker resets. The reset function at lines 1144-1150 should also have the USS metric emission removed to be consistent with the PR's intent.
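A hypothetical sketch of the inconsistency described above; the function and metric names come from this comment, but the record structure is assumed and is not the actual Ray implementation:

```python
# Hypothetical sketch, not the actual Ray code: the reset path zeroes out a
# fixed set of per-component gauges when a worker exits.
def _generate_reseted_stats_record(component_name: str):
    labels = {"Component": component_name}
    return [
        ("component_cpu_percentage", 0.0, labels),
        ("component_rss_mb", 0.0, labels),
        ("component_mem_shared_bytes", 0.0, labels),
        # ("component_uss_mb", 0.0, labels),  # dropping this line keeps the reset
        # path consistent with the PR; otherwise ray_component_uss_mb reappears
        # (always 0) on worker exit even though normal collection no longer emits it.
        ("component_num_fds", 0, labels),
    ]
```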

🔬 Verification Test

Why verification test was not possible: This requires running the Ray framework in a full cluster setup with workers that start and terminate, along with Prometheus metric scraping infrastructure. The bug manifests only when worker processes terminate and the reset metrics are generated, which cannot be easily unit tested without the full Ray infrastructure running.


@edoakes
Collaborator

edoakes commented Jan 9, 2026

@lee1258561 if I'm understanding correctly, this means that we will no longer be able to report processes' USS memory consumption. That seems like a pretty big loss, as RSS is often misleading. Have you investigated whether there is a cheaper way to collect USS information, such as directly polling the proc filesystem? Or is that itself the root cause of the performance issue?

@lee1258561
Author

lee1258561 commented Jan 9, 2026

@edoakes thanks for the reply

Yes, that is correct; unfortunately, with what the current PR does, we will lose the USS metrics.

But in my opinion, this has cost Pinterest:

  • a 50% performance loss in an ML training job
  • 3 engineers and more than 3 months of time to debug the issue (see our blog post)

I believe fixing this ASAP for all other Ray users outweighs the loss of the metrics.

As the blog post explains, psutil.memory_full_info takes a lock on the target process's page table, so the target process is blocked until the stats collection has completed.

I asked Gemini whether there is another way to collect the USS stats. It suggested a manual calculation from /proc/[pid]/smaps or /proc/[pid]/smaps_rollup, but reading that file also takes the same lock, so it would not solve the problem.

https://share.google/aimode/wny8iuppUI9KcWu5l

From this investigation I don't see any feasible way to get rid of the lock while still accessing USS. The only options that seem mildly promising are:

  • calculating USS from /proc/[pid]/smaps_rollup, which Gemini suggests is much faster and holds the lock for much less time (a rough sketch of this is below), or
  • reducing the USS collection schedule; currently it is very frequent, every 2.5 seconds.
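
A rough sketch of the smaps_rollup option from the first bullet (assuming Linux; USS is approximated as the sum of the Private_* fields, roughly how psutil derives it on Linux). Note this still reads the kernel's per-process memory accounting, so it is cheaper but not necessarily lock-free:

```python
import re

# Rough sketch: approximate USS for one process by summing the Private_* fields
# of /proc/<pid>/smaps_rollup (Private_Clean + Private_Dirty + Private_Hugetlb).
_PRIVATE_RE = re.compile(r"^Private_\w+:\s+(\d+)\s+kB", re.MULTILINE)

def uss_from_smaps_rollup(pid: int) -> int:
    """Return an approximate USS in bytes for the given pid (Linux only)."""
    with open(f"/proc/{pid}/smaps_rollup") as f:
        contents = f.read()
    return sum(int(kb) for kb in _PRIVATE_RE.findall(contents)) * 1024
```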

@edoakes I'd propose we check in this hotfix first and investigate a proper method of collecting USS stats as a follow-up if it is really important.

@MengjinYan
Contributor

cc: @Kunchd

-     mem_full_info = stat.get("memory_full_info")
+     mem_full_info = stat.get(
+         "memory_full_info"
+     )  # psutil.memory_full_info is too expensive. We currently disable collecting this metrics. https://github.com/ray-project/ray/issues/55117
Contributor


If you're not gonna report it, why even keep this?

Author


As mentioned in a previous thread, @edoakes noted that USS is still important, and we might want to do a follow-up to collect this stat in a less expensive and more acceptable way.

I can also just remove them completely if it is not that important.

Contributor


Unless we're going to actively follow up and bring it back, this is just noise. If it were configurable I'd say keep it.

Author


I can add an environment variable to make this configurable, defaulting to False (for example, along the lines of the sketch below).
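
A minimal sketch of that idea; the variable name RAY_ENABLE_USS_COLLECTION is hypothetical and not an existing Ray flag:

```python
import os

# Hypothetical flag name, default off; not an existing Ray environment variable.
ENABLE_USS_COLLECTION = os.environ.get("RAY_ENABLE_USS_COLLECTION", "0") == "1"

PSUTIL_PROCESS_ATTRS = [
    "pid",
    "cpu_percent",
    "cpu_times",
    "cmdline",
    "memory_info",
    "num_fds",
] + (["memory_full_info"] if ENABLE_USS_COLLECTION else [])
```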

You could also get a close approximation by just taking RSS - shared. You'll still get RSS and shared memory from psutil.memory_info.

Thanks for the advice!

I am not an expert on this approximation, but if you could help provide the formula, I can make it part of this PR as well.
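
For reference, a minimal sketch of the suggested approximation using psutil.memory_info (the shared field is Linux-only, hence the getattr fallback; this is an approximation, not exact USS):

```python
import psutil

def approximate_uss_mb(pid: int) -> float:
    """Cheap USS approximation: RSS minus shared memory from psutil.memory_info().

    memory_info() does not walk the process page table the way
    memory_full_info() does, so it avoids the expensive lock discussed above.
    """
    mem = psutil.Process(pid).memory_info()
    shared = getattr(mem, "shared", 0)  # `shared` is only present on Linux
    return (mem.rss - shared) / (1024 * 1024)
```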

@ZacAttack
Contributor


You could also get a close approximation by just taking RSS - shared. You'll still get RSS and shared memory from psutil.memory_info.

'Technically' speaking, this isn't exactly USS, as USS has a slightly tweaked calculation when it comes to a process that is the single owner of some shared memory (shared memory that has only one accessor automatically gets folded into USS). But... that caveat might not be that interesting.
