[Ray Core] remove memory full info #60000
base: master
Conversation
Code Review
This pull request addresses a performance degradation issue by disabling the collection of memory_full_info. The change is implemented by commenting out the field from PSUTIL_PROCESS_ATTRS. While this is effective, for better long-term maintainability, I suggest removing the line completely instead of commenting it out, as version control can track the history of this change.
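For illustration, here is a minimal sketch of what that suggestion could look like, assuming PSUTIL_PROCESS_ATTRS is a plain list of psutil attribute names; the surrounding entries shown are illustrative placeholders, not the exact contents of Ray's list:

    # Illustrative sketch: drop the attribute instead of commenting it out.
    # The neighboring entries are placeholders, not Ray's exact list.
    PSUTIL_PROCESS_ATTRS = [
        "pid",
        "create_time",
        "cpu_percent",
        "cpu_times",
        "cmdline",
        "memory_info",
        # "memory_full_info" removed: on Linux psutil gathers it by reading
        # /proc/<pid>/smaps, which is expensive (see issue #55117).
    ]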
      if hasattr(mem, "shared"):
          total_shm += float(mem.shared)
-     mem_full_info = stat.get("memory_full_info")
+     mem_full_info = stat.get(
+         "memory_full_info"
+     )  # psutil.memory_full_info is too expensive. We currently disable collecting this metrics. https://github.com/ray-project/ray/issues/55117
USS metric still emitted during worker reset despite collection disabled
Medium Severity
The PR removes memory_full_info from PSUTIL_PROCESS_ATTRS to disable USS collection for performance reasons, and tests were updated to not expect the ray_component_uss_mb metric. However, the _generate_reseted_stats_record function still unconditionally emits component_uss_mb with value 0 when worker processes terminate. This creates inconsistent behavior where USS metrics never appear during normal operation but still appear (with only 0 values) during worker resets. The reset function at lines 1144-1150 should also have the USS metric emission removed to be consistent with the PR's intent.
🔬 Verification Test
Why verification test was not possible: This requires running the Ray framework in a full cluster setup with workers that start and terminate, along with Prometheus metric scraping infrastructure. The bug manifests only when worker processes terminate and the reset metrics are generated, which cannot be easily unit tested without the full Ray infrastructure running.
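A hypothetical sketch of the consistency fix this comment suggests: only include the USS gauge in a reset record when USS collection is actually enabled. The helper and field names below are illustrative and are not Ray's actual _generate_reseted_stats_record:

    # Hypothetical illustration only; not Ray's reporter code.
    ENABLE_USS_COLLECTION = False  # mirrors whether "memory_full_info" is collected

    def generate_reset_record() -> dict:
        # Zero out only the gauges that are emitted during normal operation,
        # so reset records stay consistent with what is actually collected.
        record = {
            "component_cpu_percentage": 0.0,
            "component_rss_mb": 0.0,
        }
        if ENABLE_USS_COLLECTION:
            record["component_uss_mb"] = 0.0
        return record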
@lee1258561 if I'm understanding correctly, this means that we will no longer be able to report process' USS?
@edoakes thanks for the reply. Yes, that is correct; unfortunately, with what the current PR does we will lose the USS metrics. But in my opinion, this has caused Pinterest:
And I believe fixing this ASAP for all other Ray users outweighs the loss of the metrics. As the blog post describes, I asked Gemini whether there is another way to collect the USS stats. It told me that we can do a manual calculation from this file: https://share.google/aimode/wny8iuppUI9KcWu5l From this investigation I don't see any feasible way to get rid of the lock while still accessing USS. The only thing that seems mildly promising is:
@edoakes I'd propose we check in this hot fix first and investigate a proper method of collecting USS stats as a follow-up if it is really important.
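For reference, one manual approach that comes up for this (a sketch only, not validated for Ray's reporter): on Linux, USS can be approximated by summing the Private_Clean and Private_Dirty fields of /proc/<pid>/smaps_rollup, which is much cheaper than walking the full smaps file; whether it sidesteps the kernel lock contention mentioned above would still need to be measured:

    import re

    def approx_uss_bytes(pid: int) -> int:
        # Approximate USS by summing private pages from smaps_rollup (Linux 4.14+),
        # avoiding the per-mapping walk behind psutil.memory_full_info().
        total_kb = 0
        with open(f"/proc/{pid}/smaps_rollup") as f:
            for line in f:
                m = re.match(r"Private_(Clean|Dirty):\s+(\d+) kB", line)
                if m:
                    total_kb += int(m.group(2))
        return total_kb * 1024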
cc: @Kunchd |
-     mem_full_info = stat.get("memory_full_info")
+     mem_full_info = stat.get(
+         "memory_full_info"
+     )  # psutil.memory_full_info is too expensive. We currently disable collecting this metrics. https://github.com/ray-project/ray/issues/55117
If you're not gonna report it, why even keep this?
As discussed in a previous thread, @edoakes mentioned that USS is still important and we might want to follow up to collect this stat in a less expensive and more acceptable way.
I can also just remove it completely if it is not that important.
Unless we're going to actively follow up and bring it back, this is just noise. If it were configurable I'd say keep it.
I can add an environment variable to make this configurable, defaulting to False.
You could also get a close approximation by just taking RSS - shared. You'll still get RSS and shared memory from psutil.memory_info.
Thanks for the advice!
I am not an expert on how to do this approximation, but if you could help provide the formula, I can make it part of this PR as well.
You could also get a close approximation by just taking RSS - shared. You'll still get RSS and shared memory from psutil.memory_info.
'Technically' speaking this isn't exactly USS, as USS has a slightly tweaked calculation when it comes to a process that is the sole owner of some shared memory (shared memory with only one accessor is automatically folded into USS). But... that caveat might not be that interesting.
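A minimal sketch of the RSS - shared approximation discussed above, gated behind an environment variable as proposed earlier in the thread (the variable name here is made up for illustration):

    import os
    from typing import Optional

    import psutil

    # Hypothetical flag name; the thread only proposes "an environment variable".
    ENABLE_USS_APPROX = os.environ.get("RAY_ENABLE_USS_APPROX", "0") == "1"

    def approx_uss_mb(proc: psutil.Process) -> Optional[float]:
        if not ENABLE_USS_APPROX:
            return None
        mem = proc.memory_info()  # cheap summary stats; no smaps walk
        shared = getattr(mem, "shared", 0)  # "shared" is only present on Linux
        # Caveat from the comment above: shared memory with a single accessor
        # is folded into true USS, so this slightly underestimates it.
        return (mem.rss - shared) / 1024 / 1024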
Description
memory_full_info is very expensive and degrades job performance.
Detailed GitHub issue: #55117
Pinterest blog post outlining the debugging details: https://medium.com/@Pinterest_Engineering/tracking-down-mysterious-ml-training-stalls-5290bb19be6d
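As a quick way to see the cost difference described above (a sketch; absolute numbers vary by platform and process size), the two psutil calls can be timed side by side:

    import time

    import psutil

    proc = psutil.Process()

    def avg_seconds(fn, n=100):
        start = time.perf_counter()
        for _ in range(n):
            fn()
        return (time.perf_counter() - start) / n

    # memory_info() reads a small per-process summary, while memory_full_info()
    # has to walk per-mapping data (/proc/<pid>/smaps*) on Linux, which is much
    # slower and takes kernel locks that can stall the profiled process.
    print("memory_info      :", avg_seconds(proc.memory_info))
    print("memory_full_info :", avg_seconds(proc.memory_full_info))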
Related issues
#55117
Additional information