Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make two improvements to the cache of remote data #26923

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mppf
Copy link
Member

@mppf mppf commented Mar 13, 2025

This PR makes two improvements to the runtime code supporting --cache-remote

  1. It frees nonblocking operation descriptors more eagerly, which impacts CHPL_COMM=ofi
  2. It adds thread fences to prevent the compiler or processor from reordering around the "lock" strategy added in PR Adjust remote cache to better handle task yields in comm events #16885.

Freeing nonblocking descriptors more eagerly

The issue is that the --cache-remote code will do non-blocking PUTs (generally speaking). Meanwhile, the ofi comm layer allocates some memory for each nonblocking handle. There is code to free a handle in do_wait_for, but this code will only free one handle per trip through the while loop, and the while loop is stopped if the pending sequence numbers of the later requests are larger. However, the runtime might have already completed such requests, so this PR adjusts this loop so that it keeps running until one of the following conditions is met:

  • there are no more pending requests
  • there is a pending request with a later sequence number that the comms layer does not think is already complete

Before this PR, this loop would stop if there was a pending request with a later sequence number, regardless of whether the comms layer thinks it is complete. That behavior resulted in the possibility of having many comm handles stacking up for completed operations, waiting to be freed.

Thread fences

In separate work, I had some code using a per-pthread data structure that is shared by many tasks & I was using a similar "lock"ing strategy as in the cache from PR #16885. I ran into problems where multiple tasks seemed to be accessing the same pthread-local state, but I was able to fix that by adding thread fences to aquire and release. On x86, I don't think these will impact what the hardware is doing (because of Total Store Order), but they will impact what compiler optimization does. I think we have just gotten lucky in terms of not being bitten yet by compiler optimization breaking these assumptions for --cache-remote. So, this PR adds an acquire fence when reading the "lock" unless I'm pretty sure the code is already holding the lock (this is done in a bunch of asserts and other checking); it adds an acquire fence when acquiring the lock, and it adds a release fence when releasing the lock. These fences should prevent compiler or processor from reordering things in a way that break the "lock", even though it does not synchronize between threads.

  • full comm=gasnet oversubscribed testing

Sorry, something went wrong.

mppf added 2 commits March 13, 2025 17:27
---
Signed-off-by: Michael Ferguson <mppf@users.noreply.github.com>
---
Signed-off-by: Michael Ferguson <mppf@users.noreply.github.com>
@mppf mppf changed the title Fix too much memory being used by comm handles on ofi Make two improvements to the cache of remote data Mar 18, 2025
@mppf mppf requested a review from jhh67 March 18, 2025 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants