Make two improvements to the cache of remote data #26923
This PR makes two improvements to the runtime code supporting `--cache-remote` with `CHPL_COMM=ofi`.
Freeing nonblocking descriptors more eagerly
The issue is that the `--cache-remote` code does non-blocking PUTs (generally speaking), and the ofi comm layer allocates some memory for each nonblocking handle. There is code to free a handle in `do_wait_for`, but this code only frees one handle per trip through the `while` loop, and before this PR the loop stopped as soon as it saw a pending request with a later sequence number, regardless of whether the comm layer had already completed that request. That behavior could leave many comm handles for completed operations stacked up, waiting to be freed. This PR adjusts the loop so that it keeps running and continues freeing handles for requests that the comm layer has already completed.
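For illustration only, here is a minimal C sketch of the shape of this change. The type and helper names (`nb_handle_t`, `handle_is_complete`, `free_handle`, `seqn`) are hypothetical and do not match the actual ofi comm layer code; the point is just that the loop no longer stops at the first handle with a later sequence number and instead keeps freeing any handles whose operations are already done.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nonblocking-handle type; the real ofi comm layer differs. */
typedef struct nb_handle {
  uint64_t seqn;            /* sequence number of the nonblocking op */
  bool complete;            /* has the comm layer finished this op? */
  struct nb_handle* next;   /* pending-handle list */
} nb_handle_t;

static bool handle_is_complete(nb_handle_t* h) { return h->complete; }
static void free_handle(nb_handle_t* h) { (void) h; /* release comm-layer memory */ }

/* Before: the loop stopped at the first handle with a later sequence number,
 * even if the comm layer had already completed it, so completed handles could
 * pile up behind it.  After (sketched here): walk the list and free every
 * handle that is already complete, stopping only at one that is genuinely
 * still pending and beyond the sequence number we are waiting for. */
static void reap_completed_handles(nb_handle_t** list, uint64_t upto_seqn) {
  nb_handle_t** cur = list;
  while (*cur != NULL) {
    nb_handle_t* h = *cur;
    if (handle_is_complete(h)) {
      *cur = h->next;        /* unlink and free eagerly */
      free_handle(h);
    } else if (h->seqn > upto_seqn) {
      break;                 /* genuinely pending and past what we need */
    } else {
      cur = &h->next;        /* still pending; keep scanning */
    }
  }
}
```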
Thread fences
In separate work, I had some code using a per-pthread data structure that is shared by many tasks, and I was using a similar "lock"-style strategy as in the cache from PR #16885. I ran into problems where multiple tasks seemed to be accessing the same pthread-local state, but I was able to fix that by adding acquire and release thread fences. On x86, I don't think these fences change what the hardware does (because of Total Store Order), but they do constrain what compiler optimization can do. I think we have just gotten lucky in terms of not yet being bitten by compiler optimization breaking these assumptions for `--cache-remote`. So, this PR adds an acquire fence when reading the "lock", unless I'm pretty sure the code already holds it (this applies to a bunch of asserts and other checking); it adds an acquire fence when acquiring the lock; and it adds a release fence when releasing the lock. These fences should prevent the compiler or processor from reordering operations in a way that breaks the "lock", even though a fence by itself does not synchronize between threads.
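As a rough illustration (not the actual cache code), here is what the fence placement looks like with C11 atomics. The `cache_state` struct, its `locked` flag, and the helper names are hypothetical; the real cache uses its own per-pthread structures and the runtime's atomic wrappers, and its "lock" flag is only touched by tasks running on the same pthread, so the sketch is only meant to show where the acquire and release fences sit relative to reading, taking, and dropping the lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-pthread structure with a simple "lock" flag. */
struct cache_state {
  _Atomic bool locked;
  /* ... cached remote data ... */
};

/* Reading the "lock" (e.g. in asserts): acquire fence so later reads of the
 * protected state are not hoisted above this check by the compiler. */
static bool cache_is_locked(struct cache_state* s) {
  bool held = atomic_load_explicit(&s->locked, memory_order_relaxed);
  atomic_thread_fence(memory_order_acquire);
  return held;
}

/* Acquiring the "lock": acquire fence so accesses to the protected state
 * stay after the point where the lock is observed as taken. */
static void cache_lock(struct cache_state* s) {
  while (atomic_exchange_explicit(&s->locked, true, memory_order_relaxed)) {
    /* spin; a real implementation would yield the task instead */
  }
  atomic_thread_fence(memory_order_acquire);
}

/* Releasing the "lock": release fence so writes to the protected state are
 * not sunk below the store that drops the lock. */
static void cache_unlock(struct cache_state* s) {
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(&s->locked, false, memory_order_relaxed);
}
```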