RCCL deadlock discussion (was: increase NCCL_STEPS to match WARPSIZE/4) #1600
base: develop
Conversation
Thank you @jglaser ! This is a valuable contribution. |
@jglaser While I believe the deadlock you mentioned could be real, I am not convinced that the cause is WARP_SIZE. @nicholasmalaya can you help arrange a call to discuss? I think I still have Jens' email address, as we exchanged some emails 3 years ago on a different issue. |
Yes. I will reach out for us to discuss this in more detail. |
@jglaser can you help dump the proxy state with the patch below to confirm the reason the sender/receiver is stuck?
|
OK. Interestingly, RCCL performance also seems to be significantly improved (>2x) since that commit. To find out, I used
Use the wrapper with
As the commit is unfortunately quite large, it would be interesting to pinpoint the source code line that caused the change in behavior. |
@jglaser when there is a large number of GPUs, it is hard to debug with rocgdb. For example, we may find that GPU N is stuck waiting for a receive, but that is because GPU N-1 didn't send. And GPU N-1 didn't send because it is waiting for a receive from GPU N-2... We need all the logs from the entire ring to identify the root of the problem. The latest develop branch has a lot of improvements for getting GPU kernel logs in this scenario, but CPU proxy-side logging is still lacking. |
Here is a stalled trace; after the "Continuing", the GPUs are hanging
and a successful one with the latest HEAD. Command:
|
and the first half, too, which includes the RCCL setup |
Caveat: I have another example here (RL training), which freezes even with the ccb08 commit.. (in allgather) |
@jglaser Increasing NCCL_STEPS makes the GPU kernel-side FIFO deeper. I am wondering if this somehow helps with the network stack, which has another FIFO. There is NCCL_OFI_MAX_REQUESTS in the aws ofi plugin: https://github.com/ROCm/aws-ofi-rccl/blob/17d41cbf5618536c4b1076d29748416ab307040f/include/nccl_ofi.h#L68 |
Another observation: using the reproducer as before with the failing commit 43575..., but when RCCL has been compiled with |
I should perhaps mention that I replaced the thread barrier in that commit with my own, but I am not sure if that led to the optimization-level-dependent behavior (this one avoids potential memory ordering issues by relying entirely on atomicCAS).
|
@jglaser if you are testing collectives like allreduce, broadcast, etc., barrier_by_group() will call __builtin_amdgcn_s_barrier(), because RCCL will not reduce nthreads from NCCL_MAX_NTHREADS, which is 256. |
I think I found the issue of what was causing the hang in my application: multiple (N/R)CCL communicators used by multiple CPU threads. It is well known that NCCL is not thread safe, see e.g. When I analyzed the call pattern from my app, I noticed that there was more than one communicator (intra+internode from FSDP HYBRID_SHARD). Each communicator had its own HIP stream associated with it. However, these communicators were shared between different host threads. That creates a danger, because the launch order on the host side is random, whereas the GPU synchronizes (or rather, serializes) kernels launched to the same stream. Therefore, deadlocks are expected. I confirmed the thread-unsafe behavior of RCCL with this code
RCCL thread-unsafe reproducer: this demonstration confirms that launching RCCL collectives on the same GPU from different host threads can deadlock (a sketch of the pattern is included below).
|
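For illustration, here is a minimal Python sketch of the hazardous pattern described above (not the author's original reproducer): two communicators (process groups) are driven from two unsynchronized host threads via PyTorch's NCCL/RCCL backend, so different ranks may enqueue the collectives in different orders. The script and group names are hypothetical; it would be launched with e.g. `torchrun --nproc_per_node=2 repro_threads.py`.

```python
# Hypothetical sketch of the unsafe pattern (not the author's reproducer):
# two communicators, each driven by its own host thread, with no ordering
# guarantee between the threads. Different ranks can therefore enqueue the
# collectives in different orders, which can deadlock inside NCCL/RCCL.
import os
import threading

import torch
import torch.distributed as dist


def worker(group, tensor, iters=100):
    # Repeatedly launch a collective on this communicator.
    for _ in range(iters):
        dist.all_reduce(tensor, group=group)


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    ranks = list(range(dist.get_world_size()))
    pg_a = dist.new_group(ranks)  # e.g. an "intra-node"-like communicator
    pg_b = dist.new_group(ranks)  # e.g. an "inter-node"-like communicator

    x = torch.ones(1 << 20, device="cuda")
    y = torch.ones(1 << 20, device="cuda")

    # Two host threads, one per communicator; their interleaving is random,
    # so the host-side launch order is not consistent across ranks.
    threads = [
        threading.Thread(target=worker, args=(pg_a, x)),
        threading.Thread(target=worker, args=(pg_b, y)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    torch.cuda.synchronize()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```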
On the Python side, I could confirm that PyTorch uses multiple threads to parallelize the backward pass (a small script to observe this is sketched below)
output
Notice PID != Thread ID |
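A small hypothetical script (not the author's) that makes this visible by printing the process id and thread id from a tensor hook during backward():

```python
# Hedged sketch: observe which host threads execute backward-pass work by
# printing the process id and thread id from a tensor hook. For a CUDA graph,
# the hook typically fires from an autograd worker thread, i.e. the PID stays
# the same while the thread id differs from the main thread.
import os
import threading

import torch


def report(tag):
    print(f"{tag}: pid={os.getpid()} thread_id={threading.get_ident()} "
          f"name={threading.current_thread().name}")


report("main thread")

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
x.register_hook(lambda grad: report("backward hook"))
(x * 2.0).sum().backward()
```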
Now the mitigation strategy became obvious:
Torch has an option to turn off threading during autograd:
|
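A minimal sketch, assuming the option referred to is torch.autograd.set_multithreading_enabled (available in recent PyTorch releases); it may differ from the exact setting used in the comment above:

```python
# Hedged sketch: turn off multi-threaded backward so that collectives launched
# from autograd hooks are issued from a single host thread.
import torch

model = torch.nn.Linear(128, 128).cuda()
loss = model(torch.randn(32, 128, device="cuda")).sum()

with torch.autograd.set_multithreading_enabled(False):
    # Backward no longer fans work out to per-device autograd worker threads.
    loss.backward()
```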
To ensure that streams sync before and after collectives, a little more effort is needed: another context manager
Using these two together fixes the hang for me. Performance may not be optimal and NCCL usage in torch autograd may have to be re-examined. |
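A minimal sketch of what such a context manager could look like (this is not the author's patch; it simply brackets a block of collective calls with full device synchronization):

```python
# Hedged sketch: force a device-wide synchronization before and after a block
# of collective calls, so no collective is enqueued while kernels from another
# stream/communicator are still in flight.
import contextlib

import torch


@contextlib.contextmanager
def synced_collectives():
    torch.cuda.synchronize()  # drain all streams before the collective(s)
    try:
        yield
    finally:
        torch.cuda.synchronize()  # and again before any later work is queued


# Usage, assuming a hypothetical process group `pg` and tensor `grad`:
# with synced_collectives():
#     torch.distributed.all_reduce(grad, group=pg)
```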
@thananon (as per offline discussion) try this patch ... if it still hangs, the next escalation strategy would be to replace the above two calls to |
@jglaser You found me here. Yes, I incorporated this patch in the latest job as you suggested. I hope we get a good result back. |
After fixing the first hang, my app was able to make it to an error message, which I was able to fix. Subsequently, however, it still sometimes hung. This is a minimal version of the patch that expands the coverage of distributed collectives and has only a single (necessary) synchronization point (sketched below).
Using this version, and with autograd threads disabled as before, my app runs to completion on 64 nodes using rccl commit 532f54c (or actually, to the next OOM :) The likely cause is that running concurrent collectives on different streams/communicators (e.g. intra/internode) still requires stream dependencies to be set, which may (or may not) be implemented in FSDP. The above sync should be unnecessary once NCCL upstream 2.26 is merged into RCCL, so that users can set NCCL_LAUNCH_ORDER_IMPLICIT. Then, they only need to ensure that the host side order of calls is consistent, e.g., by disabling threads. |
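A hedged sketch of that idea (not the actual patch): wrap the most common torch.distributed collectives so that each launch is preceded by one synchronization point; the list of wrapped functions is illustrative only.

```python
# Hedged sketch (not the author's patch): prepend a single synchronization
# point to common torch.distributed collectives, so the host-side launch order
# stays safe even when several communicators/streams are in play.
import functools

import torch
import torch.distributed as dist

# Names chosen for illustration; a real patch would cover whatever the
# application actually calls.
_WRAPPED = ("all_reduce", "all_gather_into_tensor", "reduce_scatter_tensor",
            "broadcast", "barrier")


def _with_sync(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        torch.cuda.synchronize()  # the single synchronization point
        return fn(*args, **kwargs)
    return wrapper


for name in _WRAPPED:
    setattr(dist, name, _with_sync(getattr(dist, name)))
```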
pytorch/pytorch#147729 seems related |
@jglaser can you try building pytorch/pytorch#148590 to see if it resolves the deadlock? Or do you not expect it to resolve the issue? |
Will do -- and just FYI, I still did encounter a hang with my RL app during collectives, after switching to full BF16 training (instead of mixed).... will investigate @thananon |
@jglaser Just realizing #148590 was relanded as pytorch/pytorch@acf5139 due to merge conflict. So it's been in main since Mon Mar 31. No need to build that PR yourself, it should be in nightly wheels. Have you given a nightly wheel a try yet? |
Details
Work item: GH issue pending
What were the changes?
Increase the pipeline size (NCCL_STEPS) from 8 to 16 to maintain correct synchronization behavior
Why were the changes made?
On Frontier at OLCF with 8 x MI250X nodes, RCCL sporadically stalls, especially when executed on many (>~ 128) nodes, or when using the libfabric plugin (aws-ofi-rccl). The proposed change adjusts the pipeline width in the 'simple' communication protocol to be compatible with the larger warp (wavefront) size on AMD, which is 64, compared to 32 on NVIDIA hardware.
How was the outcome achieved?
A debug trace with rocgdb demonstrated that, under stall conditions and during an NCCL broadcast, some threads on the first rank in the ring (GPU 0) are hanging inside a busy-wait loop (waitPeer); see rccl/src/device/prims_simple.h, lines 118 to 127 in 4237caa.
A closer inspection revealed that GPU 0 was eight steps (NCCL_STEPS=8) ahead of GPU 1 and waiting for it to be ready so that GPU 0 could send, while GPU 1 was waiting on GPU 0 to receive; i.e., the two GPUs were found in a circular deadlock inside waitPeer, which should not be allowed by design. However, to ensure asynchronous progress (i.e., GPU 1 sending and receiving at the same time), the original NCCL design uses warp-level synchronization and a FIFO pipeline width that matches the maximum number (8) of progress groups of four threads each (waitRecv, waitSend, postRecv, and postSend) that fit into a warp. If more than this number of groups fits into a warp, the warp-level synchronization will stall asynchronous progress made by another pipeline. Increasing NCCL_STEPS to 16 guarantees that one wavefront exactly fits one pipeline.
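The arithmetic behind the proposed value can be sketched as follows (an illustration of the rationale above, not RCCL source code; the constant names are assumed for clarity):

```python
# Illustrative sketch of the NCCL_STEPS rationale described above; the names
# below are chosen for clarity and are not taken from the RCCL sources.
THREADS_PER_SYNC_GROUP = 4  # one thread each for waitRecv, waitSend, postRecv, postSend

for hw, warp_size in [("NVIDIA warp", 32), ("AMD wavefront", 64)]:
    groups_per_warp = warp_size // THREADS_PER_SYNC_GROUP
    print(f"{hw}: size {warp_size} -> {groups_per_warp} groups of 4 threads "
          f"-> NCCL_STEPS = {groups_per_warp}")

# NVIDIA: 32 / 4 = 8  (upstream default NCCL_STEPS=8)
# AMD:    64 / 4 = 16 (this PR proposes NCCL_STEPS=16)
```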
Additional Documentation:
It may be necessary to assess the performance implications of this patch and to re-tune collective parameters.
Regarding the reproducer:
On Frontier, the stalling behavior is also modulated by the libfabric version. I had the most "luck" with libfabric 1.20.1 with the CXI provider, ROCm 6.2.4, and aws-ofi-rccl commit 17d41cbf5618536c4b1076d29748416ab307040f.
I have a local reproducer code that I shared with AMD developers, which uses JAX, but it is not yet minimal.
Approval Checklist
Do not approve until these items are satisfied.