CabanaMD with the standard in.lj test case crashes on both LLNL Lassen (Spectrum MPI or MVAPICH2) and LANL Chicoma (Cray MPICH) when communicating between GPUs on the same node. It works when communicating inter-node, though I suspect that is only because the inter-node send path is less strict about error checking than the RMA routines MPI uses for intra-node communication. I've enabled GPU-aware communication in all cases.
The MPI_Send call invoked by Cabana::Gather::apply() (line 335 of Cabana_Halo.cpp) appears to be what is crashing. Here's the Lassen lwcore traceback from Spectrum MPI:
Thanks for the details. I'll test this out when I'm back from travel next week. It looks like I also need to manually restart the CI periodically to try to catch this type of bug.
After more exploration, it's unclear whether this is a CabanaMD problem. I'm seeing multiple cases where small GPU-GPU intra-node sends crash on those systems, but I haven't been able to isolate the cause yet. I'll update as I find out more.