Here's the stack trace from a rank that failed that assertion:
0x00007fd5efcbf121 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#0 0x00007fd5efcbf121 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fd5efcc4e43 in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fd5efcc4d5a in sleep () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fd5efa92d2a in opal_delay_abort () at ../../../../../opal/util/error.c:230
delay = -1
pid = 30618
msg = "[nid00"...
#4 0x00007fd5efa9d649 in show_stackframe (signo=6, info=0x7ffd695af470, p=0x7ffd695af340) at ../../../../../opal/util/stacktrace.c:498
print_buffer = "[nid00"...
tmp = 0x7ffd695aef3b ""
size = 949
ret = 47
si_code_str = 0x7fd5efb4be78 ""
#5 <signal handler called>
No symbol table info available.
#6 0x00007fd5efc2dd2b in raise () from /lib64/libc.so.6
No symbol table info available.
#7 0x00007fd5efc2f3e5 in abort () from /lib64/libc.so.6
No symbol table info available.
#8 0x00007fd5efc25c6a in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#9 0x00007fd5efc25cf2 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#10 0x00007fd5f08564a6 in _opal_list_append (list=0x7fd5f0d5bda0 <ompi_comm_requests_active>, item=0x6eaa78, FILE_NAME=0x7fd5f0c10db8 "../.."..., LINENO=186) at ../../../../opal/class/opal_list.h:545
sentinel = 0x7fd5f0d5bdc8 <ompi_comm_requests_active+40>
__PRETTY_FUNCTION__ = "_opal"...
#11 0x00007fd5f0857612 in ompi_comm_request_start (request=0x6eaa78) at ../../../../ompi/communicator/comm_request.c:186
No locals.
#12 0x00007fd5f0855c30 in ompi_comm_ft_allreduce_intra_nb (inbuf=0x71c894, outbuf=0x71c890, count=1, op=0x432580 <ompi_mpi_op_max>, cid_context=0x71c840, req=0x7ffd695b0460) at ../../../../ompi/communicator/comm_cid.c:1723
rc = 0
context = 0x7581e0
request = 0x6eaa78
subreq = 0x72d080
comm = 0x47c690
__PRETTY_FUNCTION__ = "ompi_"...
failed_group = 0x758208
#13 0x00007fd5f0852589 in ompi_comm_allreduce_getnextcid (request=0x6eaa78) at ../../../../ompi/communicator/comm_cid.c:688
context = 0x71c840
my_id = 34359738368
subreq = 0x7ffd695b0490
flag = true
ret = 0
participate = 1
#14 0x00007fd5f085740f in ompi_comm_request_progress () at ../../../../ompi/communicator/comm_request.c:154
request_item = 0x754fd0
item_complete = 1
rc = 0
request = 0x6eaa78
next = 0x7fd5f0d5bdc8 <ompi_comm_requests_active+40>
progressing = 1
completed = 0
__PRETTY_FUNCTION__ = "ompi_"...
#15 0x00007fd5efa590ed in opal_progress () at ../../../../opal/runtime/opal_progress.c:224
num_calls = 553242877
i = 3
events = 1
#16 0x00007fd5f0850058 in ompi_request_wait_completion (req=0x6eaa78) at ../../../../ompi/request/request.h:493
__PRETTY_FUNCTION__ = "ompi_"...
#17 0x00007fd5f0852366 in ompi_comm_nextcid (newcomm=0x754df0, comm=0x47c690, bridgecomm=0x0, arg0=0x0, arg1=0x0, send_first=true, mode=2048) at ../../../../ompi/communicator/comm_cid.c:632
req = 0x6eaa78
rc = 0
#18 0x00007fd5f08590e0 in ompi_comm_shrink_internal (comm=0x47c690, newcomm=0x7ffd695b0708) at ../../../../ompi/communicator/ft/comm_ft.c:342
rc = 0
exit_status = 0
flag = 1
failed_group = 0x70ac80
comm_group = 0x751fe0
alive_group = 0x755040
alive_rgroup = 0x0
newcomp = 0x754df0
mode = 2048
start = 49.176719257000002
stop = 49.176719192999997
__PRETTY_FUNCTION__ = "ompi_"...
#19 0x00007fd5f0c06ef9 in MPIX_Comm_shrink (comm=0x47c690, newcomm=0x7ffd695b0708) at ../../../../../../../ompi/mpiext/ftmpi/c/comm_shrink.c:48
rc = 0
...
Note that in frame 14 we are working on progress for request 0x6eaa78, but in frame 12 that same request was pulled from the free list and returned by ompi_comm_request_get. So I think the problem is that, somehow, the request is in both the ompi_comm_requests free_list and the ompi_comm_requests_active list at the same time.
You can also see that this request is first obtained in frame 17 by ompi_comm_nextcid.
I'll try to get a good MWE going, but I'm hoping this info helps point in the right direction.
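To make the suspected failure mode concrete, here is a minimal, self-contained sketch in C. It is not Open MPI code: `item_t`, `list_t`, `pool_t`, and every function below are hypothetical stand-ins, and the pool is assumed (for illustration only) to perform no membership check. The sketch models a per-item reference count like the one the failed assertion inspects, and shows how an item that is handed out again while it is still on an active list gets rejected on the second append.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified list item. This is NOT opal_list_item_t. */
typedef struct item {
    struct item *next;
    struct item *prev;
    int refcount;            /* how many lists currently hold this item */
} item_t;

/* Circular doubly linked list with a sentinel head. */
typedef struct list {
    item_t sentinel;
    size_t length;
} list_t;

void list_init(list_t *l) {
    l->sentinel.next = &l->sentinel;
    l->sentinel.prev = &l->sentinel;
    l->sentinel.refcount = 0;
    l->length = 0;
}

/* Append with a debug-style check. A real debug build might assert here;
 * this sketch returns false so the outcome can be inspected. */
bool list_append_checked(list_t *l, item_t *it) {
    if (it->refcount != 0) {
        return false;        /* item is already on some list */
    }
    it->prev = l->sentinel.prev;
    it->next = &l->sentinel;
    l->sentinel.prev->next = it;
    l->sentinel.prev = it;
    it->refcount++;
    l->length++;
    return true;
}

/* Hypothetical free pool: a small stack that does no membership check
 * (an assumption made for this illustration). */
typedef struct pool {
    item_t *slots[4];
    size_t count;
} pool_t;

void pool_push(pool_t *p, item_t *it) {
    assert(p->count < 4);
    p->slots[p->count++] = it;
}

item_t *pool_pop(pool_t *p) {
    return p->count ? p->slots[--p->count] : NULL;
}

/* Returns true when the second start of the same item is rejected. */
bool demo_double_start(void) {
    list_t active;
    pool_t pool = { .count = 0 };
    item_t req = { NULL, NULL, 0 };

    list_init(&active);
    pool_push(&pool, &req);

    item_t *first = pool_pop(&pool);
    bool ok_first = list_append_checked(&active, first);

    /* Simulated bug: the item goes back to the pool while still active. */
    pool_push(&pool, first);

    item_t *second = pool_pop(&pool);
    bool ok_second = list_append_checked(&active, second);

    return ok_first && !ok_second && first == second && active.length == 1;
}
```

Calling `demo_double_start()` from a `main` function returns `true` in this model: the second append is refused because the item's count is still nonzero. The sketch only illustrates the invariant behind the assertion; it does not show where, or whether, such an early release actually happens in Open MPI.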
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
main b79b3e9
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
.08e41ed 3rd-party/openpmix (v1.1.3-4067-g08e41ed5)
30cadc6746ebddd69ea42ca78b964398f782e4e3 3rd-party/prrte (psrvr-v2.0.0rc1-4839-g30cadc6746)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)
Please describe the system on which you are running
Details of the problem
MPIX_Comm_shrink intermittently never returns, despite all ranks participating (verified with GDB). I'm using these MCA parameters:
--mca btl tcp,sm,self
and these configuration flags:
This is true with and without configuring with --debug, but with debug I also get the following error from a nondeterministic rank:
../../../../opal/class/opal_list.h:545: _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed.