MPIX_Comm_shrink intermittently never returns #13138

Open
Matthew-Whitlock opened this issue Mar 12, 2025 · 0 comments

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

main b79b3e9

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

08e41ed 3rd-party/openpmix (v1.1.3-4067-g08e41ed5)
30cadc6746ebddd69ea42ca78b964398f782e4e3 3rd-party/prrte (psrvr-v2.0.0rc1-4839-g30cadc6746)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)

Please describe the system on which you are running

  • Operating system/version: SLES
  • Computer hardware: x86, using CPU only
  • Network type: Single node using --mca btl tcp,sm,self

Details of the problem

MPIX_Comm_shrink intermittently never returns, despite all ranks participating (verified with GDB). I'm using these MCA parameters:

--mca opal_base_help_aggregate 0 --mca coll ^han --mca btl tcp,sm,self --mca pml ob1 --mca opal_abort_delay -1

and these configuration flags:

--enable-static --enable-shared --disable-oshmem --disable-mpi-fortran --with-slurm --with-ft --with-libevent=internal --with-hwloc=internal --with-prrte=internal --with-pmix=internal --disable-sphinx
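For completeness, the build was the standard git-clone flow; a sketch of the steps (the install prefix is a placeholder):

git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
./configure --prefix=$HOME/ompi-install \
    --enable-static --enable-shared --disable-oshmem --disable-mpi-fortran \
    --with-slurm --with-ft --with-libevent=internal --with-hwloc=internal \
    --with-prrte=internal --with-pmix=internal --disable-sphinx
make -j install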

This happens with and without configuring with --enable-debug, but with debug builds I also get the following assertion failure from a nondeterministic rank:

../../../../opal/class/opal_list.h:545: _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed.

Here's the stack trace from a rank that failed that assertion:

#0  0x00007fd5efcbf121 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fd5efcc4e43 in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007fd5efcc4d5a in sleep () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007fd5efa92d2a in opal_delay_abort () at ../../../../../opal/util/error.c:230
        delay = -1
        pid = 30618
        msg = "[nid00"...
#4  0x00007fd5efa9d649 in show_stackframe (signo=6, info=0x7ffd695af470, p=0x7ffd695af340) at ../../../../../opal/util/stacktrace.c:498
        print_buffer = "[nid00"...
        tmp = 0x7ffd695aef3b ""
        size = 949 
        ret = 47
        si_code_str = 0x7fd5efb4be78 ""
#5  <signal handler called>
No symbol table info available.
#6  0x00007fd5efc2dd2b in raise () from /lib64/libc.so.6
No symbol table info available.
#7  0x00007fd5efc2f3e5 in abort () from /lib64/libc.so.6
No symbol table info available.
#8  0x00007fd5efc25c6a in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#9  0x00007fd5efc25cf2 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#10 0x00007fd5f08564a6 in _opal_list_append (list=0x7fd5f0d5bda0 <ompi_comm_requests_active>, item=0x6eaa78, FILE_NAME=0x7fd5f0c10db8 "../.."..., LINENO=186) at ../../../../opal/class/opal_list.h:545
        sentinel = 0x7fd5f0d5bdc8 <ompi_comm_requests_active+40>
        __PRETTY_FUNCTION__ = "_opal"...
#11 0x00007fd5f0857612 in ompi_comm_request_start (request=0x6eaa78) at ../../../../ompi/communicator/comm_request.c:186
No locals.
#12 0x00007fd5f0855c30 in ompi_comm_ft_allreduce_intra_nb (inbuf=0x71c894, outbuf=0x71c890, count=1, op=0x432580 <ompi_mpi_op_max>, cid_context=0x71c840, req=0x7ffd695b0460) at ../../../../ompi/communicator/comm_cid.c:1723
        rc = 0 
        context = 0x7581e0
        request = 0x6eaa78
        subreq = 0x72d080
        comm = 0x47c690
        __PRETTY_FUNCTION__ = "ompi_"...
        failed_group = 0x758208
#13 0x00007fd5f0852589 in ompi_comm_allreduce_getnextcid (request=0x6eaa78) at ../../../../ompi/communicator/comm_cid.c:688
        context = 0x71c840
        my_id = 34359738368
        subreq = 0x7ffd695b0490
        flag = true
        ret = 0 
        participate = 1 
#14 0x00007fd5f085740f in ompi_comm_request_progress () at ../../../../ompi/communicator/comm_request.c:154
        request_item = 0x754fd0
        item_complete = 1 
        rc = 0 
        request = 0x6eaa78
        next = 0x7fd5f0d5bdc8 <ompi_comm_requests_active+40>
        progressing = 1 
        completed = 0 
        __PRETTY_FUNCTION__ = "ompi_"...
#15 0x00007fd5efa590ed in opal_progress () at ../../../../opal/runtime/opal_progress.c:224
        num_calls = 553242877
        i = 3
        events = 1
#16 0x00007fd5f0850058 in ompi_request_wait_completion (req=0x6eaa78) at ../../../../ompi/request/request.h:493
        __PRETTY_FUNCTION__ = "ompi_"...
#17 0x00007fd5f0852366 in ompi_comm_nextcid (newcomm=0x754df0, comm=0x47c690, bridgecomm=0x0, arg0=0x0, arg1=0x0, send_first=true, mode=2048) at ../../../../ompi/communicator/comm_cid.c:632
        req = 0x6eaa78
        rc = 0
#18 0x00007fd5f08590e0 in ompi_comm_shrink_internal (comm=0x47c690, newcomm=0x7ffd695b0708) at ../../../../ompi/communicator/ft/comm_ft.c:342
        rc = 0
        exit_status = 0
        flag = 1
        failed_group = 0x70ac80
        comm_group = 0x751fe0
        alive_group = 0x755040
        alive_rgroup = 0x0
        newcomp = 0x754df0
        mode = 2048
        start = 49.176719257000002
        stop = 49.176719192999997
        __PRETTY_FUNCTION__ = "ompi_"...
#19 0x00007fd5f0c06ef9 in MPIX_Comm_shrink (comm=0x47c690, newcomm=0x7ffd695b0708) at ../../../../../../../ompi/mpiext/ftmpi/c/comm_shrink.c:48
        rc = 0
...

Note that in frame 14 we are progressing request 0x6eaa78, but in frame 12 that same request was pulled from the free list and returned by ompi_comm_request_get. So I think the problem is that, somehow, the request ends up on both the ompi_comm_requests free list and the ompi_comm_requests_active list at the same time.

You can also see that this request is first obtained in frame 17 by ompi_comm_nextcid.
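To make that failure mode concrete, here is a small standalone sketch (my own simplified model of the refcount check, not the actual opal_list code) showing why appending an item that still sits on another list trips the 0 == item->opal_list_item_refcount assertion:

#include <assert.h>
#include <stdio.h>

/* Simplified stand-ins for opal_list_item_t / opal_list_t. */
typedef struct item {
    struct item *next;
    int refcount;              /* models opal_list_item_refcount */
} item_t;

typedef struct {
    item_t *head;
} list_t;

static void list_append(list_t *list, item_t *it)
{
    /* Models the check at opal_list.h:545: an item may only be
     * a member of one list at a time. */
    assert(0 == it->refcount);
    it->refcount = 1;
    it->next = list->head;
    list->head = it;
}

int main(void)
{
    list_t free_list = { NULL };
    list_t active_list = { NULL };
    item_t request = { NULL, 0 };

    /* The request sits on the free list... */
    list_append(&free_list, &request);

    /* ...and is handed out again without ever being removed from it.
     * The second append aborts, matching the assertion in the trace. */
    list_append(&active_list, &request);

    puts("not reached when the double-listing occurs");
    return 0;
}

Compiled and run, the second list_append aborts with the same kind of assertion failure, which is what you'd expect if a request is started while it is still on the free list.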

I'll try to put together a good minimal working example (MWE), but I'm hoping this info helps point in the right direction.
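In the meantime, the reproducer I have in mind is the usual ULFM pattern: kill one rank, then shrink from the survivors. An untested sketch (I haven't verified this exact program triggers the hang):

#include <mpi.h>
#include <mpi-ext.h>    /* MPIX_Comm_shrink, from the ftmpi extension */
#include <signal.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Survive peer failures instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == size - 1) {
        raise(SIGKILL);    /* simulate a process failure */
    }

    /* All surviving ranks enter shrink; in the bad runs it never returns. */
    MPI_Comm newcomm;
    int rc = MPIX_Comm_shrink(MPI_COMM_WORLD, &newcomm);
    printf("rank %d: MPIX_Comm_shrink returned %d\n", rank, rc);

    if (MPI_SUCCESS == rc) {
        MPI_Comm_free(&newcomm);
    }
    MPI_Finalize();
    return 0;
}

This would be launched with the MCA parameters above plus the FT run-time option, something like:

mpirun --with-ft ulfm -np 8 --mca opal_base_help_aggregate 0 --mca coll ^han --mca btl tcp,sm,self --mca pml ob1 --mca opal_abort_delay -1 ./shrink_test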
