We are trying to merge a new shm architecture into OFI but are having issues with the accumulate call in OMPI. Based on my debugging, I suspect something is broken in the OMPI accumulate path that is causing more issues with the new shm architecture because the new architecture changes the completion ordering and I think maybe OMPI relies on the ordering to drive progress. Here’s an overview of the issue:
Upstream OFI with OMPI (efa+shm):
```
#-----------------------------------------------------------------------------
# Benchmarking Accumulate
# #processes = 32
#-----------------------------------------------------------------------------
#
# MODE: AGGREGATE
#
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  defects
           0         1000        56.20        62.06        59.44     0.00
           4         1000      2433.36      2486.45      2461.23     0.00
           8         1000      2404.58      2438.60      2421.32     0.00
          16         1000      2417.30      2458.71      2438.72     0.00
          32         1000      2314.42      2346.96      2330.61     0.00
          64         1000      2491.82      2531.65      2511.16     0.00
         128         1000      2367.64      2403.49      2384.84     0.00
         256         1000      2383.21      2428.62      2405.55     0.00
         512         1000      2400.89      2435.05      2417.96     0.00
        1024         1000      2422.63      2460.42      2441.59     0.00
        2048         1000      2398.39      2444.84      2422.02     0.00
        4096         1000     13632.79     13676.66     13654.09     0.00
        8192          574     17468.06     17503.95     17486.15     0.00
       16384          574     17797.57     17839.96     17818.00     0.00
       32768          407     25213.94     25251.19     25233.00     0.00
       65536          311     29203.80     29247.52     29225.65     0.00
      131072          194     53944.97     53988.14     53966.53     0.00
      262144          107     86118.28     86162.16     86140.37     0.00
      524288           61    169526.54    169576.58    169552.20     0.00
     1048576           33    322563.62    322615.76    322589.60     0.00
     2097152           17    630014.72    630069.43    630041.98     0.00
     4194304            9   1232652.46   1232718.93   1232682.76     0.00
```
New OFI shm with OMPI (efa+shm):
```
#-----------------------------------------------------------------------------
# Benchmarking Accumulate
# #processes = 32
#-----------------------------------------------------------------------------
#
# MODE: AGGREGATE
#
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  defects
           0         1000        55.10        59.33        57.66     0.00
           4         1000      2188.70      2225.00      2205.54     0.00
           8         1000     19728.09     19762.60     19744.13     0.00
          16         1000     22701.97     22741.11     22721.46     0.00
          32         1000     11899.94     11932.85     11916.21     0.00
          64         1000     25942.05     25978.16     25959.56     0.00
         128         1000     40955.93     40992.47     40973.34     0.00
         256         1000      2235.33      2270.66      2252.68     0.00
         512         1000     86874.21     86910.03     86893.27     0.00
        1024          438     31918.45     31954.87     31937.56     0.00
        2048          438     47810.38     47845.77     47828.39     0.00
        4096          330     41528.28     41569.20     41546.76     0.00
        8192          330     26340.39     26380.36     26360.13     0.00
       16384          330     13867.52     13906.41     13886.15     0.00
       32768          330     23742.72     23788.30     23765.49     0.00
       65536          330     28680.23     28721.03     28700.80     0.00
      131072          223     46893.69     46936.12     46915.46     0.00
      262144          126     77982.03     78025.16     78003.64     0.00
      524288           69    147472.18    147521.33    147496.89     0.00
     1048576           36    280354.82    280401.88    280378.37     0.00
     2097152           19   1274339.18   1274397.06   1274368.45     0.00
     4194304           10   3307915.48   3307982.50   3307948.00     0.00
```
It looks like OMPI uses an OFI atomic compare-and-swap (cswap) to take a lock on the memory region, and then each process (accumulating into memory on rank 0) does an fi_read. With the new shm, the 0->0 read takes far longer. I suspect this is because the completion ordering is off and OMPI is not driving progress often enough; shm requires progress on both sides, since it is both the source and the target of every operation. One of the changes in the new shm implementation is that it allows out-of-order send completions in specific cases, and I think that architectural change is what triggers this issue.
That said, I think the OMPI accumulate implementation could be improved in general. The performance I'm seeing with OMPI accumulate looks poor even with the upstream shm, and even with shm turned off using only efa. As a reference point, here is data from Intel MPI (with its internal shm turned off, so it should take the same intranode path as the runs above):
```
#-----------------------------------------------------------------------------
# Benchmarking Accumulate
# #processes = 32
#-----------------------------------------------------------------------------
#
# MODE: AGGREGATE
#
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  defects
           0         1000        74.06       114.19        98.74     0.00
           4         1000       133.08       160.12       146.35     0.00
           8         1000       129.06       166.61       149.93     0.00
          16         1000       119.36       169.15       147.67     0.00
          32         1000       128.65       173.21       150.05     0.00
          64         1000       127.99       167.71       145.36     0.00
         128         1000       132.65       169.22       152.13     0.00
         256         1000       131.33       179.38       153.42     0.00
         512         1000       126.57       160.08       146.46     0.00
        1024         1000       142.35       168.87       155.92     0.00
        2048         1000       154.30       170.90       159.81     0.00
        4096         1000       162.49       206.76       182.14     0.00
        8192         1000       186.19       218.22       201.09     0.00
       16384         1000      7865.38      7913.69      7901.05     0.00
       32768          902      9775.29      9827.00      9806.96     0.00
       65536          640      5114.66      5164.30      5141.59     0.00
      131072          320      5840.89      5892.36      5867.85     0.00
      262144          160      7880.56      7927.96      7913.96     0.00
      524288           80      6704.88      6755.86      6734.52     0.00
     1048576           40      7281.14      7330.77      7309.33     0.00
     2097152           20     12235.04     12289.93     12271.67     0.00
     4194304           10     31176.37     31191.63     31181.43     0.00
```
This data is the same with the upstream shm implementation and with the new one. As you can see, OMPI is significantly slower than IMPI. This holds across all providers, so I think the OMPI accumulate implementation can be improved in general. No other benchmark, API, or MPI shows any issue with the new shm architecture, so I believe the problem is isolated to OMPI accumulate specifically.
Hoping to understand the accumulate implementation better so we can both fix the OMPI performance across providers and make the new shm performance more stable.
I'm working with the OMPI source, currently 5.0.6, but I don't see any changes upstream that suggest the behavior would be different. I'm having trouble building the latest because of flex dependencies.