We are trying to merge a new shm architecture into OFI but are having issues with the accumulate call in OMPI. Based on my debugging, I suspect something is broken in the OMPI accumulate path that is causing more issues with the new shm architecture because the new architecture changes the completion ordering and I think maybe OMPI relies on the ordering to drive progress. Here’s an overview of the issue:
Upstream OFI with OMPI (efa+shm):
```
#-----------------------------------------------------------------------------
# Benchmarking Accumulate
# #processes = 32
#-----------------------------------------------------------------------------
#
# MODE: AGGREGATE
#
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  defects
           0         1000        56.20        62.06        59.44     0.00
           4         1000      2433.36      2486.45      2461.23     0.00
           8         1000      2404.58      2438.60      2421.32     0.00
          16         1000      2417.30      2458.71      2438.72     0.00
          32         1000      2314.42      2346.96      2330.61     0.00
          64         1000      2491.82      2531.65      2511.16     0.00
         128         1000      2367.64      2403.49      2384.84     0.00
         256         1000      2383.21      2428.62      2405.55     0.00
         512         1000      2400.89      2435.05      2417.96     0.00
        1024         1000      2422.63      2460.42      2441.59     0.00
        2048         1000      2398.39      2444.84      2422.02     0.00
        4096         1000     13632.79     13676.66     13654.09     0.00
        8192          574     17468.06     17503.95     17486.15     0.00
       16384          574     17797.57     17839.96     17818.00     0.00
       32768          407     25213.94     25251.19     25233.00     0.00
       65536          311     29203.80     29247.52     29225.65     0.00
      131072          194     53944.97     53988.14     53966.53     0.00
      262144          107     86118.28     86162.16     86140.37     0.00
      524288           61    169526.54    169576.58    169552.20     0.00
     1048576           33    322563.62    322615.76    322589.60     0.00
     2097152           17    630014.72    630069.43    630041.98     0.00
     4194304            9   1232652.46   1232718.93   1232682.76     0.00
```
New OFI shm with OMPI (efa+shm):
```
#-----------------------------------------------------------------------------
# Benchmarking Accumulate
# #processes = 32
#-----------------------------------------------------------------------------
#
# MODE: AGGREGATE
#
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  defects
           0         1000        55.10        59.33        57.66     0.00
           4         1000      2188.70      2225.00      2205.54     0.00
           8         1000     19728.09     19762.60     19744.13     0.00
          16         1000     22701.97     22741.11     22721.46     0.00
          32         1000     11899.94     11932.85     11916.21     0.00
          64         1000     25942.05     25978.16     25959.56     0.00
         128         1000     40955.93     40992.47     40973.34     0.00
         256         1000      2235.33      2270.66      2252.68     0.00
         512         1000     86874.21     86910.03     86893.27     0.00
        1024          438     31918.45     31954.87     31937.56     0.00
        2048          438     47810.38     47845.77     47828.39     0.00
        4096          330     41528.28     41569.20     41546.76     0.00
        8192          330     26340.39     26380.36     26360.13     0.00
       16384          330     13867.52     13906.41     13886.15     0.00
       32768          330     23742.72     23788.30     23765.49     0.00
       65536          330     28680.23     28721.03     28700.80     0.00
      131072          223     46893.69     46936.12     46915.46     0.00
      262144          126     77982.03     78025.16     78003.64     0.00
      524288           69    147472.18    147521.33    147496.89     0.00
     1048576           36    280354.82    280401.88    280378.37     0.00
     2097152           19   1274339.18   1274397.06   1274368.45     0.00
     4194304           10   3307915.48   3307982.50   3307948.00     0.00
```
It looks like OMPI uses an OFI atomic compare-and-swap (cswap) to take a lock on the memory region, and then each process (accumulating into memory on rank 0) does an fi_read. With the new shm, the 0->0 read takes far longer. I suspect this is because the completion ordering is off and OMPI is not driving progress often enough; shm requires progress on both sides, since it is both the source and the target of every operation. One of the changes in the new shm implementation is that it allows out-of-order send completions in specific cases, and I think that architectural change is what triggers this issue.
That said, I think the OMPI accumulate implementation could be improved in general. The performance I'm seeing with OMPI accumulate looks poor even with the upstream shm, and even with shm turned off using only efa. As a reference point, here is data from Intel MPI (with its internal shm turned off, so it should take the same intranode path as the runs above):
```
#-----------------------------------------------------------------------------
# Benchmarking Accumulate
# #processes = 32
#-----------------------------------------------------------------------------
#
# MODE: AGGREGATE
#
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  defects
           0         1000        74.06       114.19        98.74     0.00
           4         1000       133.08       160.12       146.35     0.00
           8         1000       129.06       166.61       149.93     0.00
          16         1000       119.36       169.15       147.67     0.00
          32         1000       128.65       173.21       150.05     0.00
          64         1000       127.99       167.71       145.36     0.00
         128         1000       132.65       169.22       152.13     0.00
         256         1000       131.33       179.38       153.42     0.00
         512         1000       126.57       160.08       146.46     0.00
        1024         1000       142.35       168.87       155.92     0.00
        2048         1000       154.30       170.90       159.81     0.00
        4096         1000       162.49       206.76       182.14     0.00
        8192         1000       186.19       218.22       201.09     0.00
       16384         1000      7865.38      7913.69      7901.05     0.00
       32768          902      9775.29      9827.00      9806.96     0.00
       65536          640      5114.66      5164.30      5141.59     0.00
      131072          320      5840.89      5892.36      5867.85     0.00
      262144          160      7880.56      7927.96      7913.96     0.00
      524288           80      6704.88      6755.86      6734.52     0.00
     1048576           40      7281.14      7330.77      7309.33     0.00
     2097152           20     12235.04     12289.93     12271.67     0.00
     4194304           10     31176.37     31191.63     31181.43     0.00
```
This data is the same with the upstream shm implementation and with the new one. As you can see, OMPI is significantly slower than IMPI. This holds across all providers, so I think the OMPI accumulate implementation can be improved in general. No other benchmark, API, or MPI shows any issue with the new shm architecture, so I believe the problem is isolated to OMPI accumulate specifically.
Hoping to understand the accumulate implementation better so we can both fix the OMPI performance across providers and make the new shm performance more stable.
I'm working with the OMPI source, currently 5.0.6, but I don't see any changes upstream that suggest the behavior would be different. I'm having trouble building the latest because of flex dependencies.