Remove MPI_Barriers before routing to increase speed. #846

JoshCu · 2024-07-02T22:08:04Z

Improve ngen-parallel Performance with T-route

Problem Statement

When running ngen-parallel with -n near or equal to core count, T-route's performance is severely degraded. T-route doesn't parallelize well with MPI, and moving the finalize step to before the routing frees up CPU resources for T-route to use different parallelization strategies.

Changes Made

Moved MPI_Finalize before the routing step
Applied additional T-route performance patch update csv to parquet to use multiprocessing t-route#795

The troute change is semi related as it converts a for loop that consumes the majority of t-route's execution time to use multiprocessing. That performance improvement doesn't work while MPI wait is consuming every CPU cycle.

Performance Results

Testing setup: ~6500 catchments, 24 timesteps, dual Xeon 28C 56T total (same as #843)

Configuration	NGen::routing Time (s)
Serial	13.9106
Parallel (-n 96 unpatched)	37.2
Parallel (-n 56 unpatched)	21.6952
Parallel (-n 56 patched)	10.761
Parallel (-n 56 patched + T-route perf patch)	~6 (more testing needed)

96 was performed on a different 96 core machine
56 were all performed on a 56 core machine

Explanation

T-route has a step that transposes per catchment/nexus files into per timestep files.
This transpose step is the longest part of T-route's execution and scales poorly with more catchments or timesteps.
In the unpatched version, all ranks except zero progress to MPI_Finalize and then begin polling while waiting for the routing thread to finish.
This polling maxes out the CPU on every rank while rank 0 computes the routing.
Moving Finalize before routing ensures all threads are finished before routing begins, preventing CPU maxout.

Future Work

Consider reworking the file transposition step for more efficient I/O (out of scope for this change)

Testing

Used the same setup as Update Formulation_Manager.hpp to use boost regex #843
Manual runs show consistent improvement, but automated testing may be required for more accurate results (especially for simulation that shows a ~10% improvement with a similar deviation)

Performance Visualization

Perf flamegraph of the entire ngen run (unpatched), should be interactive if downloaded

Additional Notes

Moving the finalize before timing log output appears to make the ngen simulation step ~10% faster, but may affect timing accuracy.
This use case exaggerates T-route performance degradation due to using 56 MPI ranks on 56 cores.
ngen was built using mpich 3.3.2. I saw online that changing the MPI wait policy to poll less frequently should help, but I couldn't get it to work
I'm not entirely sure of the implications of calling return 0; right after finalize for mpi_rank != 0;. I thought finalize would kill/return the subprocesses but without explicitly returning them I got segfaults. Reccomendations seem to be that as little as possible should be performed after calling finalize, but seeing as all computation after that point is done by one thread I can just return the others?

Next Steps

Conduct automated benchmarks to verify performance improvements to troute
Conduct automated benchmarks to verify performance improvements to ngen simulation

PhilMiller · 2024-07-03T14:51:46Z

I really like the idea here, but there's a fundamental challenge. As MPI is specified, we can't actually be confident that anything after a call to MPI_Finalize actually runs. I'll take a deeper look at whether there's a nicer way we can do this.

PhilMiller · 2024-07-08T22:18:38Z

Ok, I've looked at this a little bit more, and here's my tentative suggestion:

Rather than having all of the non-0 processes finalize and exit after the catchment simulation is complete, have them all call MPI_Ibarrier, followed by a loop of MPI_Test on the Ibarrier request and sleep for a decent time slice (e.g. 100 ms) until it's complete. Rank 0 would only enter the barrier after completing routing.

Is that something you want to try to implement? If not, I can find some time to work up an alternative patch. In the latter case, do you mind if I push it to your branch for this PR?

Ultimately, we expect to more deeply integrate t-route with parallelization directly into ngen with BMI. Until that's implemented, though, it's pretty reasonable to find expedient ways to improve performance.

JoshCu · 2024-07-10T17:40:06Z

Yeah I'll give it a go :) I might not get to it for another day or so but I'm happy to do it

JoshCu · 2024-07-18T22:52:26Z

This version works so far as the waiting mpi threads now no longer max out the CPU, but it's tricky to say what I'm actually measuring in terms of performance since this went through NOAA-OWP/t-route#795
I need to run a test on some data with more than 25 timesteps as the difference with and without this fix is ~<1s, ~9s vs ~10s.

I also wasn't sure if manager->finalize(); could go before the barrier or after it? If the routing was using MPI, would we need to wait for all threads to reach the post routing barrier before calling finalize?

JoshCu · 2024-07-19T15:21:24Z

Testing this on some longer runs, ~6500 catchments cfe + noaa owp for 6 months, shows that speedup isn't as dramatic now that the routing has that multiprocessing change in it.

Before MPI change

Finished routing
NGen top-level timings:
	NGen::init: 21.621
	NGen::simulation: 537.511
	NGen::routing: 837.387

real	23m17.312s
user	1836m41.492s
sys	102m31.267s

After change

	NGen::init: 21.6584
	NGen::simulation: 529.272
	NGen::routing: 783.689

real	22m15.342s
user	640m16.918s
sys	164m18.954s

JoshCu · 2024-09-05T18:48:43Z

Performance Testing Results

In summary: it makes routing ~1.3x-1.4x faster
Apologies for the inconsistent performance testing, I was using a terrible t-route configuration. I've adjusted it some more to run as fast as I can get it then reran some larger tests

Test Setup

Cores: 96
Subnetwork target size: 100 (determined to be the fastest from multiple tests)
t-route output format: NetCDF
Catchments: 60,000
Time period: 240 hours (in five-minute timesteps in t-route)

Results

t-route run standalone in NGIAB
- Method: Running python -m nwm_routing manually
- Execution time: 115 seconds
t-route run via ngen in NGIAB using ngen master
- Execution time: 161 seconds
t-route run via ngen in NGIAB using this PR
- Execution time: 120 seconds

2024-09-04 23:27:24,582 - root - INFO - [__main__.py:340 - main_v04]: ************ TIMING SUMMARY ************
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:341 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:342 - main_v04]: Network graph construction: 20.63 secs, 17.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:349 - main_v04]: Forcing array construction: 29.57 secs, 24.55 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:356 - main_v04]: Routing computations: 59.15 secs, 49.09 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:363 - main_v04]: Output writing: 10.99 secs, 9.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:370 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:371 - main_v04]: Total execution time: 120.33999999999999 secs
Finished routing
NGen top-level timings:
	NGen::init: 103.262
	NGen::simulation: 234.654
	NGen::routing: 120.578

and that init time can be reduced to 26 seconds this master...JoshCu:ngen:open_files_first_ask_questions_later

JoshCu · 2024-10-22T18:12:17Z

Recording this here with for reference:
I've been doing other testing and the first mpi barrier after model execution, before the routing also has the same issue. On a single machine, when a rank completes simulation it polls as fast as possible, pinning that core at 100% which results in my machine lowering the clock speed of the cpu. It's only slower because of the reduction in clock speed and not as a result of other processes being unable to use the cores (like with t-route) so it's <5% slowdown. It's a marginal difference on a desktop and I'm guessing this wouldn't be a problem on servers with fixed clock speed cpus. Might save some electricity though!

Update NGen.cpp Remove MPI_BARRIERs

5a8aa8b

PhilMiller self-assigned this Jul 3, 2024

PhilMiller self-requested a review July 3, 2024 14:52

JoshCu mentioned this pull request Jul 3, 2024

update csv to parquet to use multiprocessing NOAA-OWP/t-route#795

Merged

Manually sleep mpi threads before finalize

f4025a2

Merge branch 'NOAA-OWP:master' into mpi_routing

733ce34

JoshCu mentioned this pull request Oct 28, 2024

MPI Scaling fix #894

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove MPI_Barriers before routing to increase speed. #846

Remove MPI_Barriers before routing to increase speed. #846

JoshCu commented Jul 2, 2024 •

edited

Loading

PhilMiller commented Jul 3, 2024

PhilMiller commented Jul 8, 2024

JoshCu commented Jul 10, 2024

JoshCu commented Jul 18, 2024

JoshCu commented Jul 19, 2024

JoshCu commented Sep 5, 2024 •

edited

Loading

JoshCu commented Oct 22, 2024

Remove MPI_Barriers before routing to increase speed. #846

Are you sure you want to change the base?

Remove MPI_Barriers before routing to increase speed. #846

Conversation

JoshCu commented Jul 2, 2024 • edited Loading

Improve ngen-parallel Performance with T-route

Problem Statement

Changes Made

Performance Results

Explanation

Future Work

Testing

Performance Visualization

Additional Notes

Next Steps

PhilMiller commented Jul 3, 2024

PhilMiller commented Jul 8, 2024

JoshCu commented Jul 10, 2024

JoshCu commented Jul 18, 2024

JoshCu commented Jul 19, 2024

Before MPI change

After change

JoshCu commented Sep 5, 2024 • edited Loading

Performance Testing Results

Test Setup

Results

JoshCu commented Oct 22, 2024

JoshCu commented Jul 2, 2024 •

edited

Loading

JoshCu commented Sep 5, 2024 •

edited

Loading