Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove MPI_Barriers before routing to increase speed. #846

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

JoshCu
Copy link

@JoshCu JoshCu commented Jul 2, 2024

Improve ngen-parallel Performance with T-route

Problem Statement

When running ngen-parallel with -n near or equal to core count, T-route's performance is severely degraded. T-route doesn't parallelize well with MPI, and moving the finalize step to before the routing frees up CPU resources for T-route to use different parallelization strategies.

Changes Made

The troute change is semi related as it converts a for loop that consumes the majority of t-route's execution time to use multiprocessing. That performance improvement doesn't work while MPI wait is consuming every CPU cycle.

Performance Results

Testing setup: ~6500 catchments, 24 timesteps, dual Xeon 28C 56T total (same as #843)

Configuration NGen::routing Time (s)
Serial 13.9106
Parallel (-n 96 unpatched) 37.2
Parallel (-n 56 unpatched) 21.6952
Parallel (-n 56 patched) 10.761
Parallel (-n 56 patched + T-route perf patch) ~6 (more testing needed)

96 was performed on a different 96 core machine
56 were all performed on a 56 core machine

Explanation

  1. T-route has a step that transposes per catchment/nexus files into per timestep files.
  2. This transpose step is the longest part of T-route's execution and scales poorly with more catchments or timesteps.
  3. In the unpatched version, all ranks except zero progress to MPI_Finalize and then begin polling while waiting for the routing thread to finish.
  4. This polling maxes out the CPU on every rank while rank 0 computes the routing.
  5. Moving Finalize before routing ensures all threads are finished before routing begins, preventing CPU maxout.

Future Work

  • Consider reworking the file transposition step for more efficient I/O (out of scope for this change)

Testing

Performance Visualization

Performance Flamegraph
Perf flamegraph of the entire ngen run (unpatched), should be interactive if downloaded

Additional Notes

  1. Moving the finalize before timing log output appears to make the ngen simulation step ~10% faster, but may affect timing accuracy.
  2. This use case exaggerates T-route performance degradation due to using 56 MPI ranks on 56 cores.
  3. ngen was built using mpich 3.3.2. I saw online that changing the MPI wait policy to poll less frequently should help, but I couldn't get it to work
  4. I'm not entirely sure of the implications of calling return 0; right after finalize for mpi_rank != 0;. I thought finalize would kill/return the subprocesses but without explicitly returning them I got segfaults. Reccomendations seem to be that as little as possible should be performed after calling finalize, but seeing as all computation after that point is done by one thread I can just return the others?

Next Steps

  • Conduct automated benchmarks to verify performance improvements to troute
  • Conduct automated benchmarks to verify performance improvements to ngen simulation

@PhilMiller
Copy link
Contributor

I really like the idea here, but there's a fundamental challenge. As MPI is specified, we can't actually be confident that anything after a call to MPI_Finalize actually runs. I'll take a deeper look at whether there's a nicer way we can do this.

@PhilMiller
Copy link
Contributor

Ok, I've looked at this a little bit more, and here's my tentative suggestion:

Rather than having all of the non-0 processes finalize and exit after the catchment simulation is complete, have them all call MPI_Ibarrier, followed by a loop of MPI_Test on the Ibarrier request and sleep for a decent time slice (e.g. 100 ms) until it's complete. Rank 0 would only enter the barrier after completing routing.

Is that something you want to try to implement? If not, I can find some time to work up an alternative patch. In the latter case, do you mind if I push it to your branch for this PR?

Ultimately, we expect to more deeply integrate t-route with parallelization directly into ngen with BMI. Until that's implemented, though, it's pretty reasonable to find expedient ways to improve performance.

@JoshCu
Copy link
Author

JoshCu commented Jul 10, 2024

Yeah I'll give it a go :) I might not get to it for another day or so but I'm happy to do it

@JoshCu
Copy link
Author

JoshCu commented Jul 18, 2024

This version works so far as the waiting mpi threads now no longer max out the CPU, but it's tricky to say what I'm actually measuring in terms of performance since this went through NOAA-OWP/t-route#795
I need to run a test on some data with more than 25 timesteps as the difference with and without this fix is ~<1s, ~9s vs ~10s.

I also wasn't sure if manager->finalize(); could go before the barrier or after it? If the routing was using MPI, would we need to wait for all threads to reach the post routing barrier before calling finalize?

@JoshCu
Copy link
Author

JoshCu commented Jul 19, 2024

Testing this on some longer runs, ~6500 catchments cfe + noaa owp for 6 months, shows that speedup isn't as dramatic now that the routing has that multiprocessing change in it.

Before MPI change

Finished routing
NGen top-level timings:
	NGen::init: 21.621
	NGen::simulation: 537.511
	NGen::routing: 837.387

real	23m17.312s
user	1836m41.492s
sys	102m31.267s

After change

	NGen::init: 21.6584
	NGen::simulation: 529.272
	NGen::routing: 783.689

real	22m15.342s
user	640m16.918s
sys	164m18.954s

@JoshCu
Copy link
Author

JoshCu commented Sep 5, 2024

Performance Testing Results

In summary: it makes routing ~1.3x-1.4x faster
Apologies for the inconsistent performance testing, I was using a terrible t-route configuration. I've adjusted it some more to run as fast as I can get it then reran some larger tests

Test Setup

  • Cores: 96
  • Subnetwork target size: 100 (determined to be the fastest from multiple tests)
  • t-route output format: NetCDF
  • Catchments: 60,000
  • Time period: 240 hours (in five-minute timesteps in t-route)

Results

  1. t-route run standalone in NGIAB

    • Method: Running python -m nwm_routing manually
    • Execution time: 115 seconds
  2. t-route run via ngen in NGIAB using ngen master

    • Execution time: 161 seconds
  3. t-route run via ngen in NGIAB using this PR

    • Execution time: 120 seconds
2024-09-04 23:27:24,582 - root - INFO - [__main__.py:340 - main_v04]: ************ TIMING SUMMARY ************
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:341 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:342 - main_v04]: Network graph construction: 20.63 secs, 17.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:349 - main_v04]: Forcing array construction: 29.57 secs, 24.55 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:356 - main_v04]: Routing computations: 59.15 secs, 49.09 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:363 - main_v04]: Output writing: 10.99 secs, 9.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:370 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:371 - main_v04]: Total execution time: 120.33999999999999 secs
Finished routing
NGen top-level timings:
	NGen::init: 103.262
	NGen::simulation: 234.654
	NGen::routing: 120.578

and that init time can be reduced to 26 seconds this master...JoshCu:ngen:open_files_first_ask_questions_later

@JoshCu
Copy link
Author

JoshCu commented Oct 22, 2024

Recording this here with for reference:
I've been doing other testing and the first mpi barrier after model execution, before the routing also has the same issue. On a single machine, when a rank completes simulation it polls as fast as possible, pinning that core at 100% which results in my machine lowering the clock speed of the cpu. It's only slower because of the reduction in clock speed and not as a result of other processes being unable to use the cores (like with t-route) so it's <5% slowdown. It's a marginal difference on a desktop and I'm guessing this wouldn't be a problem on servers with fixed clock speed cpus. Might save some electricity though!

@JoshCu JoshCu mentioned this pull request Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants