Make TracerModel thread parallel #6206
Conversation
jenkins build this please
PR #6037 did some early exploration in this area. Is this PR in any way related to that work?
#ifdef _OPENMP
#pragma omp parallel for
#endif
for (const auto& chunk : element_chunks_) {
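    // each chunk covers its own disjoint range of elements, so the chunk
    // iterations are independent and can safely run in parallel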
did you actually benchmark this one? in my benches this is consistently a waste of resources, as it is entirely memory bound, both on my 7975wx (bandwidth maxed) and on an epyc (bandwidth somewhat limited).
Yes I did. It gave a good percentage improvement, but this is still only a small part of the total pre/post time, so it doesn't help that much. This was run on my AMD Ryzen 9 5900X 12-Core Processor, so a lot less memory bandwidth and fewer memory channels than yours.
master
prepareStep pre_post_time: 0.015111
Time to update storage cache: 0.00430116 seconds
this PR
prepareStep pre_post_time: 0.014785
Time to update storage cache: 0.00280143 seconds
very interesting. on my box it went from 0.99 s to 1.13 s for the entire run (measured using Tracy). shows how insanely difficult it is to optimize code these days.
I was not aware of that PR. Why has this not been merged in?
mostly time. i've been working on the data structures to improve efficiency but have consistently gotten sidetracked by other things (support tickets, releases, ..). an efficiency of 66% for 2 threads is kinda mediocre. last I left it I had increased that to a speedup of 1.52 / an efficiency of 76%, but it's still kinda hacky.
Ahh I see, sorry for unwittingly duplicating your work! We should aim to get one of them in then. Getting better thread parallelisation is important for when I do MPI serial runs with GPU acceleration.
we can merge this as such, this part is entirely the same in my branch (not yet published), except i didn't thread the update since there are no gains.
since you have prepared this part, let's just use this PR, instead of me having to do the cherry-picking dance from my branch to achieve the same.
Ok. Again, sorry for duplicating, that was not my intention. I think I need to turn on notifications to stay on top of all new PRs... I see that you have some other draft branches for thread parallelisation as well. Looking forward to further improvements. Also, if you see some more improvements here, please add them on top. As you can see from the comparison with MPI parallelisation, there is still lots of potential.
no worries, i'm in no way offended haha.
Improve TracerModel performance with OpenMP parallelization
This PR improves the performance of the TracerModel by adding OpenMP parallelization over grid elements. The main changes are listed below (a simplified sketch of the chunked loop pattern follows the list):
- ElementChunks for efficient chunk-based parallelization of grid operations
- assembleTracerEquations_() - parallelizes the equation assembly process
- updateStorageCache() - parallelizes the storage computation and updates
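To make the chunked pattern concrete, here is a minimal, self-contained sketch of the idea. It is an illustration only, not the actual OPM code: the ElementChunk struct, the makeElementChunks() helper and the body of the cache update are simplified stand-ins introduced purely for this example.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one chunk: a contiguous range of element indices
// that a single thread processes on its own.
struct ElementChunk {
    std::size_t begin;
    std::size_t end;
};

// Partition numElements indices into roughly equally sized contiguous chunks.
std::vector<ElementChunk> makeElementChunks(std::size_t numElements,
                                            std::size_t numChunks)
{
    std::vector<ElementChunk> chunks;
    const std::size_t step =
        numChunks > 0 ? (numElements + numChunks - 1) / numChunks : numElements;
    for (std::size_t b = 0; b < numElements; b += step) {
        chunks.push_back({b, std::min(b + step, numElements)});
    }
    return chunks;
}

// Chunk-parallel cache update in the spirit of updateStorageCache(): the
// outer loop over chunks is threaded, and each element is written by exactly
// one thread, so the loop body needs no synchronization.
void updateStorageCache(std::vector<double>& storageCache,
                        const std::vector<double>& tracerStorage,
                        const std::vector<ElementChunk>& elementChunks)
{
    const std::ptrdiff_t numChunks =
        static_cast<std::ptrdiff_t>(elementChunks.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (std::ptrdiff_t c = 0; c < numChunks; ++c) {
        for (std::size_t elem = elementChunks[c].begin;
             elem < elementChunks[c].end; ++elem) {
            storageCache[elem] = tracerStorage[elem];
        }
    }
}

In the real model the loop body would compute the tracer storage from the current solution rather than copy a vector; the point here is only the loop structure that the "#pragma omp parallel for" is applied to.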
Scaling performance measurements (Norne)
Comparison with master (Norne)
master
Pre/post step time of 62.11 s (30.3% of total simulation time)
This PR
Pre/post step time of 44.89 s (24.0% of total simulation time)
Profiling results
A breakdown of the time spent in the different parts is given below, showing the first time step on Norne with 12 threads.
master
this PR
The majority of the time was spent in the tracer equation assembly (104 ms), which was reduced to 43 ms. However, even after the optimisations, this is still the largest contributor. Other ideas to further improve the thread parallelisation of the pre/post section are welcome.
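To illustrate why the assembly loop can be threaded in the same chunked way, here is another purely illustrative sketch (again not the actual OPM code). It assumes each element writes only its own residual entry and shows only the accumulation term; fluxes, wells and the sparse matrix are left out.

#include <cstddef>
#include <vector>

// Same hypothetical chunk type as in the sketch further up.
struct ElementChunk {
    std::size_t begin;
    std::size_t end;
};

// Simplified chunk-parallel assembly in the spirit of
// assembleTracerEquations_(): because element `elem` only writes
// residual[elem], two threads never touch the same entry and the outer loop
// over chunks can be an OpenMP parallel for.
void assembleTracerResidual(std::vector<double>& residual,
                            const std::vector<double>& storageOld,
                            const std::vector<double>& storageNew,
                            const std::vector<ElementChunk>& elementChunks,
                            double dt)
{
    const std::ptrdiff_t numChunks =
        static_cast<std::ptrdiff_t>(elementChunks.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (std::ptrdiff_t c = 0; c < numChunks; ++c) {
        for (std::size_t elem = elementChunks[c].begin;
             elem < elementChunks[c].end; ++elem) {
            // accumulation term only: (storage_new - storage_old) / dt;
            // flux and well contributions are omitted from this sketch.
            residual[elem] = (storageNew[elem] - storageOld[elem]) / dt;
        }
    }
}

If contributions to neighbouring rows were needed, this simple pattern would no longer be race-free, which is one reason further gains in this part are harder to come by.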