Make TracerModel thread parallel #6206


Open · wants to merge 1 commit into master

Conversation

jakobtorben
Contributor

Improve TracerModel performance with OpenMP parallelization

This PR improves the performance of the TracerModel by adding OpenMP parallelization over grid elements. The main changes include:

  1. Use ElementChunks for efficient chunk-based parallelization of grid operations, in the same manner as #5912 (Switch property evaluation to using ElementChunks)
  2. Added OpenMP parallelization to the element loops in (see the sketch below):
    • assembleTracerEquations_() - parallelizes the equation assembly
    • updateStorageCache() - parallelizes the storage computation and updates
  3. Added early-exit checks to avoid unnecessary work for empty tracer batches
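
For illustration, a minimal, self-contained sketch of the chunk-based loop pattern is given below. It is not the actual code from this PR: ElementChunk, TracerBatch, and the per-element work are simplified stand-ins for the OPM types, and an index-based loop is used here instead of the range-based loop in the real code.

```cpp
// Sketch of chunk-based OpenMP parallelization over grid elements.
// All types and the per-element work are simplified stand-ins, not OPM code.
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in: a pre-computed range of element indices per chunk,
// so each thread can process a contiguous block of elements.
using ElementChunk = std::vector<std::size_t>;

struct TracerBatch
{
    std::vector<double> concentration; // one value per element
    bool empty() const { return concentration.empty(); }
};

void updateStorageCacheSketch(std::vector<double>& storage,
                              const TracerBatch& batch,
                              const std::vector<ElementChunk>& elementChunks,
                              const std::vector<double>& poreVolume)
{
    // Early exit: nothing to do for an empty tracer batch.
    if (batch.empty()) {
        return;
    }

    // Chunks hold disjoint element indices, so each one can be handled by a
    // different thread without synchronization.
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int c = 0; c < static_cast<int>(elementChunks.size()); ++c) {
        for (const std::size_t elemIdx : elementChunks[c]) {
            // Per-element storage update; each thread writes distinct indices.
            storage[elemIdx] = poreVolume[elemIdx] * batch.concentration[elemIdx];
        }
    }
}

int main()
{
    const std::size_t numElements = 8;
    std::vector<ElementChunk> chunks{{0, 1, 2, 3}, {4, 5, 6, 7}};

    TracerBatch batch;
    batch.concentration.assign(numElements, 0.5);

    std::vector<double> poreVolume(numElements, 2.0);
    std::vector<double> storage(numElements, 0.0);

    updateStorageCacheSketch(storage, batch, chunks, poreVolume);

    // Expected: every entry equals 1.0, so the sum is 8.0.
    return std::accumulate(storage.begin(), storage.end(), 0.0) == 8.0 ? 0 : 1;
}
```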

Scaling performance measurements (Norne)

Processors   Pre/post time (s)   Speedup
     1            77.98            1.00
     2            58.26            1.34
     4            51.10            1.53
     8            46.75            1.67
    12            44.89            1.74

Comparison with master (Norne)

master

================    End of simulation     ===============

Number of MPI processes:         1
Threads per MPI process:        12
Setup time:                      2.54 s
  Deck input:                    1.77 s
Number of timesteps:           355
Simulation time:               205.26 s
  Assembly time:                21.42 s (Wasted: 0.5 s; 2.5%)
    Well assembly:               2.05 s (Wasted: 0.0 s; 2.1%)
  Linear solve time:            99.76 s (Wasted: 3.1 s; 3.1%)
    Linear setup:               46.46 s (Wasted: 1.5 s; 3.1%)
  Props/update time:            19.24 s (Wasted: 0.6 s; 3.0%)
  Pre/post step:                62.11 s (Wasted: 0.0 s; 0.1%)  <-- Target for optimization
  Output write time:             0.62 s
Overall Linearizations:       1656      (Wasted:    42; 2.5%)
Overall Newton Iterations:    1303      (Wasted:    42; 3.2%)
Overall Linear Iterations:    2294      (Wasted:    74; 3.2%)

Pre/post step time of 62.11 s (30.3% of total simulation time)

This PR

================    End of simulation     ===============

Number of MPI processes:         1
Threads per MPI process:        12
Setup time:                      2.03 s
  Deck input:                    1.30 s
Number of timesteps:           355
Simulation time:               187.32 s
  Assembly time:                21.11 s (Wasted: 0.5 s; 2.4%)
    Well assembly:               2.13 s (Wasted: 0.0 s; 2.1%)
  Linear solve time:            99.12 s (Wasted: 3.1 s; 3.1%)
    Linear setup:               46.05 s (Wasted: 1.4 s; 3.1%)
  Props/update time:            19.55 s (Wasted: 0.6 s; 3.0%)
  Pre/post step:                44.89 s (Wasted: 0.0 s; 0.1%)  <-- Target for optimization
  Output write time:             0.58 s
Overall Linearizations:       1656      (Wasted:    42; 2.5%)
Overall Newton Iterations:    1303      (Wasted:    42; 3.2%)
Overall Linear Iterations:    2294      (Wasted:    74; 3.2%)

Pre/post step time of 44.89 s (24.0% of total simulation time)

Profiling results

A breakdown of the time spent in the different parts is given below, measured for the first time step on Norne with 12 threads.

master

prepareStep pre_post_time: 0.015111
    Time to update storage cache: 0.00430116 seconds

afterStep pre_post_time: 0.129028
    wellModel_.endTimeStep() time: 0.00328097
    aquiferModel_.endTimeStep() time: 3.41e-07
    tracerModel_.endTimeStep() time: 0.107563
    
    Time to advance tracer fields: 0.107551 seconds
        Time to assemble tracer equations accounts for: 0.104322 seconds
    
    model().linearizer().updateFlowsInfo() time: 9.52e-07

this PR

prepareStep pre_post_time: 0.014785
    Time to update storage cache: 0.00280143 seconds

afterStep pre_post_time: 0.0692826
    wellModel_.endTimeStep() time: 0.00371866
    aquiferModel_.endTimeStep() time: 3.41e-07
    tracerModel_.endTimeStep() time: 0.047649
    
    Time to advance tracer fields: 0.0476362 seconds
        Time to assemble tracer equations accounts for: 0.0430794 seconds
        Time to solve tracer equations: 0.00043663 seconds
    
    model().linearizer().updateFlowsInfo() time: 1.132e-06

The majority of the time was spent in the tracer equation assembly (104 ms), which was reduced to 43 ms. However, even after the optimisations, this is still the largest contributor. Other ideas to further improve the thread parallelisation of the pre/post section are welcome.

@jakobtorben
Contributor Author

jenkins build this please

@bska
Member

bska commented Apr 25, 2025

PR #6037 did some early exploration in this area. Is this PR in any way related to that work?

#ifdef _OPENMP
#pragma omp parallel for
#endif
for (const auto& chunk : element_chunks_) {
Member

akva2 commented Apr 25, 2025

Did you actually benchmark this one? In my benches this is consistently a waste of resources, as it is entirely memory bound, both on my 7975wx (bandwidth maxed) and on an epyc (bandwidth somewhat limited).

jakobtorben (Contributor Author)

Yes I did. It gave a good percentage improvement, but this is still quite a small part of the total pre/post time, so it doesn't help that much. This was run on my AMD Ryzen 9 5900X 12-Core Processor, which has a lot less memory bandwidth and fewer memory channels than yours.

master

prepareStep pre_post_time: 0.015111
Time to update storage cache: 0.00430116 seconds

this PR

prepareStep pre_post_time: 0.014785
Time to update storage cache: 0.00280143 seconds

akva2 (Member)

Very interesting. On my box it went from 0.99 s to 1.13 s for the entire run (measured using Tracy). Shows how insanely difficult it is to optimize code these days.

@jakobtorben
Contributor Author

I was not aware of that PR. Why has this not been merged in?

@akva2
Member

akva2 commented Apr 25, 2025

Mostly time. I've been working on the data structures to improve efficiency but have consistently gotten sidetracked by other things (support tickets, release, ..).

An efficiency of 66% for 2 threads is kinda mediocre. When I last left it I had increased it to a speedup of 1.52 / an efficiency of 76%, but it's still kinda hacky.

@jakobtorben
Contributor Author

Ahh I see, sorry for unknowingly duplicating your work! We should aim to get one of them in then. Getting better thread parallelisation is important for when I do MPI-serial runs with GPU acceleration.

@akva2
Member

akva2 commented Apr 25, 2025

We can merge this as such; this part is entirely the same in my branch (not yet published), except I didn't thread the update since there are no gains.

@jakobtorben
Contributor Author

We can also merge yours since you added this first. It doesn't matter to me. The most important thing is to get this in, as I need the functionality now.

But yeah, I agree it's still not ideal. It is still far away from the MPI-parallel pre/post performance, so there is room for more improvements:

[image: pre/post step timing comparison with MPI parallelisation]

@akva2
Member

akva2 commented Apr 25, 2025

Since you have prepared this part, let's just use this PR, instead of me having to do the cherry-picking dance from my branch to achieve the same.

@jakobtorben
Contributor Author

Ok. Again sorry for duplicating, this was not my intention. I think I need to turn on notifications to stay on top of all new PRs...

I see that you have some other draft branches for thread parallelisation as well. Looking forward to further improvements.

Also, if you see some more improvements here, please add them on top. As you can see from the comparison with MPI parallelisation, there is still lots of potential.

@akva2
Member

akva2 commented Apr 25, 2025

No worries, I'm in no way offended, haha.
