Make TracerModel thread parallel #6206
Conversation
jenkins build this please
PR #6037 did some early exploration in this area. Is this PR in any way related to that work?
#ifdef _OPENMP
#pragma omp parallel for
#endif
for (const auto& chunk : element_chunks_) {
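    // each chunk covers its own disjoint range of elements, so the chunk
    // iterations are independent and can safely run in parallel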
did you actually benchmark this one? in my benches this is consistently a waste of resources, as it is entirely memory bound, both on my 7975wx (bandwidth maxed) and on an epyc (bandwidth somewhat limited).
Yes I did. It gave a good percentage improvement, but this is still only a small part of the total pre/post time, so it doesn't help that much. This was run on my AMD Ryzen 9 5900X 12-Core Processor, so a lot less memory bandwidth and fewer memory channels than yours.
master
prepareStep pre_post_time: 0.015111
Time to update storage cache: 0.00430116 seconds
this PR
prepareStep pre_post_time: 0.014785
Time to update storage cache: 0.00280143 seconds
very interesting. on my box it went from 0.99 s to 1.13 s for the entire run (measured using Tracy). shows how insanely difficult it is to optimize code these days.
I was not aware of that PR. Why has this not been merged in?
mostly time. i've been working on the data structures to improve efficiency but have consistently gotten sidetracked by other things (support tickets, releases, ..). an efficiency of 66% for 2 threads is kinda mediocre. last I left it I had increased that to a speedup of 1.52 / an efficiency of 76%, but it's still kinda hacky.
Ahh I see, sorry for unwittingly duplicating your work! We should aim to get one of them in then. Getting better thread parallelisation is important for when I do MPI serial runs with GPU acceleration.
we can merge this as such, this part is entirely the same in my branch (not yet published), except i didn't thread the update since there are no gains.
since you have prepared this part, let's just use this PR, instead of me having to do the cherry-picking dance from my branch to achieve the same.
Ok. Again, sorry for duplicating, that was not my intention. I think I need to turn on notifications to stay on top of all new PRs... I see that you have some other draft branches for thread parallelisation as well. Looking forward to further improvements. Also, if you see some more improvements here, please add them on top. As you can see from the comparison with MPI parallelisation, there is still lots of potential.
no worries, i'm in no way offended haha.
Improve TracerModel performance with OpenMP parallelization
This PR improves the performance of the TracerModel by adding OpenMP parallelization over grid elements. The main changes are listed below (a simplified sketch of the chunked loop pattern follows the list):
- ElementChunks for efficient chunk-based parallelization of grid operations
- assembleTracerEquations_() - parallelizes the equation assembly process
- updateStorageCache() - parallelizes the storage computation and updates
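To make the chunked pattern concrete, here is a minimal, self-contained sketch of the idea. It is an illustration only, not the actual OPM code: the ElementChunk struct, the makeElementChunks() helper and the body of the cache update are simplified stand-ins introduced purely for this example.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one chunk: a contiguous range of element indices
// that a single thread processes on its own.
struct ElementChunk {
    std::size_t begin;
    std::size_t end;
};

// Partition numElements indices into roughly equally sized contiguous chunks.
std::vector<ElementChunk> makeElementChunks(std::size_t numElements,
                                            std::size_t numChunks)
{
    std::vector<ElementChunk> chunks;
    const std::size_t step =
        numChunks > 0 ? (numElements + numChunks - 1) / numChunks : numElements;
    for (std::size_t b = 0; b < numElements; b += step) {
        chunks.push_back({b, std::min(b + step, numElements)});
    }
    return chunks;
}

// Chunk-parallel cache update in the spirit of updateStorageCache(): the
// outer loop over chunks is threaded, and each element is written by exactly
// one thread, so the loop body needs no synchronization.
void updateStorageCache(std::vector<double>& storageCache,
                        const std::vector<double>& tracerStorage,
                        const std::vector<ElementChunk>& elementChunks)
{
    const std::ptrdiff_t numChunks =
        static_cast<std::ptrdiff_t>(elementChunks.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (std::ptrdiff_t c = 0; c < numChunks; ++c) {
        for (std::size_t elem = elementChunks[c].begin;
             elem < elementChunks[c].end; ++elem) {
            storageCache[elem] = tracerStorage[elem];
        }
    }
}

In the real model the loop body would compute the tracer storage from the current solution rather than copy a vector; the point here is only the loop structure that the "#pragma omp parallel for" is applied to.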
Scaling performance measurements (Norne)
Comparison with master (Norne)
master
Pre/post step time of 62.11 s (30.3% of total simulation time)
This PR
Pre/post step time of 44.89 s (24.0% of total simulation time)
Profiling results
A breakdown of the time spent in the different parts is given below, showing the first time step on Norne with 12 threads.
master
this PR
The majority of the time was spent in the tracer equation assembly (104 ms), which was reduced to 43 ms. However, even after the optimisations, this is still the largest contributor. Other ideas to further improve the thread parallelisation of the pre/post section are welcome.
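To illustrate why the assembly loop can be threaded in the same chunked way, here is another purely illustrative sketch (again not the actual OPM code). It assumes each element writes only its own residual entry and shows only the accumulation term; fluxes, wells and the sparse matrix are left out.

#include <cstddef>
#include <vector>

// Same hypothetical chunk type as in the sketch further up.
struct ElementChunk {
    std::size_t begin;
    std::size_t end;
};

// Simplified chunk-parallel assembly in the spirit of
// assembleTracerEquations_(): because element `elem` only writes
// residual[elem], two threads never touch the same entry and the outer loop
// over chunks can be an OpenMP parallel for.
void assembleTracerResidual(std::vector<double>& residual,
                            const std::vector<double>& storageOld,
                            const std::vector<double>& storageNew,
                            const std::vector<ElementChunk>& elementChunks,
                            double dt)
{
    const std::ptrdiff_t numChunks =
        static_cast<std::ptrdiff_t>(elementChunks.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (std::ptrdiff_t c = 0; c < numChunks; ++c) {
        for (std::size_t elem = elementChunks[c].begin;
             elem < elementChunks[c].end; ++elem) {
            // accumulation term only: (storage_new - storage_old) / dt;
            // flux and well contributions are omitted from this sketch.
            residual[elem] = (storageNew[elem] - storageOld[elem]) / dt;
        }
    }
}

If contributions to neighbouring rows were needed, this simple pattern would no longer be race-free, which is one reason further gains in this part are harder to come by.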