Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EAMxx: refactor field manipulation interfaces, and add new ones #6970

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

bartgol
Copy link
Contributor

@bartgol bartgol commented Feb 4, 2025

Simplify implementation of the scale/scale_inv method, and add new ones for max/min.

[BFB]

First, the current implementation had several methods implemented that could just be expressed as a single call to update, if only update accepted different combine mode values (like update_impl did). The solution was simple: template update on the combine mode, defaulting to the current behavior (i.e., ScaleUpdate). With this modification, scale, scale_impl and also deep_copy (from a Field rhs) could be rewritten as a 1-liner call to update.

Second, the PR adds two more values for CombineMode, namely Max and Min. The goal is to later use Field operations inside scorpio_output.cpp, to avoid manually handling the same kind of stuff in that class, simplifying its implementation.

Finally, allow do do stuff like y.scale(x) if y's data type is, say, double but x's data type is int (but not the other way around!).

@bartgol bartgol added EAMxx PRs focused on capabilities for EAMxx code usability code cleanup labels Feb 4, 2025
@bartgol bartgol requested a review from tcclevenger February 4, 2025 00:43
@bartgol bartgol self-assigned this Feb 4, 2025
Copy link
Contributor

@tcclevenger tcclevenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int->double conversions are going to be nice!

@bartgol
Copy link
Contributor Author

bartgol commented Feb 4, 2025

@tcclevenger I added a few more improvements:

  • Use MDRange instead of Range+modularArithmetics: I chatted with Jim about this (he faced this choice in RRTMGP a lot) as well as kokkos folks. I think the idea is that (with some caveats) MD should be faster if the tiling is right, and especially for small kernels, where the cost of doing index arithmetic may be comparable (or larger) than the actual calculation. I hacked in some timing for a rank-3 field update (200 cols, 2 components, 128 levs) and indeed MD range was faster (almost 2x). It may be that the pb was too small, or other factors, but since it wasn't noticeably slower, I opted for clarity and less code. Maybe one day I'll do that in the field property checks too, and do away with the unflatten_idx utility, which is the only reason why FieldLayout stores a kokkos view for the extents.
  • Treat x and y separately when it comes to contiguity. This causes a duplication of template code instantiations, which is driving up compile time, but it makes sense to take advantage of contiguity if one of the two fields is contiguous, rather than use strided views for both...

The biggest rework compared to earlier is the CombineViewsHelper struct. I needed that on CUDA, since apparently you can't use an if constexpr inside a host-device lambda. So I had to create a functor.

@bartgol bartgol force-pushed the bartgol/eamxx/field-ops-upgrade branch from 09f6f77 to 094230b Compare February 5, 2025 00:03
@bartgol bartgol requested a review from tcclevenger February 5, 2025 20:11
components/eamxx/src/share/field/field.hpp Outdated Show resolved Hide resolved
@bartgol bartgol requested a review from tcclevenger February 10, 2025 23:41
@bartgol
Copy link
Contributor Author

bartgol commented Feb 11, 2025

@tcclevenger I'm adding some more stuff. Since compilation times are going up a lot, I am adding ETI for the update method. The combinations are quite a few, but mostly b/c of the fine grain combinations in CombineMode:

enum class CombineMode {
  ScaleUpdate,  // out = beta*out + alpha*in (most generic case)
  Update,       // out = beta*out + in (special case of ScaleUpdate wiht alpha=1)
  ScaleAdd,     // out = out + alpha*in (special case of ScaleUpdate with beta=1)
  ScaleReplace, // out = alpha*in (special case of ScaleUpdate with beta=0)
  Add,          // out = out + in (special case of ScaleUpdate with alpha=beta=1)
  Rescale,      // out = beta*out
  Replace,      // out = in
  Multiply,     // out = out*in
  Divide,       // out = out/in
  Max,          // out = max(out,in)
  Min           // out = min(out,in)
};

I was thinking: what if we remove all except the last four and the first? All others can be expressed in terms of ScaleUpdate (which we may just rename "Update" for simplicity). The reason I added all of those was to allow a single interface (which used alpha and beta), but guarantee zero overhead when the combine mode did not require that coefficient (e.g., avoid doing y=b*y+a*x when b=0 and a=1, and just do y=x). But I think these are very tiny optimization, and the perf-critical part of the code does not use Field::udpate calls, so maybe it's ok to just have

enum class CombineMode {
  Update,       // out = beta*out + alpha*in
  Multiply,     // out = alpha*out*in
  Divide,       // out = beta*out/in
  Max,          // out = max(out,in)
  Min           // out = min(out,in)
};

This would drastically reduce the number of ETI calls we do (by a factor 6/11). These remaining 5 cannot be expressed in terms of each other, so it clearly cannot be reduced further.

@bartgol bartgol force-pushed the bartgol/eamxx/field-ops-upgrade branch from e064584 to 44d44f5 Compare February 11, 2025 06:28
@bartgol
Copy link
Contributor Author

bartgol commented Feb 11, 2025

Ok, I decidedto pull the trigger on reducing CombineMode options, as well as adding some ETI for Field methods, which started to grow. I ran some timing tests on my laptop (not a great data point, but still):

without ETI:

$ time make field/field.cpp.o
[ 52%] Building CXX object src/share/CMakeFiles/scream_share.dir/field/field.cpp.o

real	0m29.734s
user	0m28.243s
sys	0m1.414s
$ time make field_tests.cpp.o
Building CXX object src/share/tests/CMakeFiles/field.dir/field_tests.cpp.o

real	2m21.124s
user	2m13.640s
sys	0m6.403s

with ETI:

$ time make field/field.cpp.o
[ 47%] Building CXX object src/share/CMakeFiles/scream_share.dir/field/field.cpp.o

real	0m36.233s
user	0m34.375s
sys	0m1.697s
$ time make field/field_eti_host.cpp.o
[ 47%] Building CXX object src/share/CMakeFiles/scream_share.dir/field/field_eti_host.cpp.o

real	0m22.793s
user	0m21.837s
sys	0m0.784s
$ time make field/field_eti_device.cpp.o
[ 52%] Building CXX object src/share/CMakeFiles/scream_share.dir/field/field_eti_device.cpp.o

real	0m54.430s
user	0m50.154s
sys	0m4.040s
$ time make field_tests.cpp.o
[ 52%] Building CXX object src/share/tests/CMakeFiles/field.dir/field_tests.cpp.o

real	1m0.103s
user	0m57.148s
sys	0m2.564s

The compilation of the templated methods for HostOrDevice::Device is what takes most of the time (that's b/c of the update methods, which have a lot of ramifications). I have only added ETI for update for Update/Multiply/Divide. The Max/Min impl are currently unused, although I plan to use them in the scorpio output class (I will just rely on implicit instantiation in that file).

Also, I split device and host ETI into two files, so that they can be built in parallel, reducing a bit the compilation time.

Notice that here I'm only showing the compilation time saved for field_tests.cpp.o, but all cpp.o files that include field.hpp and use these common ETI-ed methods will see some compilation time improvement. We are paying the price once in the compilation of field_eti_XYZ.cpp, but gaining across all the eamxx files that use field templated methods.

I will run a bit more precise experiments tomorrow.

@bartgol bartgol force-pushed the bartgol/eamxx/field-ops-upgrade branch from 44d44f5 to 903848c Compare February 11, 2025 17:13
* ETI separately for int/float/double and host/device
* ETI separately each function
* Remove possibility of updating a field with another with
  different (but compatible) scalar type. It generates
  too many template instantiations
* Do not implement deep_copy in terms of update. It would prevent
  replacing NaN with valid values, since update's arithmetic
  implementation would propagate NaN's.

Signed-off-by: Luca Bertagna <[email protected]>
@bartgol bartgol force-pushed the bartgol/eamxx/field-ops-upgrade branch from 85fbf10 to 8d577d4 Compare February 11, 2025 20:39
@bartgol
Copy link
Contributor Author

bartgol commented Feb 11, 2025

@tcclevenger this PR is growing into much more than it originally was. Yet, I think it's valuable. Here's a recap of all you will find:

  • Remove certain CombineMode enum values. They were particular cases of the y=ax+by update formula, for a or b being 0 or 1. Supporting these caused certain template methods to be compiled many more times, for virtually no gain. The performance gain of, say, avoiding a multiplication by 0 was likely minimal, especially since we don't really used these special cases much, and never in perf critical code.
  • Add min/max for CombineMode. This is not really needed here, but I am also rewriting scorpio_output to use field update methods (rather than roll pfor loops), and min/max will be needed there.
  • Implement scale/scale_inv in terms of the update function (since we no longer have CombineMode::Rescale).
  • In the field update method, we handle the case with fill value separately, by checking if the mask_value extra data is present, and calling udpate_impl with different compile-time args.
  • In Field::udpate, handle contiguity of lhs and rhs separately. That is, take advantage of contiguous views if possible (before we had either both using contiguous or both using strided)
  • The step above required adding a functor (CombineViewsHelper) to minimize src code duplication.
  • Add ETI for a few templated methods of Field. This may be the largest change. I decided to ETI some methods w.r.t to scalar type, host-or-device, and combine mode (where appropriate). For scalar type, I only ETI-ed int, float, and double. We could easily add pack types, for the default pack size, which may be also beneficial. I am wondering whether doing both float and double makes sense, since we currently never use anything other than Real. But the update methods, after checking the field data type, do some if/else stuff, and invoke get_view/get_strided_view also for float. Perhaps we could restrict usage to DataType=IntType and DataType=RealType; that may prove limiting if we ever explore mixed precision, but maybe it's the right thing to do for now. Let me know your thoughts.
  • The file field.hpp still includes field_impl.hpp, so that if the templated methods are invoked with non-ETI-ed parameters, the compiler can still implicitly instantiate them.

For the ETI, I clocked building the src folder (no tpls or system tests) on my workstation (haswell, 10 cores) with make -j10. I did that with ETI, as well as commenting out all the ETI-related stuff. The timings are:

  • with ETI: 6m 2s
  • without ETI: 11m 18s

Now, this was just my machine, and things may be quite different on other archs, but it definitely helped, so I expect it to help a bit on other platforms too.

Also, the PR has commits adding something, and then taking it back. Let me know if you'd rather me rewrite the git history in a cleaner way.

Cannot rely on Kokkos::deep_copy, since one view may be
padded and the other may be not padded.

Signed-off-by: Luca Bertagna <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code cleanup code usability EAMxx PRs focused on capabilities for EAMxx
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants