EAMxx: refactor field manipulation interfaces, and add new ones #6970

bartgol · 2025-02-04T00:43:57Z

Simplify implementation of the scale/scale_inv method, and add new ones for max/min.

[BFB]

First, the current implementation had several methods implemented that could just be expressed as a single call to update, if only update accepted different combine mode values (like update_impl did). The solution was simple: template update on the combine mode, defaulting to the current behavior (i.e., ScaleUpdate). With this modification, scale, scale_impl and also deep_copy (from a Field rhs) could be rewritten as a 1-liner call to update.

Second, the PR adds two more values for CombineMode, namely Max and Min. The goal is to later use Field operations inside scorpio_output.cpp, to avoid manually handling the same kind of stuff in that class, simplifying its implementation.

Finally, allow do do stuff like y.scale(x) if y's data type is, say, double but x's data type is int (but not the other way around!).

tcclevenger

int->double conversions are going to be nice!

bartgol · 2025-02-04T22:16:38Z

@tcclevenger I added a few more improvements:

Use MDRange instead of Range+modularArithmetics: I chatted with Jim about this (he faced this choice in RRTMGP a lot) as well as kokkos folks. I think the idea is that (with some caveats) MD should be faster if the tiling is right, and especially for small kernels, where the cost of doing index arithmetic may be comparable (or larger) than the actual calculation. I hacked in some timing for a rank-3 field update (200 cols, 2 components, 128 levs) and indeed MD range was faster (almost 2x). It may be that the pb was too small, or other factors, but since it wasn't noticeably slower, I opted for clarity and less code. Maybe one day I'll do that in the field property checks too, and do away with the unflatten_idx utility, which is the only reason why FieldLayout stores a kokkos view for the extents.
Treat x and y separately when it comes to contiguity. This causes a duplication of template code instantiations, which is driving up compile time, but it makes sense to take advantage of contiguity if one of the two fields is contiguous, rather than use strided views for both...

The biggest rework compared to earlier is the CombineViewsHelper struct. I needed that on CUDA, since apparently you can't use an if constexpr inside a host-device lambda. So I had to create a functor.

components/eamxx/src/share/field/field.hpp

bartgol · 2025-02-11T04:28:00Z

@tcclevenger I'm adding some more stuff. Since compilation times are going up a lot, I am adding ETI for the update method. The combinations are quite a few, but mostly b/c of the fine grain combinations in CombineMode:

enum class CombineMode {
  ScaleUpdate,  // out = beta*out + alpha*in (most generic case)
  Update,       // out = beta*out + in (special case of ScaleUpdate wiht alpha=1)
  ScaleAdd,     // out = out + alpha*in (special case of ScaleUpdate with beta=1)
  ScaleReplace, // out = alpha*in (special case of ScaleUpdate with beta=0)
  Add,          // out = out + in (special case of ScaleUpdate with alpha=beta=1)
  Rescale,      // out = beta*out
  Replace,      // out = in
  Multiply,     // out = out*in
  Divide,       // out = out/in
  Max,          // out = max(out,in)
  Min           // out = min(out,in)
};

I was thinking: what if we remove all except the last four and the first? All others can be expressed in terms of ScaleUpdate (which we may just rename "Update" for simplicity). The reason I added all of those was to allow a single interface (which used alpha and beta), but guarantee zero overhead when the combine mode did not require that coefficient (e.g., avoid doing y=b*y+a*x when b=0 and a=1, and just do y=x). But I think these are very tiny optimization, and the perf-critical part of the code does not use Field::udpate calls, so maybe it's ok to just have

enum class CombineMode {
  Update,       // out = beta*out + alpha*in
  Multiply,     // out = alpha*out*in
  Divide,       // out = beta*out/in
  Max,          // out = max(out,in)
  Min           // out = min(out,in)
};

This would drastically reduce the number of ETI calls we do (by a factor 6/11). These remaining 5 cannot be expressed in terms of each other, so it clearly cannot be reduced further.

bartgol · 2025-02-11T06:31:14Z

Ok, I decidedto pull the trigger on reducing CombineMode options, as well as adding some ETI for Field methods, which started to grow. I ran some timing tests on my laptop (not a great data point, but still):

without ETI:

$ time make field/field.cpp.o
[ 52%] Building CXX object src/share/CMakeFiles/scream_share.dir/field/field.cpp.o

real	0m29.734s
user	0m28.243s
sys	0m1.414s
$ time make field_tests.cpp.o
Building CXX object src/share/tests/CMakeFiles/field.dir/field_tests.cpp.o

real	2m21.124s
user	2m13.640s
sys	0m6.403s

with ETI:

$ time make field/field.cpp.o
[ 47%] Building CXX object src/share/CMakeFiles/scream_share.dir/field/field.cpp.o

real	0m36.233s
user	0m34.375s
sys	0m1.697s
$ time make field/field_eti_host.cpp.o
[ 47%] Building CXX object src/share/CMakeFiles/scream_share.dir/field/field_eti_host.cpp.o

real	0m22.793s
user	0m21.837s
sys	0m0.784s
$ time make field/field_eti_device.cpp.o
[ 52%] Building CXX object src/share/CMakeFiles/scream_share.dir/field/field_eti_device.cpp.o

real	0m54.430s
user	0m50.154s
sys	0m4.040s
$ time make field_tests.cpp.o
[ 52%] Building CXX object src/share/tests/CMakeFiles/field.dir/field_tests.cpp.o

real	1m0.103s
user	0m57.148s
sys	0m2.564s

The compilation of the templated methods for HostOrDevice::Device is what takes most of the time (that's b/c of the update methods, which have a lot of ramifications). I have only added ETI for update for Update/Multiply/Divide. The Max/Min impl are currently unused, although I plan to use them in the scorpio output class (I will just rely on implicit instantiation in that file).

Also, I split device and host ETI into two files, so that they can be built in parallel, reducing a bit the compilation time.

Notice that here I'm only showing the compilation time saved for field_tests.cpp.o, but all cpp.o files that include field.hpp and use these common ETI-ed methods will see some compilation time improvement. We are paying the price once in the compilation of field_eti_XYZ.cpp, but gaining across all the eamxx files that use field templated methods.

I will run a bit more precise experiments tomorrow.

* Implement scale/scale_inv in terms of update * Use fill_val only if needed

That is, we can do y.scale(x), if y's data type is double and x's data type is int, but not viceversa

In Field::update_impl, in order to use if constexpr in a kernel we must use a functor, and not an extended host-device lambda.

The scalar type of the two fields may differ, so we must retrieve fill val from *this and x according to their scalar types.

Namely, remove those that were special cases of ScaleUpdate, and rename ScaleUpdate simply Update

* ETI separately for int/float/double and host/device * ETI separately each function * Remove possibility of updating a field with another with different (but compatible) scalar type. It generates too many template instantiations * Do not implement deep_copy in terms of update. It would prevent replacing NaN with valid values, since update's arithmetic implementation would propagate NaN's. Signed-off-by: Luca Bertagna <[email protected]>

bartgol · 2025-02-11T22:16:24Z

@tcclevenger this PR is growing into much more than it originally was. Yet, I think it's valuable. Here's a recap of all you will find:

Remove certain CombineMode enum values. They were particular cases of the y=ax+by update formula, for a or b being 0 or 1. Supporting these caused certain template methods to be compiled many more times, for virtually no gain. The performance gain of, say, avoiding a multiplication by 0 was likely minimal, especially since we don't really used these special cases much, and never in perf critical code.
Add min/max for CombineMode. This is not really needed here, but I am also rewriting scorpio_output to use field update methods (rather than roll pfor loops), and min/max will be needed there.
Implement scale/scale_inv in terms of the update function (since we no longer have CombineMode::Rescale).
In the field update method, we handle the case with fill value separately, by checking if the mask_value extra data is present, and calling udpate_impl with different compile-time args.
In Field::udpate, handle contiguity of lhs and rhs separately. That is, take advantage of contiguous views if possible (before we had either both using contiguous or both using strided)
The step above required adding a functor (CombineViewsHelper) to minimize src code duplication.
Add ETI for a few templated methods of Field. This may be the largest change. I decided to ETI some methods w.r.t to scalar type, host-or-device, and combine mode (where appropriate). For scalar type, I only ETI-ed int, float, and double. We could easily add pack types, for the default pack size, which may be also beneficial. I am wondering whether doing both float and double makes sense, since we currently never use anything other than Real. But the update methods, after checking the field data type, do some if/else stuff, and invoke get_view/get_strided_view also for float. Perhaps we could restrict usage to DataType=IntType and DataType=RealType; that may prove limiting if we ever explore mixed precision, but maybe it's the right thing to do for now. Let me know your thoughts.
The file field.hpp still includes field_impl.hpp, so that if the templated methods are invoked with non-ETI-ed parameters, the compiler can still implicitly instantiate them.

For the ETI, I clocked building the src folder (no tpls or system tests) on my workstation (haswell, 10 cores) with make -j10. I did that with ETI, as well as commenting out all the ETI-related stuff. The timings are:

with ETI: 6m 2s
without ETI: 11m 18s

Now, this was just my machine, and things may be quite different on other archs, but it definitely helped, so I expect it to help a bit on other platforms too.

Also, the PR has commits adding something, and then taking it back. Let me know if you'd rather me rewrite the git history in a cleaner way.

Cannot rely on Kokkos::deep_copy, since one view may be padded and the other may be not padded. Signed-off-by: Luca Bertagna <[email protected]>

bartgol added EAMxx PRs focused on capabilities for EAMxx code usability code cleanup labels Feb 4, 2025

bartgol requested a review from tcclevenger February 4, 2025 00:43

bartgol self-assigned this Feb 4, 2025

tcclevenger approved these changes Feb 4, 2025

View reviewed changes

bartgol force-pushed the bartgol/eamxx/field-ops-upgrade branch from 09f6f77 to 094230b Compare February 5, 2025 00:03

bartgol requested a review from tcclevenger February 5, 2025 20:11

tcclevenger approved these changes Feb 7, 2025

View reviewed changes

components/eamxx/src/share/field/field.hpp Outdated Show resolved Hide resolved

bartgol requested a review from tcclevenger February 10, 2025 23:41

bartgol force-pushed the bartgol/eamxx/field-ops-upgrade branch from e064584 to 44d44f5 Compare February 11, 2025 06:28

bartgol force-pushed the bartgol/eamxx/field-ops-upgrade branch from 44d44f5 to 903848c Compare February 11, 2025 17:13

bartgol added 15 commits February 11, 2025 12:27

EAMxx: refactor field manipulation interfaces

1a327b8

* Implement scale/scale_inv in terms of update * Use fill_val only if needed

EAMxx: add max/min option for CombineMode

4df7510

EAMxx: add max/min update for Field

2ccf32f

EAMxx: implement more Field methods in terms of Field::update

4912f38

EAMxx: add early return for Field::deep_copy

a832c20

EAMxx: fix fill value type in Field::update

7546000

EAMxx: allow non-narrowing type conversion in Field::udpate

73aa8b0

That is, we can do y.scale(x), if y's data type is double and x's data type is int, but not viceversa

EAMxx: fix compilation errors in unit tests

75f3aed

EAMxx: handle non-contiguous fields in Field::update_impl

8de9d26

EAMxx: use MDRangePolicy in Field::update_impl

4f51301

EAMxx: work around if-constexpr incompatibility inside KOKKOS_LAMBDA

c978594

In Field::update_impl, in order to use if constexpr in a kernel we must use a functor, and not an extended host-device lambda.

EAMxx: treat x and y separately in update_impl w.r.t. congiguity

f425978

EAMxx: remove stranded unused typedef

5f35c47

EAMxx: fix retrieval of mask_value in Field::update

69287e8

The scalar type of the two fields may differ, so we must retrieve fill val from *this and x according to their scalar types.

EAMxx: remove most of the CombineMode enum values

82d694b

Namely, remove those that were special cases of ScaleUpdate, and rename ScaleUpdate simply Update

bartgol added 2 commits February 11, 2025 12:27

EAMxx: add some ETI for Field template methods

e9596c6

bartgol force-pushed the bartgol/eamxx/field-ops-upgrade branch from 85fbf10 to 8d577d4 Compare February 11, 2025 20:39

bartgol added 2 commits February 11, 2025 16:59

EAMxx: remove test of feature not supported

a3be215

EAMxx: fix Field::deep_copy_impl

65fc93a

Cannot rely on Kokkos::deep_copy, since one view may be padded and the other may be not padded. Signed-off-by: Luca Bertagna <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EAMxx: refactor field manipulation interfaces, and add new ones #6970

EAMxx: refactor field manipulation interfaces, and add new ones #6970

bartgol commented Feb 4, 2025 •

edited

Loading

tcclevenger left a comment

bartgol commented Feb 4, 2025

bartgol commented Feb 11, 2025

bartgol commented Feb 11, 2025 •

edited

Loading

bartgol commented Feb 11, 2025 •

edited

Loading

EAMxx: refactor field manipulation interfaces, and add new ones #6970

Are you sure you want to change the base?

EAMxx: refactor field manipulation interfaces, and add new ones #6970

Conversation

bartgol commented Feb 4, 2025 • edited Loading

[BFB]

tcclevenger left a comment

Choose a reason for hiding this comment

bartgol commented Feb 4, 2025

bartgol commented Feb 11, 2025

bartgol commented Feb 11, 2025 • edited Loading

bartgol commented Feb 11, 2025 • edited Loading

bartgol commented Feb 4, 2025 •

edited

Loading

bartgol commented Feb 11, 2025 •

edited

Loading

bartgol commented Feb 11, 2025 •

edited

Loading