[DRAFT] Changes fp8 implementation to more closely match NCCL, and added logic to handle FNUZ types. #1619

corey-derochie-amd · 2025-03-27T16:14:12Z

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
One sentence describing the work done.

Why were the changes made?
Explain the motivation behind the work. Provide any publicly-available historical context.

How was the outcome achieved?
Technical details behind the work. Explain any publicly-available hardware peculiarities.

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

Verify the CHANGELOG has been updated, if
- there are any NCCL API version changes,
- any changes impact library users, and/or
- any changes impact any other ROCm library.

src/include/rccl_float8.h

test/common/CollectiveArgs.cpp

test/common/PtrUnion.cpp

alex-breslow-amd · 2025-04-07T20:23:02Z

src/misc/msccl/msccl_setup.cc

@@ -277,11 +277,19 @@ static ncclResult_t hostToDevRedOp(
    #if defined(RCCL_FLOAT8)
    case ncclFloat8e4m3:
      opFull->op = ncclDevPreMulSum;
-      fp8_e4m3 = (rccl_float8)(float(1.0/comm->nRanks));
+      if (rccl_float8_useFnuz) {
+        fp8_e4m3_fnuz = (rccl_float8_fnuz)(float(1.0/comm->nRanks));


Why do we cast from double to float and then again to the 8-bit float type? Are casts from double to rccl_float8_fnuz not possible?

I'm also curious why we're not using static_cast instead of C-style casts.
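To make the two questions above concrete, here is a minimal sketch of the same double → float → 8-bit chain written with static_cast. ToyFloat8 and preMulScale are hypothetical names invented for this illustration; ToyFloat8 is not rccl_float8_fnuz and does not use a real fp8 encoding. The pair of explicit constructors is an assumption, chosen to show one situation in which a direct cast from double would fail to compile. Whether that applies to the RCCL types depends on their actual constructors, which this excerpt does not show.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for an 8-bit float type. This is NOT RCCL's
// rccl_float8_fnuz and NOT a real fp8 encoding: it stores value * 64 in a byte.
struct ToyFloat8 {
  std::uint8_t bits;

  // Two explicit converting constructors, assumed purely for illustration.
  explicit ToyFloat8(float v)
      : bits(static_cast<std::uint8_t>(std::lround(
            std::fmin(std::fmax(v, 0.0f), 255.0f / 64.0f) * 64.0f))) {}
  explicit ToyFloat8(int rawBits) : bits(static_cast<std::uint8_t>(rawBits)) {}

  float toFloat() const { return static_cast<float>(bits) / 64.0f; }
};

// Computes the 1/nRanks scale factor in double, then narrows it in two
// explicit steps, mirroring the cast chain in the diff hunk.
inline ToyFloat8 preMulScale(int nRanks) {
  assert(nRanks > 0);
  const double scale = 1.0 / nRanks;
  // static_cast<ToyFloat8>(scale) would be ambiguous for this toy type:
  // double -> float and double -> int are conversions of equal rank, so
  // neither constructor is preferred. The intermediate float picks one.
  return static_cast<ToyFloat8>(static_cast<float>(scale));
}
```

With this toy encoding, preMulScale(8) stores the byte 8, which decodes back to 0.125. Unlike a C-style cast, static_cast cannot silently fall back to a reinterpret_cast or const_cast, which is the usual argument for preferring it in new code.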

alex-breslow-amd · 2025-04-07T20:25:37Z

test/common/PtrUnion.cpp

-      case ncclFloat8e4m3:  F1[idx] = rccl_float8(ReduceOp(op, float(F1[idx]), float(inputCpu.F1[idx]))); break;
+      case ncclFloat8e4m3:
+      {
+        if (PtrUnion_Float8UseFnuz) {


Does this slow down rccl unit testing? If so, by how much?
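If the per-element flag check does turn out to be measurable, one common alternative is to test the flag once and dispatch to a loop templated on the format. The sketch below uses hypothetical names throughout (ToyFormatA, ToyFormatB, reduceBranchInLoop, reduceBranchHoisted) and toy quantizers in place of the real fp8 conversions; it is not the code in PtrUnion.cpp. Compilers can often hoist such a branch on their own, so only timing the tests would answer the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy quantizers standing in for two 8-bit float formats (assumptions only).
struct ToyFormatA {
  static float quantize(float v) { return std::round(v * 4.0f) / 4.0f; }
};
struct ToyFormatB {
  static float quantize(float v) { return std::round(v * 2.0f) / 2.0f; }
};

// Variant 1: the flag is tested for every element, as in the diff hunk.
inline void reduceBranchInLoop(std::vector<float>& acc,
                               const std::vector<float>& in, bool useB) {
  assert(acc.size() == in.size());
  for (std::size_t i = 0; i < acc.size(); ++i) {
    const float sum = acc[i] + in[i];
    acc[i] = useB ? ToyFormatB::quantize(sum) : ToyFormatA::quantize(sum);
  }
}

// Inner loop with the format fixed at compile time, so it contains no branch.
template <typename Format>
void reduceLoop(std::vector<float>& acc, const std::vector<float>& in) {
  for (std::size_t i = 0; i < acc.size(); ++i) {
    acc[i] = Format::quantize(acc[i] + in[i]);
  }
}

// Variant 2: the flag is tested once, outside the loop.
inline void reduceBranchHoisted(std::vector<float>& acc,
                                const std::vector<float>& in, bool useB) {
  assert(acc.size() == in.size());
  if (useB) {
    reduceLoop<ToyFormatB>(acc, in);
  } else {
    reduceLoop<ToyFormatA>(acc, in);
  }
}
```

Both variants produce identical results; they differ only in where the flag is evaluated.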

corey-derochie-amd force-pushed the fp8-fnuz branch 2 times, most recently from 25f9475 to e4fcde4 · April 1, 2025 16:48

corey-derochie-amd commented Apr 1, 2025

src/include/rccl_float8.h (outdated)

corey-derochie-amd commented Apr 1, 2025

test/common/CollectiveArgs.cpp (outdated)

corey-derochie-amd added 7 commits April 1, 2025 17:45

Adapted fp8 code for compatibility with NCCL

9ac1e8a

Initial refactor of rccl_float8 to prepare for hip types.

9c9de0a

Added fnuz types without using them yet

03a82c6

Added FNUZ types to the generated kernels

5150f1a

Added flag to track use of fnuz types and logic to switch fp8 types based on the flag

842097e

Switched to hip fp8 types

e9c5b9e

Some code cleanup

da94a6f

corey-derochie-amd force-pushed the fp8-fnuz branch from 27a6182 to da94a6f · April 1, 2025 22:47

corey-derochie-amd marked this pull request as ready for review April 1, 2025 22:48

corey-derochie-amd requested review from wenkaidu, gilbertlee-amd, akolliasAMD, PedramAlizadeh, nusislam, nileshnegi, KawtharShafie, AtlantaPepsi, mberenjk, mustafabar, thananon, JhaShweta1, BertanDogancay, rahulvaidya20, jee7s, isaki001 and PJAvinash as code owners April 1, 2025 22:48

corey-derochie-amd requested review from AbandiGa, Nikhil-Nunna and haripriya-amd as code owners April 1, 2025 22:48

corey-derochie-amd marked this pull request as draft April 1, 2025 22:49

corey-derochie-amd changed the title ~~Changes fp8 implementation to more closely match NCCL, and added logi…~~ [DRAFT] Changes fp8 implementation to more closely match NCCL, and added logi… Apr 1, 2025

Removed unused downcast code from rccl_float8.h. Fixed HIP_VERSION check.

039d001

alex-breslow-amd reviewed Apr 7, 2025

test/common/PtrUnion.cpp (outdated)

alex-breslow-amd reviewed Apr 7, 2025

test/common/PtrUnion.cpp (outdated)

alex-breslow-amd reviewed Apr 7, 2025
