Add support for different src and dst datatypes in the SYCL implementation for pooling #1878

kala855 · 2024-04-25T13:07:49Z

Description

This pull request facilitates the utilization of distinct source (src) and destination (dst) datatypes during the forward inference pass within the SYCL pooling operator implementation.

dzarukin · 2024-04-25T15:50:55Z

tests/benchdnn/pool/ref_pool.cpp

@@ -38,6 +38,9 @@ void compute_ref_fwd(const prb_t *prb, const args_t &args) {
        // XXX: this is a hack to let tests with padded area to pass for bf16
        // dt due to the library initialize values with -max_dt, but not -INF.
        float max_value = lowest_dt(prb->dst_dt());
+        if (prb->src_dt() == dnnl_u8 || prb->src_dt() == dnnl_s8
+                || prb->src_dt() == dnnl_f16 || prb->src_dt() == dnnl_f32)
+            max_value = lowest_dt(prb->src_dt());


Could you elaborate, please, why is this needed?

Thanks for the comment. Line 40 is a previous hack that was there to deal with bf16. Because of that hack enabling different src/dst datatypes failed, the max_value was taking into account the dst datatype instead of the src. When src or dst uses different datatypes there are cases where max_value takes negative values even when src_dt=u8 (just to mention an example). Checking the src_dt instead of dst_dt allows us to be sure that we initialize the max_value variable correctly.

Different data types are supported for other backends and yet so far everything worked beside that issue.
Please post benchdnn command and output without this patch to benchdnn where it fails. 2-3 would be enough. Thanks.

The output I was having when enabling the new src/dst combinations are:

./tests/benchdnn/benchdnn --pool --engine=gpu --dir=FWD_I --dt=u8:s8 --tag=axb ic35iw13ow14kw3pw3 [ 0][DST][0:0:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 14][DST][0:1:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 28][DST][0:2:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 42][DST][0:3:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 56][DST][0:4:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 70][DST][0:5:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 84][DST][0:6:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 98][DST][0:7:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 112][DST][0:8:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [ 126][DST][0:9:0] exp_f32: -128 exp: -128 got: 0 diff: 128 rdiff: 1 [COMPARE_STATS][DST]: trh=0 err_max_diff: 128 err_max_rdiff: 1 all_max_diff: 128 all_max_rdiff: 1 0:FAILED (errors:70 total:980) __REPRO: --pool --engine=gpu --dir=FWD_I --dt=u8:s8 --tag=axb ic35iw13ow14kw3pw3 tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0 total: 1.63s; fill: 0.44s (27%); compute_ref: 0.00s (0%); compare: 0.02s (1%); TBB Warning: Leaked 2 observer_proxy objects

./tests/benchdnn/benchdnn --pool --engine=gpu --dir=FWD_I --dt=f16:u8 --tag=axb --attr-post-ops=add:f32:per_oc ic32ih32iw13oh17ow14kh4kw6sh2sw1ph4pw6 [ 0][DST][0:0:0:0] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 1][DST][0:0:0:1] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 2][DST][0:0:0:2] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 3][DST][0:0:0:3] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 4][DST][0:0:0:4] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 5][DST][0:0:0:5] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 6][DST][0:0:0:6] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 7][DST][0:0:0:7] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 8][DST][0:0:0:8] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [ 9][DST][0:0:0:9] exp_f32: 13.5 exp: 14 got: 0 diff: 14 rdiff: 1 [COMPARE_STATS][DST]: trh=0 err_max_diff: 16 err_max_rdiff: 1 all_max_diff: 16 all_max_rdiff: 1 0:FAILED (errors:1998 total:15232) __REPRO: --pool --engine=gpu --dir=FWD_I --dt=f16:u8 --tag=axb --attr-post-ops=add:f32:per_oc ic32ih32iw13oh17ow14kh4kw6sh2sw1ph4pw6 tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0 total: 1.35s; fill: 0.17s (12%); compute_ref: 0.00s (0%); compare: 0.03s (2%); TBB Warning: Leaked 2 observer_proxy objects

./tests/benchdnn/benchdnn --pool --engine=gpu --dir=FWD_I --dt=s8:f32 --tag=axb ic35iw13ow14kw3pw3 [ 0][DST][0:0:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 14][DST][0:1:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 28][DST][0:2:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 42][DST][0:3:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 56][DST][0:4:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 70][DST][0:5:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 84][DST][0:6:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 98][DST][0:7:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 112][DST][0:8:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [ 126][DST][0:9:0] exp_f32:-3.40282e+38 exp:-3.40282e+38 got: -128 diff:3.40282e+38 rdiff: 1 [COMPARE_STATS][DST]: trh=0 err_max_diff:3.40282e+38 err_max_rdiff: 1 all_max_diff:3.40282e+38 all_max_rdiff: 1 0:FAILED (errors:70 total:980) __REPRO: --pool --engine=gpu --dir=FWD_I --dt=s8:f32 --tag=axb ic35iw13ow14kw3pw3 tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0 total: 1.25s; fill: 0.06s (5%); compute_ref: 0.00s (0%); compare: 0.08s (7%); TBB Warning: Leaked 2 observer_proxy objects

What I think is that the values initialized in the tests/benchdnn/pool/ref_pool.cpp source file were breaking the tests. Let me know what do you think about it. Thanks for your help.

@kala855, thanks for these examples. It seems there's no clear option how to proceed with it, and it will require clarification with users on what's the expected behavior here. Two possible situations:

In the sense that there's no quantized operations defined in framework graphs, all operations are considered as f32/bf16. It means that data would be converted to f32/bf16, computed, and then converted to output data type separately. This is aligned with current benchdnn expectations.

The quantized operation uses src_dt as a data type to update values for max pooling. Then converts it into f32 to apply post-ops, if any, and then converts to dst_dt. This is aligned with the new change and breaking the expectations of the previous one.

A request here is to postpone this PR until we get clarifications from frameworks and proceed with the most reasonable option. Thanks.

I'd expect benchdnn change to be dropped from the final version with limited int8 support...

@dzarukin I did take into account the suggestions made. However, with limited int8 support, this part of the code still needs to be updated. Because we accept different src/dst datatypes we need to initialize the max pooling algorithm differently for Nvidia GPU. I included a conditional here to initialize the data using prb->src_dt().

dzarukin · 2024-04-25T15:51:15Z

src/gpu/sycl/ref_pooling.hpp

+                    && (!utils::one_of(f64, src_md(0)->data_type)
+                            && !utils::one_of(f64, dst_md(0)->data_type))
+                    && (IMPLICATION(utils::one_of(bf16, src_md(0)->data_type),
+                            utils::one_of(bf16, dst_md(0)->data_type)))


Could you explain, please, what makes bf16 so special?

In this instance, we're adhering to the specifications outlined in the table available here. Notably, the bf16 scenario stands out as one case requiring identical source and destination data types. Thanks for the comment. Let me know if something else needs to be clarified.

I don't see a strong reason to limit the support if it can work as expected.

kala855 · 2024-06-17T09:46:11Z

@dzarukin We take into account what was discussed during the meeting. The PR now avoids applying pooling src:dst (u8:u8, s8:s8).

mgouicem · 2024-06-25T07:54:15Z

@dzarukin We take into account what was discussed during the meeting. The PR now avoids applying pooling src:dst (u8:u8, s8:s8).

Could you clarify what issue does this fix?
IIUC, the main problem is in the selection of datatype for the max value computation (see this comment). In the case where both src and dst have same datatype, there should be no ambiguity.

mgouicem · 2024-06-25T07:39:17Z

src/gpu/sycl/ref_pooling.hpp

+                    && f64 != src_md(0)->data_type
+                    && f64 != dst_md(0)->data_type
+                    && (IMPLICATION(bf16 == src_md(0)->data_type,
+                            bf16 == dst_md(0)->data_type))


this is redundant with line 56-57.

It was fixed in the current commit. Thanks.

mgouicem · 2024-06-25T07:42:33Z

src/gpu/sycl/ref_pooling.hpp

+                            dst_md(0)->data_type != u8))
+                    && (IMPLICATION(
+                            src_md(0)->data_type != dst_md(0)->data_type,
+                            desc_.prop_kind == forward_inference))


Not sure I understand the logic here. You are disabling src and dst datatype to be the same except for bf16.
Having the same datatype for src and dst should be valid.

In this table src/dst datatypes could be different just when forward_inference is used. The bf16 dt is not combined with any other dt. Maybe I am misunderstanding something in the table. Let me know if this clarifies a little bit the questions. Thanks.

kala855 · 2024-06-26T07:50:28Z

@dzarukin We take into account what was discussed during the meeting. The PR now avoids applying pooling src:dst (u8:u8, s8:s8).

Could you clarify what issue does this fix? IIUC, the main problem is in the selection of datatype for the max value computation (see this comment). In the case where both src and dst have same datatype, there should be no ambiguity.

@mgouicem We received feedback mentioning that we need to avoid the use of u8:u8 and s8:s8. Maybe @dzarukin have any additional clue about it. If this combination is accepted we need to enabled it an run some tests. I am in line with your thoughts about the max value computation.

dzarukin · 2024-07-01T05:56:41Z

@mgouicem We received feedback mentioning that we need to avoid the use of u8:u8 and s8:s8. Maybe @dzarukin have any additional clue about it. If this combination is accepted we need to enabled it an run some tests. I am in line with your thoughts about the max value computation.

u8:s8 and s8:u8 have semantic problems.

kala855 · 2024-07-03T09:34:16Z

@mgouicem We received feedback mentioning that we need to avoid the use of u8:u8 and s8:s8. Maybe @dzarukin have any additional clue about it. If this combination is accepted we need to enabled it an run some tests. I am in line with your thoughts about the max value computation.

u8:s8 and s8:u8 have semantic problems.

@dzarukin just to be sure ... The ones we will need to avoid are u8:s8 and s8:u8 ... However, u8:u8 and s8:s8 are allowed?
The previous just to be sure we are talking about the same and proceed to do the corrections. Thank you.

dzarukin · 2024-07-03T19:08:47Z

@mgouicem We received feedback mentioning that we need to avoid the use of u8:u8 and s8:s8. Maybe @dzarukin have any additional clue about it. If this combination is accepted we need to enabled it an run some tests. I am in line with your thoughts about the max value computation.

u8:s8 and s8:u8 have semantic problems.

@dzarukin just to be sure ... The ones we will need to avoid are u8:s8 and s8:u8 ... However, u8:u8 and s8:s8 are allowed? The previous just to be sure we are talking about the same and proceed to do the corrections. Thank you.

Yes, those are allowed and should be supported. There's no ambiguity in such combinations - same init value is used on both ends.

kala855 · 2024-07-10T09:32:37Z

@mgouicem We received feedback mentioning that we need to avoid the use of u8:u8 and s8:s8. Maybe @dzarukin have any additional clue about it. If this combination is accepted we need to enabled it an run some tests. I am in line with your thoughts about the max value computation.

u8:s8 and s8:u8 have semantic problems.

@dzarukin just to be sure ... The ones we will need to avoid are u8:s8 and s8:u8 ... However, u8:u8 and s8:s8 are allowed? The previous just to be sure we are talking about the same and proceed to do the corrections. Thank you.

Yes, those are allowed and should be supported. There's no ambiguity in such combinations - same init value is used on both ends.

The implementation now does what is suggested. Let me know if something else is missing @dzarukin . Thank you.

dzarukin · 2024-07-10T21:30:51Z

tests/benchdnn/pool/ref_pool.cpp

@@ -38,6 +38,9 @@ void compute_ref_fwd(const prb_t *prb, const args_t &args) {
        // XXX: this is a hack to let tests with padded area to pass for bf16
        // dt due to the library initialize values with -max_dt, but not -INF.
        float max_value = lowest_dt(prb->dst_dt());
+        if (prb->src_dt() == dnnl_u8 || prb->src_dt() == dnnl_s8
+                || prb->src_dt() == dnnl_f16 || prb->src_dt() == dnnl_f32)
+            max_value = lowest_dt(prb->src_dt());


I'd expect benchdnn change to be dropped from the final version with limited int8 support...

dzarukin · 2024-07-10T21:31:24Z

src/gpu/generic/sycl/ref_pooling.hpp

+                    && (!utils::one_of(f64, src_md(0)->data_type)
+                            && !utils::one_of(f64, dst_md(0)->data_type))


Suggested change

&& (!utils::one_of(f64, src_md(0)->data_type)

&& !utils::one_of(f64, dst_md(0)->data_type))

&& (!utils::one_of(f64, src_md(0)->data_type, dst_md(0)->data_type))

Change added.

dzarukin · 2024-07-10T21:32:27Z

src/gpu/generic/sycl/ref_pooling.hpp

+                            dst_md(0)->data_type != s8))
+                    && (IMPLICATION(
+                            src_md(0)->data_type != dst_md(0)->data_type,
+                            desc_.prop_kind == forward_inference))


Suggested change

desc_.prop_kind == forward_inference))

desc()->prop_kind == forward_inference))

Change added.

dzarukin · 2024-07-10T21:32:44Z

src/gpu/generic/sycl/ref_pooling.hpp

+                    && (IMPLICATION(utils::one_of(bf16, src_md(0)->data_type),
+                            utils::one_of(bf16, dst_md(0)->data_type)))


I'd stick to plain comparison.

All of the changes were included. Let me know if something else is needed. Thank you.

kala855 requested review from densamoilov, dzarukin, igorsafo and mgouicem April 25, 2024 14:32

dzarukin reviewed Apr 25, 2024

View reviewed changes

vpirogov added this to the v3.6 milestone May 21, 2024

kala855 requested a review from dzarukin June 17, 2024 09:46

mgouicem reviewed Jun 25, 2024

View reviewed changes

kala855 force-pushed the kala855/sycl-pooling-dt branch from c1c7999 to d143eec Compare June 26, 2024 07:37

kala855 requested a review from mgouicem July 3, 2024 09:34

kala855 force-pushed the kala855/sycl-pooling-dt branch from d143eec to f017483 Compare July 4, 2024 09:11

dzarukin reviewed Jul 10, 2024

View reviewed changes

kala855 force-pushed the kala855/sycl-pooling-dt branch from f017483 to 81aa661 Compare July 11, 2024 09:26

kala855 requested review from dzarukin, t4c1, ShanoToni, sgeor255 and dylanangus July 11, 2024 09:28

t4c1 approved these changes Jul 19, 2024

View reviewed changes

gpu: sycl: pooling: Enabling different src/dst dt

9879e77

kala855 force-pushed the kala855/sycl-pooling-dt branch from 81aa661 to 9879e77 Compare July 23, 2024 13:19

ShanoToni approved these changes Jul 25, 2024

View reviewed changes

vpirogov added the platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic label Jul 26, 2024

densamoilov approved these changes Jul 31, 2024

View reviewed changes

densamoilov merged commit de28cf2 into main Jul 31, 2024
10 checks passed

densamoilov deleted the kala855/sycl-pooling-dt branch July 31, 2024 03:17

		&& (!utils::one_of(f64, src_md(0)->data_type)
		&& !utils::one_of(f64, dst_md(0)->data_type))

	desc_.prop_kind == forward_inference))
	desc()->prop_kind == forward_inference))

		&& (IMPLICATION(utils::one_of(bf16, src_md(0)->data_type),
		utils::one_of(bf16, dst_md(0)->data_type)))

Add support for different src and dst datatypes in the SYCL implementation for pooling #1878

Add support for different src and dst datatypes in the SYCL implementation for pooling #1878

Uh oh!

Conversation

kala855 commented Apr 25, 2024

Description

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kala855 Jul 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kala855 commented Jun 17, 2024

Uh oh!

mgouicem commented Jun 25, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kala855 commented Jun 26, 2024

Uh oh!

dzarukin commented Jul 1, 2024

Uh oh!

kala855 commented Jul 3, 2024

Uh oh!

dzarukin commented Jul 3, 2024

Uh oh!

kala855 commented Jul 10, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kala855 Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

kala855 Jul 23, 2024 •

edited

Loading

kala855 Jul 11, 2024 •

edited

Loading