feedback on group load/store #16800

jinz2014 · 2025-01-27T17:47:22Z

Hello

Please see the benchmark https://github.com/zjin-lcf/HeCBench/tree/master/src/blockAccess-sycl

"main.cpp" evaluates group load/store as migrated from CUB's block load/store.
"main2.cpp" evaluates the SYCL group load/store.

The execution times of the two blockAccess kernels on an Intel Max 1100 device are shown below:

icpx -std=c++17 -Wall -fsycl --gcc-toolchain= -O3 -DUSE_GPU -c main.cpp -o main.o
icpx -std=c++17 -Wall -fsycl --gcc-toolchain= -O3 -DUSE_GPU main.o -o main
./main 8192 8192 100
Average execution time of the reference kernel: 1317.816895 (us)
Average execution time of the blockAccess kernel: 606.993347 (us)

icpx -std=c++17 -Wall -fsycl --gcc-toolchain= -O3 -DUSE_GPU -c main2.cpp -o main2.o
icpx -std=c++17 -Wall -fsycl --gcc-toolchain= -O3 -DUSE_GPU main2.o -o main
./main 8192 8192 100
Average execution time of the reference kernel: 1422.783325 (us)
Average execution time of the blockAccess kernel: 2391.377686 (us)

I am not sure whether the kernel in main.cpp is faster than the one in main2.cpp because of its use of shared local memory.

Another question: does the SYCL group load/store function need an argument such as "block_items_end – [in] Number of valid items to load"?

Thanks
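To make the "number of valid items" question concrete, here is a minimal host-side sketch of what a guarded blocked load does. It is plain C++, not the SYCL extension or the CUB API: the function name `guarded_blocked_load` and the `oob_default` fill value are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a guarded blocked load (not a SYCL or CUB API).
// Work-item `tid` owns elements [tid * ItemsPerThread, (tid + 1) * ItemsPerThread).
// Elements at or beyond `valid_items` are never read; they receive `oob_default`.
// Assumes valid_items <= block.size().
template <typename T, std::size_t ItemsPerThread>
void guarded_blocked_load(const std::vector<T>& block, std::size_t tid,
                          std::size_t valid_items, T oob_default,
                          T (&items)[ItemsPerThread]) {
  for (std::size_t j = 0; j < ItemsPerThread; ++j) {
    const std::size_t idx = tid * ItemsPerThread + j;
    items[j] = (idx < valid_items) ? block[idx] : oob_default;
  }
}
```

Without such an argument, the caller has to handle a partial last block separately, for example with a scalar tail loop.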

aelovikov-intel · 2025-01-31T21:24:55Z

The implementation doesn't apply any optimizations without the sycl::ext::oneapi::experimental::full_group property. See the extension documentation at https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_group_load_store.asciidoc#groups-partitioning

jinz2014 · 2025-02-01T00:05:22Z

Thanks. I added "props" to the group load and store calls:

  auto props = sycl_exp::properties{sycl_exp::data_placement_blocked, 
                                    sycl_exp::contiguous_memory,
                                    sycl_exp::full_group
                                   };

e.g. sycl_exp::group_load(g, A+i, sycl::span{vals}, props);

./main 8192 8192 100
Average execution time of the reference kernel: 1117.829102 (us)
Average execution time of the blockAccess kernel: 2393.927979 (us)

Is this reproducible?
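For readers unfamiliar with the data_placement_blocked property used above, the following host-side sketch shows the index mapping as I understand it from the extension document, next to the striped alternative. The helper names `blocked_index` and `striped_index` are hypothetical and are not part of the extension.

```cpp
#include <cstddef>

// Hypothetical helpers (not part of the SYCL extension): position of the j-th
// private element of work-item `tid` within the group's contiguous memory range.

// Blocked: each work-item owns `items_per_wi` consecutive elements.
constexpr std::size_t blocked_index(std::size_t tid, std::size_t j,
                                    std::size_t items_per_wi) {
  return tid * items_per_wi + j;
}

// Striped: consecutive work-items own consecutive elements, and each
// work-item's next element is `group_size` positions further on.
constexpr std::size_t striped_index(std::size_t tid, std::size_t j,
                                    std::size_t group_size) {
  return tid + j * group_size;
}
```

With blocked placement, neighbouring work-items touch addresses `items_per_wi` elements apart, so a straightforward per-work-item loop is not coalesced. This may be part of why the unoptimized path is slow, though that is an assumption and not something measured in this thread.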

aelovikov-intel · 2025-02-01T00:45:13Z

> Is this reproducible?

I haven't tried, but the link above doesn't have any changes. Also, why has your baseline improved in the last comment when you supposedly changed the "blocked" code path?

jinz2014 · 2025-02-01T01:31:01Z

I made the changes locally and didn't push them to the repository. I didn't make any changes to the baseline. Sorry about the confusion.

jinz2014 mentioned this issue Feb 3, 2025

question about DPCT1007 oneapi-src/SYCLomatic#2636

Closed
