feedback on group load/store #16800

Open
jinz2014 opened this issue Jan 27, 2025 · 4 comments

@jinz2014
Contributor

Hello

Please see the benchmark https://github.com/zjin-lcf/HeCBench/tree/master/src/blockAccess-sycl

"main.cpp" evaluates group load/store as migrated from CUB's block load/store.
"main2.cpp" evaluates the SYCL group load/store extension.

The execution times of the two blockAccess kernels on an Intel Max 1100 device are shown below:

icpx -std=c++17 -Wall -fsycl --gcc-toolchain= -O3 -DUSE_GPU -c main.cpp -o main.o
icpx -std=c++17 -Wall -fsycl --gcc-toolchain= -O3 -DUSE_GPU main.o -o main
./main 8192 8192 100
Average execution time of the reference kernel: 1317.816895 (us)
Average execution time of the blockAccess kernel: 606.993347 (us)

icpx -std=c++17 -Wall -fsycl --gcc-toolchain= -O3 -DUSE_GPU -c main2.cpp -o main2.o
icpx -std=c++17 -Wall -fsycl --gcc-toolchain= -O3 -DUSE_GPU main2.o -o main
./main 8192 8192 100
Average execution time of the reference kernel: 1422.783325 (us)
Average execution time of the blockAccess kernel: 2391.377686 (us)

I am not sure whether the kernel in main.cpp is faster than the one in main2.cpp because it uses shared local memory.

Another question: does the SYCL group load/store function need an argument like "block_items_end – [in] Number of valid items to load"?
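
To make the question concrete, the following host-only C++ sketch models what a guarded blocked load with a "number of valid items" argument would do. It does not call SYCL or CUB. The function name `load_blocked_guarded`, the parameter `valid_items`, and the `fill` value for out-of-range slots are all assumptions made for illustration, not part of either API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only; no SYCL or CUB calls are made.
// Hypothetical helper: out[w][j] is the j-th private element of work-item w.
// Only source indices below valid_items are read; every other slot keeps
// the caller-supplied fill value (an assumption of this sketch).
std::vector<std::vector<int>> load_blocked_guarded(
    const std::vector<int>& src, std::size_t group_size,
    std::size_t items_per_wi, std::size_t valid_items, int fill) {
  assert(valid_items <= src.size());
  std::vector<std::vector<int>> out(group_size,
                                    std::vector<int>(items_per_wi, fill));
  for (std::size_t w = 0; w < group_size; ++w) {
    for (std::size_t j = 0; j < items_per_wi; ++j) {
      // Blocked placement: work-item w owns a contiguous slice of the input.
      const std::size_t idx = w * items_per_wi + j;
      if (idx < valid_items) {
        out[w][j] = src[idx];
      }
    }
  }
  return out;
}
```

With 4 work-items, 2 items each, and `valid_items = 5`, the first two work-items load full slices, the third loads one element, and the fourth loads none.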

Thanks

@aelovikov-intel
Contributor

The implementation doesn't apply any optimizations without the sycl::ext::oneapi::experimental::full_group property. See the extension documentation at https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_group_load_store.asciidoc#groups-partitioning

@jinz2014
Contributor Author

jinz2014 commented Feb 1, 2025

Thanks. I added the following properties ("props") to the group load and store calls:

  auto props = sycl_exp::properties{sycl_exp::data_placement_blocked,
                                    sycl_exp::contiguous_memory,
                                    sycl_exp::full_group};

e.g. sycl_exp::group_load(g, A+i, sycl::span{vals}, props);
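
For readers unfamiliar with the data placement property used above, here is a host-only C++ sketch of the blocked and striped index mappings as I understand them from the extension document. It makes no SYCL calls; the helper names `load_blocked` and `load_striped` and the `PerItem` alias are hypothetical and exist only in this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only; no SYCL calls are made.
// per_item[w][j] is the j-th private element of (hypothetical) work-item w.
using PerItem = std::vector<std::vector<int>>;

// Blocked placement: work-item w gets the contiguous slice that starts
// at src[w * items_per_wi].
PerItem load_blocked(const std::vector<int>& src, std::size_t group_size,
                     std::size_t items_per_wi) {
  assert(src.size() == group_size * items_per_wi);
  PerItem out(group_size, std::vector<int>(items_per_wi));
  for (std::size_t w = 0; w < group_size; ++w)
    for (std::size_t j = 0; j < items_per_wi; ++j)
      out[w][j] = src[w * items_per_wi + j];
  return out;
}

// Striped placement: element j of work-item w is src[w + j * group_size],
// so neighbouring work-items touch neighbouring addresses at each step j.
PerItem load_striped(const std::vector<int>& src, std::size_t group_size,
                     std::size_t items_per_wi) {
  assert(src.size() == group_size * items_per_wi);
  PerItem out(group_size, std::vector<int>(items_per_wi));
  for (std::size_t w = 0; w < group_size; ++w)
    for (std::size_t j = 0; j < items_per_wi; ++j)
      out[w][j] = src[w + j * group_size];
  return out;
}
```

For an input of 0 through 7 with 4 work-items and 2 items each, the blocked mapping gives work-item 0 the elements {0, 1}, while the striped mapping gives it {0, 4}.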

./main 8192 8192 100
Average execution time of the reference kernel: 1117.829102 (us)
Average execution time of the blockAccess kernel: 2393.927979 (us)

Is this reproducible?

@aelovikov-intel
Contributor

> Is this reproducible?

I haven't tried, but the link above doesn't have any changes. Also, why has your baseline improved in the last comment when you supposedly changed the "blocked" code path?

@jinz2014
Contributor Author

jinz2014 commented Feb 1, 2025

I made the changes locally and didn't push them to the repository. I didn't make any changes to the baseline. Sorry about the confusion.
