Provide generic and safe C++ interfaces for warp shuffle: Issue #2976 #3210

Open
wants to merge 5 commits into
base: main

soumikiith

@soumikiith soumikiith commented Dec 20, 2024

Description

closes #2976

I have provided a generic and safe C++ interface for warp shuffle (shuffle_sync only for now). The safety features include: (1) checking for allowable data types, and (2) handling variables that are 4 bytes (32 bits) wide.
I will soon add support for 16-bit and 64-bit data types.

Provide generic and safe C++ interfaces for warp shuffle: Issue #2976

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@soumikiith soumikiith requested review from a team as code owners December 20, 2024 13:01

copy-pr-bot bot commented Dec 20, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@fbusato
Contributor

fbusato commented Dec 20, 2024

Thanks for the contribution, @soumikiith. I have a couple of initial comments.

  • cmath provides a set of mathematical operations, while warp shuffles are about data movement. I would create a separate header, cuda/shuffle.
  • You don't need to handle all data types one by one, or by size. My suggestion is to create an array of uint32_t and then use memcpy. Even better if you find a way to use bit_cast.
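
The memcpy suggestion above can be sketched in plain C++ as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the name `shuffle_as_words` and the `shuffle_word` callable are hypothetical. In device code the callable would presumably wrap something like `__shfl_sync(mask, word, src_lane)`; here it is any function from `std::uint32_t` to `std::uint32_t`, so the packing logic can be checked on the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not the PR's code): sends a value of any trivially
// copyable type through a transport that only handles 32-bit words.
// `shuffle_word` is any callable taking and returning std::uint32_t; in
// device code it would wrap a 32-bit warp shuffle.
template <typename T, typename ShuffleWord>
T shuffle_as_words(const T& value, ShuffleWord shuffle_word)
{
  static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
  static_assert(std::is_default_constructible<T>::value, "T must be default constructible");

  // Round up so that 8-bit and 16-bit types still occupy one full word.
  constexpr std::size_t num_words =
    (sizeof(T) + sizeof(std::uint32_t) - 1) / sizeof(std::uint32_t);

  std::uint32_t words[num_words] = {};    // zero-filled, so padding bytes are defined
  std::memcpy(words, &value, sizeof(T));  // pack the object representation into words

  for (std::size_t i = 0; i < num_words; ++i) {
    words[i] = shuffle_word(words[i]);    // one 32-bit transfer per word
  }

  T result;
  std::memcpy(&result, words, sizeof(T)); // unpack the received words
  return result;
}
```

With this shape, a `double` costs two word transfers and a 16-bit type costs one, with no per-type overloads. `bit_cast` could replace the memcpy calls only when the source and destination types have exactly the same size, which is why the array-of-words step is still needed for sizes that are not a multiple of 4 bytes.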

@fbusato
Contributor

fbusato commented Dec 20, 2024

I updated #2976 to better formalize the features and checks of these functions.

@soumikiith
Author

soumikiith commented Dec 21, 2024

One question:

While computing laneid, can I use the modulo operator? Or is it preferable to fetch it directly from assembly using asm instructions?

Note that my doubt is only in the context of shfl_up and shfl_down.

Also, why does a mask value need to be passed to shfl_xor (I know that a default value is assigned)? Isn't passing laneMask sufficient?

@fbusato
Contributor

fbusato commented Dec 23, 2024

> While computing laneid, can I use the modulo operator? Or is it preferable to fetch it directly from assembly using asm instructions?

You can use the C++ API for PTX; see https://nvidia.github.io/cccl/libcudacxx/ptx/instructions/special_registers.html#laneid
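
For context on the modulo question, here is a host-side sketch of how the lane ID relates to the thread index. It relies on the CUDA programming model's rule that warps are formed from consecutive linear thread IDs within a block. The `Dim3` struct and `lane_from_thread_index` are hypothetical stand-ins for `threadIdx`/`blockDim`, not part of any library.

```cpp
#include <cassert>

constexpr unsigned warp_size = 32; // warp size on current NVIDIA GPUs

// Hypothetical stand-in for CUDA's threadIdx / blockDim.
struct Dim3 { unsigned x, y, z; };

// Lane ID derived from the linear thread index within the block.
inline unsigned lane_from_thread_index(Dim3 thread_idx, Dim3 block_dim)
{
  assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y &&
         thread_idx.z < block_dim.z);
  const unsigned linear = thread_idx.x + thread_idx.y * block_dim.x +
                          thread_idx.z * block_dim.x * block_dim.y;
  return linear % warp_size;
}
```

The sketch shows why `threadIdx.x % 32` alone is only safe for one-dimensional blocks (or when the block's x extent is a multiple of 32): in a 16x16 block, thread (0, 1, 0) has `threadIdx.x % 32 == 0` but sits in lane 16. Reading the special register through the PTX API linked above avoids having to make that assumption.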

> Also, why does a mask value need to be passed to shfl_xor (I know that a default value is assigned)? Isn't passing laneMask sufficient?

Referring to the official documentation, laneMask and mask have different meanings: mask represents the active lanes, while laneMask is the value applied with the XOR operator, i.e. laneid() ^ laneMask.
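
The distinction between the two parameters can be illustrated with a small host-side sketch. All function names here are hypothetical, and the validity check is only one example of a precondition a safe wrapper might enforce; it is not taken from this PR or from #2976.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned warp_size = 32; // warp size on current NVIDIA GPUs

// laneMask: selects WHICH lane a thread reads from in an XOR shuffle.
inline unsigned xor_source_lane(unsigned lane, unsigned lane_mask)
{
  assert(lane < warp_size && lane_mask < warp_size);
  return lane ^ lane_mask;
}

// mask: a 32-bit set saying WHICH lanes take part in the shuffle at all.
inline bool lane_in_mask(std::uint32_t mask, unsigned lane)
{
  assert(lane < warp_size);
  return ((mask >> lane) & 1u) != 0;
}

// Hypothetical precondition: both the calling lane and the lane it reads
// from must be members of `mask`.
inline bool xor_shuffle_is_valid(std::uint32_t mask, unsigned lane, unsigned lane_mask)
{
  return lane_in_mask(mask, lane) &&
         lane_in_mask(mask, xor_source_lane(lane, lane_mask));
}
```

For example, with laneMask = 16, lane 4 reads from lane 20. That exchange is fine under a full mask (0xFFFFFFFF) but not under 0x0000FFFF, because lane 20 does not participate. So laneMask alone cannot express which lanes are involved.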

@soumikiith
Author

Hi, I have added the checks (I still need to fix the assertion statements, though). Please review them and let me know whether they meet your requirements. I will soon commit the casting of different data types using memcpy.

Please let me know of any additional requirements.

@soumikiith
Author

Hi,
I have added the code to perform the __shfl operations for various data types. Please let me know if anything should be added or if anything is flawed. I will happily revise my code.

Merry Christmas!

Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

[FEA]: Provide generic and safe C++ interfaces for warp shuffle
2 participants