PR #1796, "[CUDA] Implement urKernelSuggestMaxCooperativeGroupCountExp for Cuda", was merged for the related issue oneapi-src/unified-runtime#1424.
However, simply porting the implementation shown below from CUDA to HIP does not work, because the kernel handle (hKernel->get()) is not a valid argument for hipOccupancyMaxActiveBlocksPerMultiprocessor.
UR_APIEXPORT ur_result_t UR_APICALL urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, size_t localWorkSize,
    size_t dynamicSharedMemorySize, uint32_t *pGroupCountRet) {
  UR_ASSERT(hKernel, UR_RESULT_ERROR_INVALID_KERNEL);

  // We need to set the active current device for this kernel explicitly here,
  // because the occupancy querying API does not take device parameter.
  ur_device_handle_t Device = hKernel->getProgram()->getDevice();
  ScopedContext Active(Device);
  try {
    // We need to calculate max num of work-groups using per-device semantics.

    int MaxNumActiveGroupsPerCU{0};
    UR_CHECK_ERROR(cuOccupancyMaxActiveBlocksPerMultiprocessor(
        &MaxNumActiveGroupsPerCU, hKernel->get(), localWorkSize,
        dynamicSharedMemorySize));
    detail::ur::assertion(MaxNumActiveGroupsPerCU >= 0);

    // Handle the case where we can't have all SMs active with at least 1 group
    // per SM. In that case, the device is still able to run 1 work-group, hence
    // we will manually check if it is possible with the available HW resources.
    if (MaxNumActiveGroupsPerCU == 0) {
      size_t MaxWorkGroupSize{};
      urKernelGetGroupInfo(
          hKernel, Device, UR_KERNEL_GROUP_INFO_WORK_GROUP_SIZE,
          sizeof(MaxWorkGroupSize), &MaxWorkGroupSize, nullptr);
      size_t MaxLocalSizeBytes{};
      urDeviceGetInfo(Device, UR_DEVICE_INFO_LOCAL_MEM_SIZE,
                      sizeof(MaxLocalSizeBytes), &MaxLocalSizeBytes, nullptr);
      if (localWorkSize > MaxWorkGroupSize ||
          dynamicSharedMemorySize > MaxLocalSizeBytes ||
          hasExceededMaxRegistersPerBlock(Device, hKernel, localWorkSize))
        *pGroupCountRet = 0;
      else
        *pGroupCountRet = 1;
    } else {
      // Multiply by the number of SMs (CUs = compute units) on the device in
      // order to retrieve the total number of groups/blocks that can be
      // launched.
      *pGroupCountRet = Device->getNumComputeUnits() * MaxNumActiveGroupsPerCU;
    }
  } catch (ur_result_t Err) {
    return Err;
  }
  return UR_RESULT_SUCCESS;
}
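One possible way to structure a HIP port is to separate the group-count arithmetic from the backend occupancy call, so that only the query itself differs between CUDA and HIP. The following is a minimal sketch under stated assumptions, not the adapter's actual code: `DeviceLimits`, `OccupancyQuery`, and `suggestMaxCooperativeGroupCount` are hypothetical names introduced here for illustration. In a HIP adapter the query might be backed by `hipModuleOccupancyMaxActiveBlocksPerMultiprocessor`, which takes a `hipFunction_t`; whether `hKernel->get()` returns that type in the HIP adapter is an assumption that would need checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the device properties the adapter would query
// (compute unit count, max work-group size, local memory size).
struct DeviceLimits {
  uint32_t NumComputeUnits;
  size_t MaxWorkGroupSize;
  size_t MaxLocalMemBytes;
};

// Hypothetical hook returning the max number of resident groups per compute
// unit. Assumption: a HIP backend could implement it by calling
// hipModuleOccupancyMaxActiveBlocksPerMultiprocessor(&N, Func, BlockSize,
// DynSharedMemBytes) on a hipFunction_t.
using OccupancyQuery =
    std::function<int(size_t LocalWorkSize, size_t DynSharedMemBytes)>;

// Mirrors the control flow of the CUDA implementation quoted above.
uint32_t suggestMaxCooperativeGroupCount(const DeviceLimits &Limits,
                                         const OccupancyQuery &Query,
                                         size_t LocalWorkSize,
                                         size_t DynSharedMemBytes,
                                         bool ExceedsRegisterLimit) {
  const int PerCU = Query(LocalWorkSize, DynSharedMemBytes);
  assert(PerCU >= 0);

  if (PerCU == 0) {
    // The occupancy query reported no resident group per compute unit, so
    // check by hand whether a single work-group still fits on the device.
    const bool Fits = LocalWorkSize <= Limits.MaxWorkGroupSize &&
                      DynSharedMemBytes <= Limits.MaxLocalMemBytes &&
                      !ExceedsRegisterLimit;
    return Fits ? 1u : 0u;
  }

  // Total groups = compute units * resident groups per compute unit.
  return Limits.NumComputeUnits * static_cast<uint32_t>(PerCU);
}
```

With this split, the entry point would only need to build the query callable for its backend and pass in the result of its register check; the fallback and multiplication logic stays identical to the CUDA version.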
Is your feature request related to a problem? Please describe
https://github.com/oneapi-src/unified-runtime/pull/1796/changes