urKernelSuggestMaxCooperativeGroupCountExp for Hip #21803

@zjin-lcf

Is your feature request related to a problem? Please describe

A PR implementing urKernelSuggestMaxCooperativeGroupCountExp for CUDA has already been merged ("[CUDA] Implement urKernelSuggestMaxCooperativeGroupCountExp for Cuda", #1796); the related issue is oneapi-src/unified-runtime#1424.

However, simply porting the implementation shown below from CUDA to HIP does not work, because the kernel handle (hKernel->get()) is not a valid argument for hipOccupancyMaxActiveBlocksPerMultiprocessor.

https://github.com/oneapi-src/unified-runtime/pull/1796/changes

UR_APIEXPORT ur_result_t UR_APICALL urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, size_t localWorkSize,
    size_t dynamicSharedMemorySize, uint32_t *pGroupCountRet) {
  UR_ASSERT(hKernel, UR_RESULT_ERROR_INVALID_KERNEL);

  // We need to set the active current device for this kernel explicitly here,
  // because the occupancy querying API does not take device parameter.
  ur_device_handle_t Device = hKernel->getProgram()->getDevice();
  ScopedContext Active(Device);
  try {
    // We need to calculate max num of work-groups using per-device semantics.

    int MaxNumActiveGroupsPerCU{0};
    UR_CHECK_ERROR(cuOccupancyMaxActiveBlocksPerMultiprocessor(
        &MaxNumActiveGroupsPerCU, hKernel->get(), localWorkSize,
        dynamicSharedMemorySize));
    detail::ur::assertion(MaxNumActiveGroupsPerCU >= 0);
    // Handle the case where we can't have all SMs active with at least 1 group
    // per SM. In that case, the device is still able to run 1 work-group, hence
    // we will manually check if it is possible with the available HW resources.
    if (MaxNumActiveGroupsPerCU == 0) {
      size_t MaxWorkGroupSize{};
      urKernelGetGroupInfo(
          hKernel, Device, UR_KERNEL_GROUP_INFO_WORK_GROUP_SIZE,
          sizeof(MaxWorkGroupSize), &MaxWorkGroupSize, nullptr);
      size_t MaxLocalSizeBytes{};
      urDeviceGetInfo(Device, UR_DEVICE_INFO_LOCAL_MEM_SIZE,
                      sizeof(MaxLocalSizeBytes), &MaxLocalSizeBytes, nullptr);
      if (localWorkSize > MaxWorkGroupSize ||
          dynamicSharedMemorySize > MaxLocalSizeBytes ||
          hasExceededMaxRegistersPerBlock(Device, hKernel, localWorkSize))
        *pGroupCountRet = 0;
      else
        *pGroupCountRet = 1;
    } else {
      // Multiply by the number of SMs (CUs = compute units) on the device in
      // order to retrieve the total number of groups/blocks that can be
      // launched.
      *pGroupCountRet = Device->getNumComputeUnits() * MaxNumActiveGroupsPerCU;
    }
  } catch (ur_result_t Err) {
    return Err;
  }
  return UR_RESULT_SUCCESS;
}
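
For discussion, the group-count arithmetic in the function above can be separated from the occupancy query itself. The following is a minimal sketch, assuming the per-compute-unit occupancy and the device limits have already been obtained by whatever query works on the target backend. The struct `GroupCountInputs` and the function `suggestMaxCooperativeGroupCount` are hypothetical names used only for illustration; they are not part of Unified Runtime, CUDA, or HIP.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical bundle of values that the quoted function obtains from the
// runtime. All field names are illustrative assumptions.
struct GroupCountInputs {
  int maxActiveGroupsPerCU;            // result of the occupancy query
  std::uint32_t numComputeUnits;       // compute units (SMs) on the device
  std::size_t localWorkSize;           // requested work-group size
  std::size_t dynamicSharedMemorySize; // requested dynamic local memory, bytes
  std::size_t maxWorkGroupSize;        // kernel's maximum work-group size
  std::size_t maxLocalMemBytes;        // device local memory size, bytes
  bool exceedsMaxRegistersPerBlock;    // outcome of the register check
};

// Mirrors the branch structure of the quoted CUDA code without calling any
// backend API. The quoted code asserts that the occupancy value is
// non-negative; this sketch simply treats any non-positive value as zero
// occupancy and falls back to the single-group check.
std::uint32_t suggestMaxCooperativeGroupCount(const GroupCountInputs &in) {
  if (in.maxActiveGroupsPerCU > 0) {
    // Every compute unit can hold at least one group: scale by the CU count.
    return in.numComputeUnits *
           static_cast<std::uint32_t>(in.maxActiveGroupsPerCU);
  }
  // Not every compute unit can be kept busy. The device may still be able to
  // run one group, provided the request fits the hardware limits.
  const bool fits = in.localWorkSize <= in.maxWorkGroupSize &&
                    in.dynamicSharedMemorySize <= in.maxLocalMemBytes &&
                    !in.exceedsMaxRegistersPerBlock;
  return fits ? 1u : 0u;
}
```

Because the occupancy value is passed in as a plain integer, this part would stay the same for a HIP port; only the call that produces `maxActiveGroupsPerCU` from the kernel handle would need a HIP-specific replacement.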

Labels: enhancement (New feature or request)