urKernelSuggestMaxCooperativeGroupCountExp for Hip

### Is your feature request related to a problem? Please describe

The CUDA adapter already has this entry point: it was added by the merged PR "[CUDA] Implement urKernelSuggestMaxCooperativeGroupCountExp for Cuda" (#1796), and the related issue is https://github.com/oneapi-src/unified-runtime/issues/1424.

However, simply porting that implementation (shown below) from CUDA to HIP does not work, because the kernel handle returned by `hKernel->get()` is not a valid argument for `hipOccupancyMaxActiveBlocksPerMultiprocessor`.

https://github.com/oneapi-src/unified-runtime/pull/1796/changes

```cpp
UR_APIEXPORT ur_result_t UR_APICALL urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, size_t localWorkSize,
    size_t dynamicSharedMemorySize, uint32_t *pGroupCountRet) {
  UR_ASSERT(hKernel, UR_RESULT_ERROR_INVALID_KERNEL);

  // We need to set the active current device for this kernel explicitly here,
  // because the occupancy querying API does not take device parameter.
  ur_device_handle_t Device = hKernel->getProgram()->getDevice();
  ScopedContext Active(Device);
  try {
    // We need to calculate max num of work-groups using per-device semantics.

    int MaxNumActiveGroupsPerCU{0};
    UR_CHECK_ERROR(cuOccupancyMaxActiveBlocksPerMultiprocessor(
        &MaxNumActiveGroupsPerCU, hKernel->get(), localWorkSize,
        dynamicSharedMemorySize));
    detail::ur::assertion(MaxNumActiveGroupsPerCU >= 0);
    // Handle the case where we can't have all SMs active with at least 1 group
    // per SM. In that case, the device is still able to run 1 work-group, hence
    // we will manually check if it is possible with the available HW resources.
    if (MaxNumActiveGroupsPerCU == 0) {
      size_t MaxWorkGroupSize{};
      urKernelGetGroupInfo(
          hKernel, Device, UR_KERNEL_GROUP_INFO_WORK_GROUP_SIZE,
          sizeof(MaxWorkGroupSize), &MaxWorkGroupSize, nullptr);
      size_t MaxLocalSizeBytes{};
      urDeviceGetInfo(Device, UR_DEVICE_INFO_LOCAL_MEM_SIZE,
                      sizeof(MaxLocalSizeBytes), &MaxLocalSizeBytes, nullptr);
      if (localWorkSize > MaxWorkGroupSize ||
          dynamicSharedMemorySize > MaxLocalSizeBytes ||
          hasExceededMaxRegistersPerBlock(Device, hKernel, localWorkSize))
        *pGroupCountRet = 0;
      else
        *pGroupCountRet = 1;
    } else {
      // Multiply by the number of SMs (CUs = compute units) on the device in
      // order to retrieve the total number of groups/blocks that can be
      // launched.
      *pGroupCountRet = Device->getNumComputeUnits() * MaxNumActiveGroupsPerCU;
    }
  } catch (ur_result_t Err) {
    return Err;
  }
  return UR_RESULT_SUCCESS;
}
```
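To make it clearer which part of the CUDA code is backend-specific, here is a minimal sketch that isolates the group-count decision from the driver call. `GroupLimits` and `suggestGroupCount` are hypothetical names introduced only for illustration; they are not part of Unified Runtime. The per-CU occupancy value is assumed to come from whatever occupancy query the backend supports. For HIP, one candidate worth checking is `hipModuleOccupancyMaxActiveBlocksPerMultiprocessor`, which takes a `hipFunction_t`, but I have not verified that it works with the adapter's kernel handle.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical plain-data stand-in for the limits that the CUDA adapter
// queries through urKernelGetGroupInfo and urDeviceGetInfo.
struct GroupLimits {
  std::size_t maxWorkGroupSize;  // UR_KERNEL_GROUP_INFO_WORK_GROUP_SIZE
  std::size_t maxLocalSizeBytes; // UR_DEVICE_INFO_LOCAL_MEM_SIZE
  bool exceedsRegistersPerBlock; // outcome of a register-pressure check
};

// Hypothetical helper mirroring the decision made in the CUDA adapter:
// - if the occupancy query reports at least one group per CU, scale it by
//   the number of compute units;
// - otherwise fall back to 1 group if a single group still fits within the
//   hardware limits, or 0 if it does not.
inline std::uint32_t suggestGroupCount(int maxActiveGroupsPerCU,
                                       std::uint32_t numComputeUnits,
                                       std::size_t localWorkSize,
                                       std::size_t dynamicSharedMemorySize,
                                       const GroupLimits &limits) {
  if (maxActiveGroupsPerCU > 0) {
    return numComputeUnits * static_cast<std::uint32_t>(maxActiveGroupsPerCU);
  }
  const bool fits = localWorkSize <= limits.maxWorkGroupSize &&
                    dynamicSharedMemorySize <= limits.maxLocalSizeBytes &&
                    !limits.exceedsRegistersPerBlock;
  return fits ? 1u : 0u;
}
```

With this split, only the call that produces `maxActiveGroupsPerCU` (and the register check) would need a HIP-specific replacement; the rest of the logic could stay the same as in the CUDA adapter.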



### Describe the solution you would like

_No response_

### Describe alternatives you have considered

_No response_

### Additional context

_No response_
