@wpoely86
Contributor

If you build torque with GPU support and cgroups, it doesn't work: the whitelisting of devices in cgroups is not done correctly.

The core of the problem is the difference between `initialize_hwloc_topology` and `cg_initialize_hwloc_topology`. The code assumes that it is sufficient to call just one of them, which is not true: `read_all_devices` is only called from `cg_initialize_hwloc_topology`. I've merged the two functions into one, and the differences between them are now handled by macros. I'm not sure what the idea behind duplicating this function was. Copy-pasting code like that is only going to cause problems, as this issue shows.
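To illustrate the shape of the change, here is a minimal sketch in C of one initialization function whose build-specific part is guarded by a preprocessor conditional. It is not the actual patch: `DEMO_CGROUPS`, `struct demo_topology`, `demo_read_all_devices` and `demo_initialize_topology` are hypothetical stand-ins, and the real build flags and the exact condition under which `read_all_devices` must run may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature switch; the real torque build flag may be named differently. */
#ifndef DEMO_CGROUPS
#define DEMO_CGROUPS 1
#endif

/* Hypothetical stand-in for the topology state kept by the real code. */
struct demo_topology {
    int loaded;        /* set to 1 once the topology has been initialized */
    int devices_read;  /* set to 1 once devices have been enumerated */
};

/* Stand-in for read_all_devices: here it only records that it ran. */
static void demo_read_all_devices(struct demo_topology *topo)
{
    topo->devices_read = 1;
}

/* One initialization function for every build. The steps shared by both
 * former variants run unconditionally; the part that used to exist in only
 * one copy is compiled in when the feature switch is set. */
static int demo_initialize_topology(struct demo_topology *topo)
{
    assert(topo != NULL);

    /* Common setup, identical in both former variants. */
    topo->loaded = 1;
    topo->devices_read = 0;

#if DEMO_CGROUPS
    /* Build-specific step: enumerate devices so they can be whitelisted. */
    demo_read_all_devices(topo);
#endif

    return 0;
}
```

With a single function, every caller goes through the same path, so the device enumeration cannot be skipped just because a caller picked the other variant.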

Please backport this to 6.1.1.1.

Both functions differ only in a call to `read_all_devices`, and it was
assumed that it was sufficient to call only one of them. This is not
the case (because `read_all_devices` is required with GPUs and
cgroups).

The difference between both functions is now handled by macros.
@wpoely86
Contributor Author

@acvizi It would be nice if you could also merge this one. Without it we cannot get torque & GPUs to work.
