Slurm SPANK GPU Compute Mode plugin

The GPU Compute Mode SPANK plugin for Slurm lets users choose the compute mode of the GPUs allocated to their jobs.

Rationale

NVIDIA GPUs can be set to operate under different compute modes:

  • Default (shared): Multiple host threads can use the device at the same time.
  • Exclusive-process: Only one CUDA context may be created on the device across all processes in the system.
  • Prohibited: No CUDA context can be created on the device.

More information is available in the CUDA Programming Guide.

To ensure optimal performance and process isolation, and to avoid situations where multiple processes unintentionally run on the same GPU, it is often recommended to set the GPU compute mode to Exclusive-process.

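For reference, the compute mode can also be changed manually with nvidia-smi, which is essentially what this plugin automates for the GPUs allocated to a job. An illustrative command (requires administrative privileges; GPU index 0 is just an example):

$ nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
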
Unfortunately, some legacy applications need to run concurrent processes on a single GPU to function properly, and thus require the GPUs to be set to the Default (i.e. shared) compute mode.

Hence the need for a mechanism that would allow users to choose the compute mode of the GPUs their job will run on.

Installation

Requirements:

The Slurm SPANK GPU Compute Mode plugin is written in Lua and requires the slurm-spank-lua plugin to work.

Once the slurm-spank-lua plugin is installed and configured, the gpu_cmode.lua script can be dropped in the appropriate directory (by default, /etc/slurm/lua.d, or any other location specified in the plugstack.conf configuration file).

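For example, with the SPANK Lua plugin loaded from plugstack.conf, the configuration could look like the snippet below. The library path and argument syntax are illustrative and depend on how slurm-spank-lua was installed; refer to its documentation:

# /etc/slurm/plugstack.conf
required /usr/lib64/slurm/lua.so /etc/slurm/lua.d/*.lua
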
Note that both the Slurm SPANK Lua plugin and the GPU Compute mode Lua plugin will need to be present on both the submission host(s) (where srun/sbatch commands are executed) and the execution host(s) (where the job actually runs).

Note: The plugin defines a default GPU compute mode (exclusive), which is used to reset the GPUs at the end of a job. The default mode can be changed by editing the value of default_cmode in the script.

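The excerpt below is a sketch of what that setting looks like, not an exact copy of the script:

-- gpu_cmode.lua (excerpt, illustrative)
-- accepted values: "shared", "exclusive", "prohibited"
local default_cmode = "exclusive"
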
Usage

The Slurm SPANK GPU Compute Mode plugin introduces a new option to srun and sbatch: --gpu_cmode.

$ srun --help
[...]
      --gpu_cmode=<shared|exclusive|prohibited>
                  Set the GPU compute mode on the allocated GPUs to
                  shared, exclusive or prohibited. Default is
                  exclusive
[...]

Examples

Requesting Default compute mode

--gpu_cmode=shared

$ srun --gres gpu:1 --gpu_cmode=shared nvidia-smi --query-gpu=compute_mode --format=csv,noheader
Default

Requesting Exclusive-process compute mode

--gpu_cmode=exclusive

$ srun --gres gpu:1 --gpu_cmode=exclusive nvidia-smi --query-gpu=compute_mode --format=csv,noheader
Exclusive_Process

Requesting Prohibited compute mode

--gpu_cmode=prohibited

$ srun --gres gpu:1 --gpu_cmode=prohibited nvidia-smi --query-gpu=compute_mode --format=csv,noheader
Prohibited

Multi-GPU job

$ srun --gres gpu:4 --gpu_cmode=shared nvidia-smi --query-gpu=compute_mode --format=csv,noheader
Default
Default
Default
Default

Multi-node job

$ srun -l -N 2 --ntasks-per-node=1 --gres gpu:1 --gpu_cmode=shared nvidia-smi --query-gpu=compute_mode --format=csv,noheader
1: Default
0: Default

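Batch job

Since the option is also registered with sbatch, it can be set in a batch script with an #SBATCH directive. An illustrative script:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --gpu_cmode=shared

nvidia-smi --query-gpu=compute_mode --format=csv,noheader
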
NB: If the --gpu_cmode option is not used, no modification will be made to the current compute mode of the GPUs, and the site default will be used.