Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. This repository simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.
Requirements: Docker and Docker Compose
```shell
git clone https://github.com/giovtorres/slurm-docker-cluster.git
cd slurm-docker-cluster
cp .env.example .env  # optional: edit to change version, enable GPU, etc.

# Option A: Pull pre-built image from Docker Hub (fastest)
docker pull giovtorres/slurm-docker-cluster:latest
docker tag giovtorres/slurm-docker-cluster:latest slurm-docker-cluster:25.11.4

# Option B: Build from source
make build

# Then, start the cluster
make up
```

Supported Slurm versions: 25.11, 25.05
Supported architectures (auto-detected): AMD64, ARM64
Containers:
- mysql - Job and cluster database
- slurmdbd - Database daemon for accounting
- slurmctld - Controller for job scheduling
- slurmrestd - REST API daemon (HTTP/JSON access)
- c1, c2 - CPU compute nodes (dynamically scalable)
- g1 - (optional) GPU compute node with NVIDIA support (dynamically scalable)
- ondemand - (optional) Open OnDemand web portal
- elasticsearch - (optional) job completion indexing
- kibana - (optional) visualization for Elasticsearch data
Persistent volumes:
- Configuration (etc_slurm)
- Logs (var_log_slurm)
- Job files (slurm_jobdir)
- Database (var_lib_mysql)
- Authentication (etc_munge)
- OOD user home (home_ood)
```shell
# Access controller
make shell

# Inside controller:
sinfo                     # View cluster status
sbatch --wrap="hostname"  # Submit job
squeue                    # View queue
sacct                     # View accounting

# Or run example jobs
make run-examples
```

Compute nodes use Slurm's dynamic registration (slurmd -Z) and self-register with sequential hostnames (c1, c2, c3... for CPU; g1, g2... for GPU). Scale up or down at any time without rebuilding.
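Concretely, dynamic registration means each worker starts slurmd with the -Z flag instead of being pre-listed in slurm.conf. A conceptual sketch of the invocation (an assumption for illustration; the repo's actual entrypoint script may differ):

```shell
# Assumed worker startup command (illustrative, not this repo's entrypoint):
# -Z registers the node dynamically with the controller, --conf passes the
# node's attributes, and -N sets the sequential hostname.
start_cmd='slurmd -Z -N c3 --conf "Feature=cpu"'
echo "$start_cmd"
```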
```shell
# Scale to 5 CPU workers (default is 2)
make scale-cpu-workers N=5

# Or set the default count in .env
CPU_WORKER_COUNT=4
make up

# Scale to 3 GPU workers (requires GPU_ENABLE=true)
make scale-gpu-workers N=3
```

Verify with make status.
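Job submission works the same on scaled-up nodes. The sbatch --wrap form shown earlier covers one-liners; longer jobs usually go in a batch script, sketched here with illustrative values (the job name and resource requests are examples, and the default partition is assumed):

```shell
# Write a minimal batch script (job name and resources are illustrative)
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out
#SBATCH --ntasks=1
hostname
EOF

# Inside the controller (make shell) you would then submit it:
#   sbatch hello.sbatch
cat hello.sbatch
```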
Query cluster via REST API (version auto-detected: v0.0.44 for 25.11.x, v0.0.42 for 25.05.x):

```shell
# Get JWT token
JWT_TOKEN=$(docker exec slurmctld scontrol token 2>&1 | grep "SLURM_JWT=" | cut -d'=' -f2)

# Get nodes
docker exec slurmrestd curl -s -H "X-SLURM-USER-TOKEN: $JWT_TOKEN" \
  http://localhost:6820/slurm/v0.0.42/nodes | jq .nodes

# Get partitions
docker exec slurmrestd curl -s -H "X-SLURM-USER-TOKEN: $JWT_TOKEN" \
  http://localhost:6820/slurm/v0.0.42/partitions | jq .partitions
```

Enable job completion monitoring and visualization:
```shell
# 1. Setting ELASTICSEARCH_HOST in .env enables the monitoring profile
ELASTICSEARCH_HOST=http://elasticsearch:9200

# 2. Start cluster (monitoring auto-enabled)
make up

# 3. Access Kibana at http://localhost:5601
# After loading, click: Elasticsearch → Index Management → slurm → Discover index

# 4. Query job completions directly
docker exec elasticsearch curl -s "http://localhost:9200/slurm/_search?pretty"

# Test monitoring
make test-monitoring
```

Indexed data: Job ID, user, partition, state, times, nodes, exit code
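Job completion documents can also be filtered server-side instead of fetching everything. A sketch of a query body (the field names username and @end are assumptions based on the indexed fields above; verify them against a real document first):

```shell
# Build a hypothetical Elasticsearch query: completions for one user,
# newest first. Field names are assumptions -- check a sample document.
cat > query.json <<'EOF'
{
  "query": { "term": { "username": "root" } },
  "sort": [ { "@end": { "order": "desc" } } ]
}
EOF

# To run it against the cluster:
#   docker exec elasticsearch curl -s -XPOST \
#     "http://localhost:9200/slurm/_search?pretty" \
#     -H "Content-Type: application/json" -d @query.json
cat query.json
```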
Enable the Open OnDemand web portal for browser-based cluster access (submit jobs, manage files, and monitor the queue without the command line):
```shell
# Enable in .env
OOD_ENABLE=true

# Build and start (OOD profile auto-enabled)
make build
make up

# Access at http://localhost:8080
# Login: ood@localhost / password

# Test OOD integration
make test-ondemand
```

OOD connects to the Slurm cluster as the ood user and shares a home directory across compute nodes for job I/O. OOD config files live in ood/. Changes require make rebuild.
Note: This setup uses Dex with a static password for development purposes. See the OOD authentication docs.
Enable optional NVIDIA GPU support using NVIDIA's official CUDA base images:
```shell
# 1. One-time host setup (add NVIDIA repo and install nvidia-container-toolkit)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 2. Enable GPU in .env (uses NVIDIA's official CUDA base images)
GPU_ENABLE=true
BUILDER_BASE=nvidia/cuda:13.1.1-devel-rockylinux9
RUNTIME_BASE=nvidia/cuda:13.1.1-base-rockylinux9

# 3. Build with GPU support
make rebuild

# 4. Verify GPU detection
docker exec g1 nvidia-smi

# Test GPU functionality
make test-gpu
```

Note: GPU testing is not included in CI (GitHub-hosted runners have no GPUs). Run make test-gpu manually on a host with an NVIDIA GPU and nvidia-container-toolkit installed.
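For background, Slurm exposes GPUs to jobs as generic resources (GRES), wired up through gres.conf on the node. A minimal illustration of the shape (this repo generates its own configuration; the line below is an example, not its actual file):

```shell
# Illustrative gres.conf entry (not this repo's generated config):
# it maps a GRES named "gpu" on node g1 to the NVIDIA device file.
cat > gres.conf.example <<'EOF'
NodeName=g1 Name=gpu File=/dev/nvidia0
EOF
cat gres.conf.example

# Jobs then request the GPU explicitly, e.g.:
#   sbatch --gres=gpu:1 --wrap="nvidia-smi"
```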
Spack is included in the image and integrates with Lmod so installed packages appear immediately as modules. All nodes share the same Spack and module tree.
```shell
make shell
spack install python@3.14
module avail
module load python/3.14.0
python --version
```

Modules are also available in batch jobs without any extra setup:

```shell
sbatch --wrap="module load python/3.14.0 && python3 --version"
```

To add a custom modulefile outside of Spack, drop a .lua file into the opt_modulefiles volume; it appears immediately on all nodes without a restart:

```shell
docker exec slurmctld mkdir -p /opt/modulefiles/myapp
docker cp myapp/1.0.lua slurmctld:/opt/modulefiles/myapp/1.0.lua
module avail
```

Run make to see all available commands. Common ones:
```shell
make down     # Stop cluster (keeps data)
make clean    # Remove all containers and volumes
make rebuild  # Clean, rebuild, and restart
make logs     # View container logs
```

Pre-built multi-arch images (amd64 + arm64) are published on each GitHub release:

```shell
# CPU images
docker pull giovtorres/slurm-docker-cluster:latest
docker pull giovtorres/slurm-docker-cluster:25.11.4        # latest build for this Slurm version
docker pull giovtorres/slurm-docker-cluster:25.11.4-2.1.0  # pinned to a specific release

# GPU images (built on nvidia/cuda base)
docker pull giovtorres/slurm-docker-cluster:latest-gpu
docker pull giovtorres/slurm-docker-cluster:25.11.4-gpu
docker pull giovtorres/slurm-docker-cluster:25.11.4-gpu-2.1.0
```

Switch between supported Slurm versions:

```shell
make set-version VER=25.05.7  # Switch Slurm version
make version                  # Show current version
make build-all                # Build all supported versions
make test-all                 # Test all versions
```

Update the Slurm configuration in whichever way fits:

```shell
# Live edit (persists across restarts)
docker exec -it slurmctld vi /etc/slurm/slurm.conf
make reload-slurm

# Push local changes
vi config/25.05/slurm.conf
make update-slurm FILES="slurm.conf"

# Permanent changes
make rebuild
```

Build for another architecture:

```shell
# Cross-platform build (uses QEMU emulation)
docker buildx build --platform linux/arm64 \
  --build-arg SLURM_VERSION=25.05.7 \
  --load -t slurm-docker-cluster:25.05.7 .
```

- Commands: Run make help for all available commands
- Examples: Job scripts in the examples/ directory
Contributions are welcome! Fork this repo, create a branch, and submit a pull request.
This project is licensed under the MIT License.