Slurm Docker Cluster

Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. This repository simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.

๐Ÿ Quick Start

Requirements: Docker and Docker Compose

git clone https://github.com/giovtorres/slurm-docker-cluster.git
cd slurm-docker-cluster
cp .env.example .env    # optional: edit to change version, enable GPU, etc.

# Option A: Pull pre-built image from Docker Hub (fastest)
docker pull giovtorres/slurm-docker-cluster:latest
docker tag giovtorres/slurm-docker-cluster:latest slurm-docker-cluster:25.11.4

# Option B: Build from source
make build

# Then, start the cluster
make up

Supported Slurm versions: 25.11, 25.05

Supported architectures (auto-detected): AMD64, ARM64
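Most of the optional features below are toggled through the `.env` file. A sketch of commonly edited variables, using only names that appear elsewhere in this README (the values shown are illustrative defaults, not authoritative):

```shell
# .env -- illustrative sketch; see .env.example for the full list
CPU_WORKER_COUNT=2      # default number of CPU compute nodes
GPU_ENABLE=false        # set to true to build and start the GPU node (g1)
OOD_ENABLE=true        # set to true to enable the Open OnDemand portal
#ELASTICSEARCH_HOST=http://elasticsearch:9200   # uncomment to enable monitoring
```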

📦 What's Included

Containers:

  • mysql - Job and cluster database
  • slurmdbd - Database daemon for accounting
  • slurmctld - Controller for job scheduling
  • slurmrestd - REST API daemon (HTTP/JSON access)
  • c1, c2 - CPU compute nodes (dynamically scalable)
  • g1 - (optional) GPU compute node with NVIDIA support (dynamically scalable)
  • ondemand - (optional) Open OnDemand web portal
  • elasticsearch - (optional) job completion indexing
  • kibana - (optional) visualization for Elasticsearch data

Persistent volumes:

  • Configuration (etc_slurm)
  • Logs (var_log_slurm)
  • Job files (slurm_jobdir)
  • Database (var_lib_mysql)
  • Authentication (etc_munge)
  • OOD user home (home_ood)

🖥️ Using the Cluster

# Access controller
make shell

# Inside controller:
sinfo                          # View cluster status
sbatch --wrap="hostname"       # Submit job
squeue                         # View queue
sacct                          # View accounting

# Or run example jobs
make run-examples
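Beyond `--wrap` one-liners, jobs are usually submitted as batch scripts. A minimal sketch (the filename `hello.sh` and the directives are illustrative; from inside the controller, submit it with `sbatch hello.sh`):

```shell
#!/bin/bash
#SBATCH --job-name=hello        # job name shown in squeue
#SBATCH --output=hello-%j.out   # %j expands to the job ID
#SBATCH --ntasks=1              # a single task

hostname                        # runs on whichever compute node Slurm allocates
```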

📈 Scaling

Compute nodes use Slurm's dynamic registration (slurmd -Z) and self-register with sequential hostnames (c1, c2, c3... for CPU; g1, g2... for GPU). Scale up or down at any time without rebuilding.

Scale CPU Workers

# Scale to 5 CPU workers (default is 2)
make scale-cpu-workers N=5

# Or set the default count in .env
CPU_WORKER_COUNT=4
make up

Scale GPU Workers

# Scale to 3 GPU workers (requires GPU_ENABLE=true)
make scale-gpu-workers N=3

Verify with make status.

📊 Monitoring

REST API

Query cluster via REST API (version auto-detected: v0.0.44 for 25.11.x, v0.0.42 for 25.05.x):

# Get JWT Token
JWT_TOKEN=$(docker exec slurmctld scontrol token 2>&1 | grep "SLURM_JWT=" | cut -d'=' -f2)

# Get nodes
docker exec slurmrestd curl -s -H "X-SLURM-USER-TOKEN: $JWT_TOKEN" \
  http://localhost:6820/slurm/v0.0.42/nodes | jq .nodes

# Get partitions
docker exec slurmrestd curl -s -H "X-SLURM-USER-TOKEN: $JWT_TOKEN" \
  http://localhost:6820/slurm/v0.0.42/partitions | jq .partitions
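The token-extraction pipeline above can be sanity-checked without a running cluster by substituting a hypothetical sample of the `scontrol token` output (not a real token). Note `-f2-` rather than `-f2`: it keeps everything after the first `=`, so a token containing further `=` characters (e.g. base64 padding) is not truncated:

```shell
# Hypothetical sample line as printed by `scontrol token`
sample='SLURM_JWT=eyJhbGciOiJIUzI1NiJ9.e30.sig=='

# Same pipeline as above, with -f2- so the trailing '==' padding survives
JWT_TOKEN=$(printf '%s\n' "$sample" | grep "SLURM_JWT=" | cut -d'=' -f2-)
echo "$JWT_TOKEN"    # -> eyJhbGciOiJIUzI1NiJ9.e30.sig==
```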

Elasticsearch and Kibana (Optional)

Enable job completion monitoring and visualization:

# 1. Setting ELASTICSEARCH_HOST in .env enables the monitoring profile
ELASTICSEARCH_HOST=http://elasticsearch:9200

# 2. Start cluster (monitoring auto-enabled)
make up

# 3. Access Kibana at http://localhost:5601
# After loading, click: Elasticsearch → Index Management → slurm → Discover index

# 4. Query job completions directly
docker exec elasticsearch curl -s "http://localhost:9200/slurm/_search?pretty"

# Test monitoring
make test-monitoring

Indexed data: Job ID, user, partition, state, times, nodes, exit code

🌐 Open OnDemand (Optional)

Enable the Open OnDemand web portal for browser-based cluster access, letting you submit jobs, manage files, and monitor the queue without the command line:

# Enable in .env
OOD_ENABLE=true

# Build and start (OOD profile auto-enabled)
make build
make up

# Access at http://localhost:8080
# Login: ood@localhost / password

# Test OOD integration
make test-ondemand

OOD connects to the Slurm cluster as the ood user and shares a home directory across compute nodes for job I/O. OOD config files live in ood/; changes require make rebuild.

Note: This setup uses Dex with a static password for development purposes. See the OOD authentication docs.

🎮 GPU Support (NVIDIA)

Enable optional NVIDIA GPU support using NVIDIA's official CUDA base images:

# 1. One-time host setup (add NVIDIA repo and install nvidia-container-toolkit)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 2. Enable GPU in .env (uses NVIDIA's official CUDA base images)
GPU_ENABLE=true
BUILDER_BASE=nvidia/cuda:13.1.1-devel-rockylinux9
RUNTIME_BASE=nvidia/cuda:13.1.1-base-rockylinux9

# 3. Build with GPU support
make rebuild

# 4. Verify GPU detection
docker exec g1 nvidia-smi

# Test GPU functionality
make test-gpu

Note: GPU testing is not included in CI (GitHub-hosted runners have no GPUs). Run make test-gpu manually on a host with an NVIDIA GPU and nvidia-container-toolkit installed.

📦 Software Installation

Spack is included in the image and integrates with Lmod so installed packages appear immediately as modules. All nodes share the same Spack and module tree.

make shell

spack install python@3.14
module avail
module load python/3.14.0
python --version

Modules are also available in batch jobs without any extra setup:

sbatch --wrap="module load python/3.14.0 && python3 --version"

To add a custom modulefile outside of Spack, drop a .lua file into the opt_modulefiles volume โ€” it appears immediately on all nodes without a restart:

docker exec slurmctld mkdir -p /opt/modulefiles/myapp
docker cp myapp/1.0.lua slurmctld:/opt/modulefiles/myapp/1.0.lua
module avail
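For reference, a minimal sketch of what such a `.lua` modulefile might contain, using standard Lmod functions; the install prefix `/opt/myapp/1.0` is a hypothetical example, not a path shipped in the image:

```lua
-- myapp/1.0.lua -- illustrative Lmod modulefile (hypothetical install prefix)
help([[MyApp 1.0 -- example application]])
whatis("Name: myapp")
whatis("Version: 1.0")
prepend_path("PATH", "/opt/myapp/1.0/bin")
prepend_path("LD_LIBRARY_PATH", "/opt/myapp/1.0/lib")
```

After copying it in as shown above, `module load myapp/1.0` prepends those directories to the environment on any node.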

🔄 Cluster Management

Run make to see all available commands. Common ones:

make down     # Stop cluster (keeps data)
make clean    # Remove all containers and volumes
make rebuild  # Clean, rebuild, and restart
make logs     # View container logs

๐Ÿณ Docker Hub

Pre-built multi-arch images (amd64 + arm64) are published on each GitHub release:

# CPU images
docker pull giovtorres/slurm-docker-cluster:latest
docker pull giovtorres/slurm-docker-cluster:25.11.4          # latest build for this Slurm version
docker pull giovtorres/slurm-docker-cluster:25.11.4-2.1.0   # pinned to a specific release

# GPU images (built on nvidia/cuda base)
docker pull giovtorres/slurm-docker-cluster:latest-gpu
docker pull giovtorres/slurm-docker-cluster:25.11.4-gpu
docker pull giovtorres/slurm-docker-cluster:25.11.4-gpu-2.1.0

⚙️ Advanced

Version Management

make set-version VER=25.05.7   # Switch Slurm version
make version                   # Show current version
make build-all                 # Build all supported versions
make test-all                  # Test all versions

Configuration Updates

# Live edit (persists across restarts)
docker exec -it slurmctld vi /etc/slurm/slurm.conf
make reload-slurm

# Push local changes
vi config/25.05/slurm.conf
make update-slurm FILES="slurm.conf"

# Permanent changes
make rebuild

Multi-Architecture Builds

# Cross-platform build (uses QEMU emulation)
docker buildx build --platform linux/arm64 \
  --build-arg SLURM_VERSION=25.05.7 \
  --load -t slurm-docker-cluster:25.05.7 .

📚 Documentation

  • Commands: Run make help for all available commands
  • Examples: Job scripts in examples/ directory

🤝 Contributing

Contributions are welcome! Fork this repo, create a branch, and submit a pull request.

📄 License

This project is licensed under the MIT License.