Slurm Docker Cluster

Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. This repository simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.

๐Ÿ Quick Start

Requirements: Docker and Docker Compose

git clone https://github.com/giovtorres/slurm-docker-cluster.git
cd slurm-docker-cluster
cp .env.example .env    # optional: edit to change version, enable GPU, etc.

# Option A: Pull pre-built image from Docker Hub (fastest)
docker pull giovtorres/slurm-docker-cluster:latest
docker tag giovtorres/slurm-docker-cluster:latest slurm-docker-cluster:25.11.4

# Option B: Build from source
make build

# Then, start the cluster
make up

Supported Slurm versions: 25.11, 25.05

Supported architectures (auto-detected): AMD64, ARM64
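Most of the optional features below are toggled through the `.env` file. A sketch of commonly edited variables, using only names that appear elsewhere in this README (the values shown are illustrative defaults, not authoritative):

```shell
# .env -- illustrative sketch; see .env.example for the full list
CPU_WORKER_COUNT=2      # default number of CPU compute nodes
GPU_ENABLE=false        # set to true to build and start the GPU node (g1)
OOD_ENABLE=true        # set to true to enable the Open OnDemand portal
#ELASTICSEARCH_HOST=http://elasticsearch:9200   # uncomment to enable monitoring
```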

📦 What's Included

Containers:

  • mysql - Job and cluster database
  • slurmdbd - Database daemon for accounting
  • slurmctld - Controller for job scheduling
  • slurmrestd - REST API daemon (HTTP/JSON access)
  • c1, c2 - CPU compute nodes (dynamically scalable)
  • g1 - (optional) GPU compute node with NVIDIA support (dynamically scalable)
  • ondemand - (optional) Open OnDemand web portal
  • elasticsearch - (optional) job completion indexing
  • kibana - (optional) visualization for Elasticsearch data

Persistent volumes:

  • Configuration (etc_slurm)
  • Logs (var_log_slurm)
  • Job files (slurm_jobdir)
  • Database (var_lib_mysql)
  • Authentication (etc_munge)
  • OOD user home (home_ood)

🖥️ Using the Cluster

# Access controller
make shell

# Inside controller:
sinfo                          # View cluster status
sbatch --wrap="hostname"       # Submit job
squeue                         # View queue
sacct                          # View accounting

# Or run example jobs
make run-examples
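Beyond `--wrap` one-liners, jobs are usually submitted as batch scripts. A minimal sketch (the filename `hello.sh` and the directives are illustrative; from inside the controller, submit it with `sbatch hello.sh`):

```shell
#!/bin/bash
#SBATCH --job-name=hello        # job name shown in squeue
#SBATCH --output=hello-%j.out   # %j expands to the job ID
#SBATCH --ntasks=1              # a single task

hostname                        # runs on whichever compute node Slurm allocates
```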

📈 Scaling

Compute nodes use Slurm's dynamic registration (slurmd -Z) and self-register with sequential hostnames (c1, c2, c3... for CPU; g1, g2... for GPU). Scale up or down at any time without rebuilding.

Scale CPU Workers

# Scale to 5 CPU workers (default is 2)
make scale-cpu-workers N=5

# Or set the default count in .env
CPU_WORKER_COUNT=4
make up

Scale GPU Workers

# Scale to 3 GPU workers (requires GPU_ENABLE=true)
make scale-gpu-workers N=3

Verify with make status.

📊 Monitoring

REST API

Query cluster via REST API (version auto-detected: v0.0.44 for 25.11.x, v0.0.42 for 25.05.x):

# Get JWT Token
JWT_TOKEN=$(docker exec slurmctld scontrol token 2>&1 | grep "SLURM_JWT=" | cut -d'=' -f2)

# Get nodes
docker exec slurmrestd curl -s -H "X-SLURM-USER-TOKEN: $JWT_TOKEN" \
  http://localhost:6820/slurm/v0.0.42/nodes | jq .nodes

# Get partitions
docker exec slurmrestd curl -s -H "X-SLURM-USER-TOKEN: $JWT_TOKEN" \
  http://localhost:6820/slurm/v0.0.42/partitions | jq .partitions
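The token-extraction pipeline above can be sanity-checked without a running cluster by substituting a hypothetical sample of the `scontrol token` output (not a real token). Note `-f2-` rather than `-f2`: it keeps everything after the first `=`, so a token containing further `=` characters (e.g. base64 padding) is not truncated:

```shell
# Hypothetical sample line as printed by `scontrol token`
sample='SLURM_JWT=eyJhbGciOiJIUzI1NiJ9.e30.sig=='

# Same pipeline as above, with -f2- so the trailing '==' padding survives
JWT_TOKEN=$(printf '%s\n' "$sample" | grep "SLURM_JWT=" | cut -d'=' -f2-)
echo "$JWT_TOKEN"    # -> eyJhbGciOiJIUzI1NiJ9.e30.sig==
```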

Elasticsearch and Kibana (Optional)

Enable job completion monitoring and visualization:

# 1. Setting ELASTICSEARCH_HOST in .env enables the monitoring profile
ELASTICSEARCH_HOST=http://elasticsearch:9200

# 2. Start cluster (monitoring auto-enabled)
make up

# 3. Access Kibana at http://localhost:5601
# After loading, click: Elasticsearch → Index Management → slurm → Discover index

# 4. Query job completions directly
docker exec elasticsearch curl -s "http://localhost:9200/slurm/_search?pretty"

# Test monitoring
make test-monitoring

Indexed data: Job ID, user, partition, state, times, nodes, exit code

🌐 Open OnDemand (Optional)

Enable the Open OnDemand web portal for browser-based cluster access, letting you submit jobs, manage files, and monitor the queue without the command line:

# Enable in .env
OOD_ENABLE=true

# Build and start (OOD profile auto-enabled)
make build
make up

# Access at http://localhost:8080
# Login: ood@localhost / password

# Test OOD integration
make test-ondemand

OOD connects to the Slurm cluster as the ood user and shares a home directory across compute nodes for job I/O. OOD config files live in ood/; changes require make rebuild.

Note: This setup uses Dex with a static password for development purposes. See the OOD authentication docs.

🎮 GPU Support (NVIDIA)

Enable optional NVIDIA GPU support using NVIDIA's official CUDA base images:

# 1. One-time host setup (add NVIDIA repo and install nvidia-container-toolkit)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 2. Enable GPU in .env (uses NVIDIA's official CUDA base images)
GPU_ENABLE=true
BUILDER_BASE=nvidia/cuda:13.1.1-devel-rockylinux9
RUNTIME_BASE=nvidia/cuda:13.1.1-base-rockylinux9

# 3. Build with GPU support
make rebuild

# 4. Verify GPU detection
docker exec g1 nvidia-smi

# Test GPU functionality
make test-gpu

Note: GPU testing is not included in CI (GitHub-hosted runners have no GPUs). Run make test-gpu manually on a host with an NVIDIA GPU and nvidia-container-toolkit installed.

📦 Software Installation

Spack is included in the image and integrates with Lmod so installed packages appear immediately as modules. All nodes share the same Spack and module tree.

make shell

spack install python@3.14
module avail
module load python/3.14.0
python --version

Modules are also available in batch jobs without any extra setup:

sbatch --wrap="module load python/3.14.0 && python3 --version"

To add a custom modulefile outside of Spack, drop a .lua file into the opt_modulefiles volume โ€” it appears immediately on all nodes without a restart:

docker exec slurmctld mkdir -p /opt/modulefiles/myapp
docker cp myapp/1.0.lua slurmctld:/opt/modulefiles/myapp/1.0.lua
module avail
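For reference, a minimal sketch of what such a `.lua` modulefile might contain, using standard Lmod functions; the install prefix `/opt/myapp/1.0` is a hypothetical example, not a path shipped in the image:

```lua
-- myapp/1.0.lua -- illustrative Lmod modulefile (hypothetical install prefix)
help([[MyApp 1.0 -- example application]])
whatis("Name: myapp")
whatis("Version: 1.0")
prepend_path("PATH", "/opt/myapp/1.0/bin")
prepend_path("LD_LIBRARY_PATH", "/opt/myapp/1.0/lib")
```

After copying it in as shown above, `module load myapp/1.0` prepends those directories to the environment on any node.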

🔄 Cluster Management

Run make to see all available commands. Common ones:

make down     # Stop cluster (keeps data)
make clean    # Remove all containers and volumes
make rebuild  # Clean, rebuild, and restart
make logs     # View container logs

๐Ÿณ Docker Hub

Pre-built multi-arch images (amd64 + arm64) are published on each GitHub release:

# CPU images
docker pull giovtorres/slurm-docker-cluster:latest
docker pull giovtorres/slurm-docker-cluster:25.11.4          # latest build for this Slurm version
docker pull giovtorres/slurm-docker-cluster:25.11.4-2.1.0   # pinned to a specific release

# GPU images (built on nvidia/cuda base)
docker pull giovtorres/slurm-docker-cluster:latest-gpu
docker pull giovtorres/slurm-docker-cluster:25.11.4-gpu
docker pull giovtorres/slurm-docker-cluster:25.11.4-gpu-2.1.0

⚙️ Advanced

Version Management

make set-version VER=25.05.7   # Switch Slurm version
make version                   # Show current version
make build-all                 # Build all supported versions
make test-all                  # Test all versions

Configuration Updates

# Live edit (persists across restarts)
docker exec -it slurmctld vi /etc/slurm/slurm.conf
make reload-slurm

# Push local changes
vi config/25.05/slurm.conf
make update-slurm FILES="slurm.conf"

# Permanent changes
make rebuild

Multi-Architecture Builds

# Cross-platform build (uses QEMU emulation)
docker buildx build --platform linux/arm64 \
  --build-arg SLURM_VERSION=25.05.7 \
  --load -t slurm-docker-cluster:25.05.7 .

📚 Documentation

  • Commands: Run make help for all available commands
  • Examples: Job scripts in examples/ directory

🤝 Contributing

Contributions are welcome! Fork this repo, create a branch, and submit a pull request.

📄 License

This project is licensed under the MIT License.