slurm installation and administration

hokiegeek2 edited this page Oct 7, 2022 · 8 revisions

Slurm Installation on Ubuntu

Background

The following instructions detail installing the Slurm stack--including slurmrestd--on Ubuntu, starting with Slurm version slurm-21.08.5. These instructions are based upon this excellent example Slurm installation.

Installing Slurm Dependencies

There are two sets of Slurm dependencies that must be installed: (1) those for the standard Slurm stack and (2) those for slurmrestd.

Install standard Slurm dependencies

The following commands are executed as root to install all dependencies for the basic Slurm stack.

# 1. Install all base dependencies
apt-get update && apt install sudo git gcc make ruby ruby-dev python3 \
    libpam0g-dev libmariadb-client-lgpl-dev libmysqlclient-dev wget vim curl -y

# 2. Install gem needed for dpkg
gem install fpm

# 3. Install munge
apt-get install libmunge-dev libmunge2 munge -y

# 4. Install hdf5
apt-get install libhdf5-serial-dev hdf5-tools -y

# 5. Install MariaDB for slurm accounting (slurmctld node only)
apt-get install mariadb-server -y

Install slurmrestd dependencies

apt-get install cmake libhttp-parser-dev libyaml-dev libjson-c-dev autoconf automake \
        autotools-dev libtool  pkg-config libjansson-dev -y

git clone --depth 1 --single-branch -b v1.12.0 https://github.com/benmcollins/libjwt.git libjwt
cd libjwt
autoreconf --force --install
./configure --prefix=/usr/
make -j
make install

Add slurm user (all nodes)

useradd -m -u 1004 slurm

It is critical to specify the slurm user id, because if the id is 1001, the slurmctld pings to slurmd processes will fail.
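
Since the UID matters here, it can help to confirm that the account on each node really has the expected UID. The following is a minimal sketch: `uid_matches` is a hypothetical helper name, and 1004 is simply the UID used in the useradd command above.

```shell
# Hypothetical helper: succeed only if the user exists and has the expected UID.
uid_matches() {
    uid_user="$1"
    uid_expected="$2"
    # id -u prints the numeric UID and fails if the user does not exist
    uid_actual="$(id -u "$uid_user" 2>/dev/null)" || return 1
    [ "$uid_actual" = "$uid_expected" ]
}

# Example (run on each node):
#   uid_matches slurm 1004 || echo "slurm UID mismatch on $(hostname)" >&2
```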

Add slurmrestd user (slurmrestd node only)

For slurm version 21.08.5 and beyond, the slurmrestd daemon must be run as a user other than slurm or root. Consequently, another user has to be created and configured in the slurmrestd.service file. In my case, the slurmrestd service is executed as the slurmrestd user.

useradd slurmrestd

Start dependent services

# on all nodes
service munge start

# on slurmctld host only
service mysql start

Install Slurm

mkdir /storage
cd /storage

wget https://download.schedmd.com/slurm/slurm-21.08.5.tar.bz2
tar xvf slurm-21.08.5.tar.bz2
cd slurm-21.08.5

./configure --prefix=/storage/slurm-build --sysconfdir=/etc/slurm --enable-pam \
    --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm \
    --with-http-parser=/usr/ --with-yaml=/usr/ --with-jwt=/usr/ \
    --enable-slurmrestd 

make 
make contrib
make install

cd /storage
fpm -s dir -t deb -v 1.0 -n slurm-21.08.5 --prefix=/usr -C /storage/slurm-build .
dpkg -i slurm-21.08.5_1.0_amd64.deb

New for Slurm 22.05.x: Build and Install Instructions

configure Command

The new configure command is as follows:

./configure --prefix=/storage/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --with-http-parser=/usr/ --with-yaml=/usr/ --with-jwt=/usr/

Note that the --without-shared-libslurm argument is now omitted. If --without-shared-libslurm remains in the configure parameter list, a difficult-to-debug link error such as the one below is produced:

libtool: link: ranlib .libs/libslurmrest_ref.a
libtool: link: ( cd ".libs" && rm -f "libslurmrest_ref.la" && ln -s "../libslurmrest_ref.la" "libslurmrest_ref.la" )
/bin/bash ../../libtool  --tag=CC   --mode=link gcc  -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -export-dynamic   -o slurmrestd http.o operations.o slurmrestd.o rest_auth.o ../../src/api/libslurm.o -ldl -L/usr/lib64 -lhttp_parser libslurmrest_ref.la -lpthread -lm -lresolv 
libtool: link: gcc -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -o slurmrestd http.o operations.o slurmrestd.o rest_auth.o ../../src/api/libslurm.o -Wl,--export-dynamic  -ldl -L/usr/lib64 -lhttp_parser ./.libs/libslurmrest_ref.a -lpthread -lm -lresolv -pthread
/usr/bin/ld: ../../src/api/libslurm.o: in function `parse_host_port':
/storage/slurm-22.05.3/src/common/http.c:308: multiple definition of `parse_host_port'; http.o:/storage/slurm-22.05.3/src/slurmrestd/http.c:803: first defined here
/usr/bin/ld: ../../src/api/libslurm.o: in function `free_parse_host_port':
/storage/slurm-22.05.3/src/common/http.c:314: multiple definition of `free_parse_host_port'; http.o:/storage/slurm-22.05.3/src/slurmrestd/http.c:856: first defined here
collect2: error: ld returned 1 exit status
make[4]: *** [Makefile:642: slurmrestd] Error 1
make[4]: Leaving directory '/storage/slurm-22.05.3/src/slurmrestd'
make[3]: *** [Makefile:695: all-recursive] Error 1
make[3]: Leaving directory '/storage/slurm-22.05.3/src/slurmrestd'
make[2]: *** [Makefile:545: all-recursive] Error 1
make[2]: Leaving directory '/storage/slurm-22.05.3/src'
make[1]: *** [Makefile:624: all-recursive] Error 1
make[1]: Leaving directory '/storage/slurm-22.05.3'

Also of note, slurmrestd is now built by default and must be explicitly disabled via the --disable-slurmrestd configure parameter.
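
The difference between the two configure invocations on this page can be captured in a small wrapper. This is only a sketch: `slurm_configure_args` is a hypothetical helper, and it assumes the flag change applies to every release from 22.05.0 onward, which is the only boundary this page documents.

```shell
# Hypothetical helper: print the configure arguments used on this page for a
# given Slurm version. Assumption: the flag change starts at 22.05.0.
slurm_configure_args() {
    cfg_version="$1"
    cfg_major="${cfg_version%%.*}"
    cfg_rest="${cfg_version#*.}"
    cfg_minor="${cfg_rest%%.*}"
    # Strip leading zeros so "08" is compared as the decimal number 8
    cfg_minor="$(printf '%s' "$cfg_minor" | sed 's/^0*//')"
    cfg_minor="${cfg_minor:-0}"

    cfg_args="--prefix=/storage/slurm-build --sysconfdir=/etc/slurm --enable-pam"
    cfg_args="$cfg_args --with-pam_dir=/lib/x86_64-linux-gnu/security/"
    cfg_args="$cfg_args --with-http-parser=/usr/ --with-yaml=/usr/ --with-jwt=/usr/"

    if [ "$cfg_major" -lt 22 ] || { [ "$cfg_major" -eq 22 ] && [ "$cfg_minor" -lt 5 ]; }; then
        # Older releases (21.08.5 on this page) take the two extra flags
        cfg_args="$cfg_args --without-shared-libslurm --enable-slurmrestd"
    fi
    printf '%s\n' "$cfg_args"
}

# Example (from the unpacked source directory):
#   ./configure $(slurm_configure_args 22.05.3)
```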

Confirming slurmrestd Authentication Plugins are Present:

root@robinson:/storage# slurmrestd -a list
slurmrestd: Possible REST authentication plugins:
slurmrestd: rest_auth/local
slurmrestd: rest_auth/jwt

New for Ubuntu 22.04

slurmctld/slurmd Dependency Installation

The mariadb installation is different in Ubuntu 22.04. Consequently, the Slurm dependency installation on Ubuntu 22.04 is as follows:

apt-get update && apt install sudo git gcc make ruby ruby-dev python3 libpam0g-dev libmysqlclient-dev mariadb-client wget vim curl -y

Create required config as well as runtime directories and set permissions

mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld \
    /var/spool/slurm/d /var/log/slurm /var/run/slurm
chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /var/run/slurm
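
A quick way to confirm the result of the two commands above is to check that every directory exists and has the intended owner. The sketch below uses a hypothetical helper, `dirs_owned_by`, and relies on GNU `stat -c`, which is what Ubuntu ships.

```shell
# Hypothetical helper: check that each directory exists and is owned by the UID.
dirs_owned_by() {
    own_uid="$1"
    shift
    own_status=0
    for own_dir in "$@"; do
        if [ ! -d "$own_dir" ]; then
            echo "missing directory: $own_dir" >&2
            own_status=1
        elif [ "$(stat -c '%u' "$own_dir")" != "$own_uid" ]; then
            echo "unexpected owner: $own_dir" >&2
            own_status=1
        fi
    done
    return "$own_status"
}

# Example:
#   dirs_owned_by "$(id -u slurm)" /var/spool/slurm/ctld /var/spool/slurm/d \
#       /var/log/slurm /var/run/slurm
```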

Download and copy example slurm conf and service files

# on all nodes
git clone https://github.com/mknoxnv/ubuntu-slurm.git
cp ubuntu-slurm/slurm.conf /etc/slurm/


# on slurmd node
cp ubuntu-slurm/slurmd.init /etc/init.d/slurmd
cp ubuntu-slurm/slurm.default /etc/default/slurm
chmod 755 /etc/init.d/slurmd
cp ubuntu-slurm/slurmd.service /etc/systemd/system/

# on slurmctld node
cp ubuntu-slurm/slurmdbd.init /etc/init.d/slurmdbd
chmod 755 /etc/init.d/slurmdbd
cp ubuntu-slurm/slurmdbd.service /etc/systemd/system/
cp ubuntu-slurm/slurmdbd.conf /etc/slurm/
cp ubuntu-slurm/slurmctld.service /etc/systemd/system/

chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /var/run/slurm

Remove slurm install files

rm -rf slurm-21.08.5.tar.bz2
rm -rf slurm-21.08.5_1.0_amd64.deb 

Generate slurm jwt key and set ownership/permissions (slurmctld and slurmrestd hosts only)

dd if=/dev/random of=/etc/slurm/jwt_hs256.key bs=32 count=1
chown slurm:slurm /etc/slurm/jwt_hs256.key
chmod 600 /etc/slurm/jwt_hs256.key
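
The key file can be sanity-checked after it is generated. The following sketch assumes the key should be exactly the 32 bytes requested from dd and carry mode 600, as set above; `jwt_key_ok` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the key file is exactly 32 bytes with mode 600.
jwt_key_ok() {
    key_file="$1"
    [ -f "$key_file" ] || return 1
    key_size="$(wc -c < "$key_file" | tr -d ' ')"
    key_mode="$(stat -c '%a' "$key_file")"   # GNU stat: octal permission bits
    [ "$key_size" = "32" ] && [ "$key_mode" = "600" ]
}

# Example (as root):
#   jwt_key_ok /etc/slurm/jwt_hs256.key || echo "regenerate or re-chmod the JWT key" >&2
```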

Set ownership/permissions for slurmdbd conf (slurmctld host only, as of 20.11.8)

chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf

Initialize mysql database (slurmctld host only)

Enable and start the slurmdbd service, then execute the initialize-mariadb.sh script as follows:

systemctl enable slurmdbd
systemctl start slurmdbd
mysql -u root < initialize-mariadb.sh

Sync munge keys across cluster

# On slurmctld host
sudo cp /etc/munge/munge.key /tmp
sudo chmod 644 /tmp/munge.key
scp /tmp/munge.key <slurm worker>:/tmp
sudo rm -rf /tmp/munge.key

# On each slurmd host
sudo cp /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl restart munge
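
To confirm that the key really is identical on every host, one option is to compare a digest of the file rather than the file itself. This is a sketch with a hypothetical helper, `key_digest`; it assumes `sha256sum` from coreutils is available.

```shell
# Hypothetical helper: print only the SHA-256 digest of a file.
key_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Example, run as root on each host and compare the printed values:
#   key_digest /etc/munge/munge.key
```

MUNGE also provides its own end-to-end check: `munge -n | unmunge` on a single host, or `munge -n | ssh <slurm worker> unmunge` between two hosts, should report success when the keys match.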

Sync slurm.conf across all Slurm nodes

scp /etc/slurm/slurm.conf <slurmd/slurmrestd host>:/etc/slurm/

Install and enable Slurm services

Copy slurmctld.service to the Slurm master node(s), slurmd.service to all worker nodes, and slurmrestd.service to the slurmrestd host.

# On slurmctld node
systemctl enable slurmctld
systemctl start slurmctld

# Ensure slurmctld is running, then start slurmd service on Slurm workers

# On slurmd nodes
systemctl enable slurmd
systemctl start slurmd

# Ensure slurmctld is running on the Slurm master and slurmd is running on the Slurm workers, then start the slurmrestd service

# On Slurm slurmrestd node
systemctl enable slurmrestd
systemctl start slurmrestd
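
The "ensure the previous service is running" steps above can be scripted with a small retry loop. This is a sketch only: `wait_until` is a hypothetical helper that reruns any command once per second until it succeeds or the attempts run out.

```shell
# Hypothetical helper: retry a command up to N times, one second apart.
wait_until() {
    wait_attempts="$1"
    shift
    while [ "$wait_attempts" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        wait_attempts=$((wait_attempts - 1))
        if [ "$wait_attempts" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Example: wait up to 30 seconds for slurmctld before starting slurmd
#   wait_until 30 systemctl is-active --quiet slurmctld && systemctl start slurmd
```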

Slurm Upgrades

Upgrading Slurm basically involves two steps: (1) remove all existing Slurm components and (2) install the new Slurm version.

Removing installed Slurm components

# get slurm version
slurmctld -V 
slurm-20.02.7

# remove all slurm components
dpkg -P slurm-20.02.7

Slurm Administration

Compute Node(s) Stuck in Down Mode

If a compute (slurmd) node remains in down mode, manually reset it to idle state as follows:

scontrol update nodename=<node name> state=idle
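
When several nodes are affected, the down nodes can be listed first. The sketch below filters `sinfo` output; `down_nodes` is a hypothetical helper, and it assumes the input is one `<node> <state>` pair per line, as produced by the format string in the example.

```shell
# Hypothetical helper: read "<node> <state>" lines on stdin and print the names
# of nodes whose state starts with "down" (for example "down" or "down*").
down_nodes() {
    awk '$2 ~ /^down/ { print $1 }' | sort -u
}

# Example:
#   sinfo -h -N -o '%N %T' | down_nodes
```

Each printed name can then be passed to the scontrol command above.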

Compute Node(s) Down, slurmd Running

Examine /var/log/slurm/slurmd.log, which usually contains clues as to why the slurmd process fails to acknowledge pings from slurmctld. One error message I saw recently is as follows:

[2022-01-22T11:13:57.029] error: Error creating slurm stream socket: Address family not supported by protocol
[2022-01-22T11:13:57.029] error: Unable to bind listen port (6818): Address family not supported by protocol

This error message indicates a startup race condition with respect to the network interface: slurmd starts before the network is online. The fix for this was reported by David Bremner. Specifically, update the slurmd.service file by adding the following network-online.target clauses:

After=network-online.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
Wants=network-online.target
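
Whether a given unit file already carries these clauses can be checked with grep. This is a minimal sketch; `unit_waits_for_network` is a hypothetical helper, and the path in the example assumes the unit file was copied to /etc/systemd/system/ as described earlier on this page.

```shell
# Hypothetical helper: succeed if the unit file has both an After= and a Wants=
# line that mention network-online.target.
unit_waits_for_network() {
    unit_file="$1"
    grep -Eq '^After=.*network-online\.target' "$unit_file" &&
        grep -Eq '^Wants=.*network-online\.target' "$unit_file"
}

# Example:
#   unit_waits_for_network /etc/systemd/system/slurmd.service
# After editing the unit file, reload systemd and restart the service:
#   systemctl daemon-reload && systemctl restart slurmd
```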

Zero Bytes Transmitted

slurmd: error: Unable to register: Zero Bytes were transmitted or received
slurmd: error: Munge decode failed: Invalid credential
slurmd: auth/munge: _print_cred: ENCODED: Wed Dec 31 19:00:00 1969
slurmd: auth/munge: _print_cred: DECODED: Wed Dec 31 19:00:00 1969
slurmd: error: slurm_receive_msg_and_forward: [[robinson]:47716] auth_g_verify: REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
slurmd: error: slurm_receive_msg_and_forward: [[robinson]:47716] failed: Protocol authentication error
slurmd: error: service_connection: slurm_receive_msg: Protocol authentication error
slurmd: error: Unable to register: Zero Bytes were transmitted or received

This means one of the following is true:

  1. munge.key does not match the key on the Slurm controller node
  2. the munge user is not the owner of the munge.key file
  3. munge.key permissions are not 400 or 600 (accessible by the owner only)
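
The ownership and permission causes can be checked directly on a node. The sketch below makes one assumption worth stating: it treats both mode 400 (set in the key sync step on this page) and mode 600 as acceptable. `key_perms_ok` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the file is owned by the given UID and its
# mode is 400 or 600 (no access for group or others).
key_perms_ok() {
    perm_file="$1"
    perm_uid="$2"
    [ -f "$perm_file" ] || return 1
    [ "$(stat -c '%u' "$perm_file")" = "$perm_uid" ] || return 1
    case "$(stat -c '%a' "$perm_file")" in
        400|600) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (as root):
#   key_perms_ok /etc/munge/munge.key "$(id -u munge)" || echo "fix munge.key owner/mode" >&2
```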

cgroup/v2 Error

slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed

To fix, add a cgroup.conf file with the following content to the /etc/slurm directory:

CgroupPlugin=cgroup/v1
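
Creating the file can be scripted so that an existing cgroup.conf is never overwritten. This is a sketch under that assumption; `write_cgroup_conf` is a hypothetical helper, and the restart in the example is the usual way to make slurmd pick up the change.

```shell
# Hypothetical helper: create cgroup.conf in the given directory if it is absent.
write_cgroup_conf() {
    conf_dir="$1"
    conf_file="$conf_dir/cgroup.conf"
    if [ -e "$conf_file" ]; then
        echo "$conf_file already exists, leaving it unchanged" >&2
        return 0
    fi
    printf 'CgroupPlugin=cgroup/v1\n' > "$conf_file"
}

# Example (as root):
#   write_cgroup_conf /etc/slurm && systemctl restart slurmd
```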