Slurm installation and administration
The following instructions detail installing the Slurm stack--including slurmrestd--on Ubuntu, starting with Slurm version slurm-21.08.5. These instructions are based upon this excellent example Slurm installation.
There are two sets of Slurm dependencies that must be installed: (1) the standard Slurm stack and (2) slurmrestd.
The following commands are executed as root to install all dependencies for the basic Slurm stack.
# 1. Install all base dependencies
apt-get update && apt install sudo git gcc make ruby ruby-dev python3 \
libpam0g-dev libmariadb-client-lgpl-dev libmysqlclient-dev wget vim curl -y
# 2. Install gem needed for dpkg
gem install fpm
# 3. Install munge
apt-get install libmunge-dev libmunge2 munge -y
# 4. Install hdf5
apt-get install libhdf5-serial-dev hdf5-tools -y
# 5. Install MariaDB for slurm accounting (slurmctld node only)
apt-get install mariadb-server -y
apt-get install cmake libhttp-parser-dev libyaml-dev libjson-c-dev autoconf automake \
autotools-dev libtool pkg-config libjansson-dev -y
git clone --depth 1 --single-branch -b v1.12.0 https://github.com/benmcollins/libjwt.git libjwt
cd libjwt
autoreconf --force --install
./configure --prefix=/usr/
make -j
make install
useradd -m -u 1004 slurm
It is critical to specify the slurm user ID explicitly: if the ID is 1001, the slurmctld pings to the slurmd processes will fail. Use the same ID for the slurm user on every node.
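Before running useradd, it can help to confirm that the chosen UID is not already taken on a node. The following is a minimal sketch; `uid_owner` is a hypothetical helper name, not part of Slurm, and it assumes a standard passwd-format file.

```shell
# Print the account name that owns a given UID in a passwd-format file.
# Prints nothing and returns non-zero when the UID is unused.
# uid_owner is a hypothetical helper, not part of Slurm.
uid_owner() {
    # $1 = UID to look up, $2 = passwd file (defaults to /etc/passwd)
    awk -F: -v uid="$1" '$3 == uid { print $1; found = 1 } END { exit !found }' \
        "${2:-/etc/passwd}"
}

# Example: run on every node before `useradd -m -u 1004 slurm`
# uid_owner 1004 || echo "UID 1004 is free"
```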
For slurm version 21.08.5 and beyond, the slurmrestd daemon must be run as a user other than slurm or root. Consequently, another user has to be created and configured in the slurmrestd.service file. In my case, the slurmrestd service is executed as the slurmrestd user.
useradd slurmrestd
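The restriction above can be checked against the unit file before starting the service. This is a sketch only: `check_restd_user` is a hypothetical helper, and it assumes the unit sets the account with a plain `User=` line (a unit without one runs as root under systemd).

```shell
# Print the account a unit file runs as, and fail when it is root or slurm.
# check_restd_user is a hypothetical helper; the unit path is up to the caller.
check_restd_user() {
    # $1 = path to slurmrestd.service
    restd_user=$(sed -n 's/^User=//p' "$1" | tail -n 1)
    case "$restd_user" in
        ''|root|slurm)
            # An absent User= line means systemd runs the service as root.
            echo "slurmrestd must not run as '${restd_user:-root}'" >&2
            return 1 ;;
        *)
            echo "$restd_user" ;;
    esac
}

# Example:
# check_restd_user /etc/systemd/system/slurmrestd.service
```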
# on all nodes
service munge start
# on slurmctld host only
service mysql start
mkdir /storage
cd /storage
wget https://download.schedmd.com/slurm/slurm-21.08.5.tar.bz2
tar xvf slurm-21.08.5.tar.bz2
cd slurm-21.08.5
./configure --prefix=/storage/slurm-build --sysconfdir=/etc/slurm --enable-pam \
--with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm \
--with-http-parser=/usr/ --with-yaml=/usr/ --with-jwt=/usr/ \
--enable-slurmrestd
make
make contrib
make install
cd /storage
fpm -s dir -t deb -v 1.0 -n slurm-21.08.5 --prefix=/usr -C /storage/slurm-build .
dpkg -i slurm-21.08.5_1.0_amd64.deb
For newer Slurm releases (the error output below comes from slurm-22.05.3), the configure command is as follows:
./configure --prefix=/storage/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --with-http-parser=/usr/ --with-yaml=/usr/ --with-jwt=/usr/
Note that the --without-shared-libslurm argument is now omitted. If --without-shared-libslurm remains in the configure parameter list, a difficult-to-debug link error such as the following is produced:
libtool: link: ranlib .libs/libslurmrest_ref.a
libtool: link: ( cd ".libs" && rm -f "libslurmrest_ref.la" && ln -s "../libslurmrest_ref.la" "libslurmrest_ref.la" )
/bin/bash ../../libtool --tag=CC --mode=link gcc -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -export-dynamic -o slurmrestd http.o operations.o slurmrestd.o rest_auth.o ../../src/api/libslurm.o -ldl -L/usr/lib64 -lhttp_parser libslurmrest_ref.la -lpthread -lm -lresolv
libtool: link: gcc -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -o slurmrestd http.o operations.o slurmrestd.o rest_auth.o ../../src/api/libslurm.o -Wl,--export-dynamic -ldl -L/usr/lib64 -lhttp_parser ./.libs/libslurmrest_ref.a -lpthread -lm -lresolv -pthread
/usr/bin/ld: ../../src/api/libslurm.o: in function `parse_host_port':
/storage/slurm-22.05.3/src/common/http.c:308: multiple definition of `parse_host_port'; http.o:/storage/slurm-22.05.3/src/slurmrestd/http.c:803: first defined here
/usr/bin/ld: ../../src/api/libslurm.o: in function `free_parse_host_port':
/storage/slurm-22.05.3/src/common/http.c:314: multiple definition of `free_parse_host_port'; http.o:/storage/slurm-22.05.3/src/slurmrestd/http.c:856: first defined here
collect2: error: ld returned 1 exit status
make[4]: *** [Makefile:642: slurmrestd] Error 1
make[4]: Leaving directory '/storage/slurm-22.05.3/src/slurmrestd'
make[3]: *** [Makefile:695: all-recursive] Error 1
make[3]: Leaving directory '/storage/slurm-22.05.3/src/slurmrestd'
make[2]: *** [Makefile:545: all-recursive] Error 1
make[2]: Leaving directory '/storage/slurm-22.05.3/src'
make[1]: *** [Makefile:624: all-recursive] Error 1
make[1]: Leaving directory '/storage/slurm-22.05.3'
Also of note, slurmrestd is now built by default and must be explicitly disabled via the --disable-slurmrestd configure parameter.
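Since the link error is hard to trace back to the configure flag, a build script could refuse to start when the flag is still present. A minimal sketch, assuming the configure arguments are passed through a wrapper; `check_configure_args` is a hypothetical helper, not part of the Slurm build system.

```shell
# Return non-zero if a configure argument list still contains the flag that
# leads to the "multiple definition" link error in newer Slurm releases.
# check_configure_args is a hypothetical helper.
check_configure_args() {
    for arg in "$@"; do
        if [ "$arg" = "--without-shared-libslurm" ]; then
            echo "remove --without-shared-libslurm before building slurmrestd" >&2
            return 1
        fi
    done
    return 0
}

# Example, inside a wrapper script that receives the configure arguments:
# check_configure_args "$@" && ./configure "$@"
```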
root@robinson:/storage# slurmrestd -a list
slurmrestd: Possible REST authentication plugins:
slurmrestd: rest_auth/local
slurmrestd: rest_auth/jwt
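The listing above can also be checked from a script, for example to confirm that JWT support was compiled in. This is a sketch under the assumption that the listing keeps the format shown; `has_rest_auth` is a hypothetical helper.

```shell
# Succeed when the plugin listing on stdin mentions the requested rest_auth plugin.
# has_rest_auth is a hypothetical helper; $1 is the plugin suffix, e.g. jwt or local.
has_rest_auth() {
    grep -q "rest_auth/${1}[[:space:]]*\$"
}

# Example (2>&1 in case the listing is written to stderr):
# slurmrestd -a list 2>&1 | has_rest_auth jwt || echo "rest_auth/jwt missing"
```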
MariaDB is installed differently on Ubuntu 22.04. Consequently, the Slurm dependency installation on Ubuntu 22.04 is as follows:
apt-get update && apt install sudo git gcc make ruby ruby-dev python3 libpam0g-dev libmysqlclient-dev mariadb-client wget vim curl -y
mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld \
/var/spool/slurm/d /var/log/slurm /var/run/slurm
chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /var/run/slurm
# on all nodes
git clone https://github.com/mknoxnv/ubuntu-slurm.git
cp ubuntu-slurm/slurm.conf /etc/slurm/
# on slurmd node
cp ubuntu-slurm/slurmd.init /etc/init.d/slurmd
cp ubuntu-slurm/slurm.default /etc/default/slurm
chmod 755 /etc/init.d/slurmd
cp ubuntu-slurm/slurmd.service /etc/systemd/system/
# on slurmctld node
cp ubuntu-slurm/slurmdbd.init /etc/init.d/slurmdbd
chmod 755 /etc/init.d/slurmdbd
cp ubuntu-slurm/slurmdbd.service /etc/systemd/system/
cp ubuntu-slurm/slurmdbd.conf /etc/slurm/
cp ubuntu-slurm/slurmctld.service /etc/systemd/system/
chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /var/run/slurm
rm -rf slurm-21.08.5.tar.bz2
rm -rf slurm-21.08.5_1.0_amd64.deb
dd if=/dev/random of=/etc/slurm/jwt_hs256.key bs=32 count=1
chown slurm:slurm /etc/slurm/jwt_hs256.key
chmod 600 /etc/slurm/jwt_hs256.key
chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
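After the steps above, a quick sanity check can confirm that the key was written in full and that both files are readable only by their owner. A minimal sketch; `check_secret_file` is a hypothetical helper, and `stat -c` assumes GNU coreutils, as shipped with Ubuntu.

```shell
# Verify that a file has mode 600 and, optionally, a minimum size in bytes.
# check_secret_file is a hypothetical helper; stat -c assumes GNU coreutils.
check_secret_file() {
    # $1 = file, $2 = minimum size in bytes (optional, defaults to 0)
    secret_mode=$(stat -c '%a' "$1") || return 1
    secret_size=$(wc -c < "$1")
    [ "$secret_mode" = "600" ] || { echo "$1: mode $secret_mode, expected 600" >&2; return 1; }
    [ "$secret_size" -ge "${2:-0}" ] || { echo "$1: only $secret_size bytes" >&2; return 1; }
}

# Example: the key was created with bs=32 count=1, so expect at least 32 bytes
# check_secret_file /etc/slurm/jwt_hs256.key 32 && check_secret_file /etc/slurm/slurmdbd.conf
```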
Enable and start the slurmdbd service, then run the initialize-mariadb.sh script as follows:
systemctl enable slurmdbd
systemctl start slurmdbd
mysql -u root < initialize-mariadb.sh
# On slurmctld host
sudo cp /etc/munge/munge.key /tmp
sudo chmod 644 /tmp/munge.key
scp /tmp/munge.key <slurm worker>:/tmp
sudo rm -rf /tmp/munge.key
# On each slurmd host
sudo cp /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl restart munge
scp /etc/slurm/slurm.conf <slurmd/slurmrestd host>:/etc/slurm/
Copy slurmctld.service to the Slurm master node(s), slurmd.service to all worker nodes, and slurmrestd.service to the slurmrestd host.
# On slurmctld node
systemctl enable slurmctld
systemctl start slurmctld
# Ensure slurmctld is running, then start slurmd service on Slurm workers
# On slurmd nodes
systemctl enable slurmd
systemctl start slurmd
# Ensure slurmctld is running on the Slurm master and slurmd is running on the Slurm workers, then start the slurmrestd service
# On Slurm slurmrestd node
systemctl enable slurmrestd
systemctl start slurmrestd
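Because the start order matters, a script might wait for one service to report active before moving on to the next. The following is a sketch of a generic retry loop; `wait_until` is a hypothetical helper, and the one-second interval is an arbitrary choice.

```shell
# Retry a command once per second until it succeeds or the attempts run out.
# wait_until is a hypothetical helper; $1 = max attempts, the rest is the command.
wait_until() {
    wait_tries=$1
    shift
    while [ "$wait_tries" -gt 0 ]; do
        "$@" && return 0
        wait_tries=$((wait_tries - 1))
        # Sleep only when another attempt will follow.
        [ "$wait_tries" -gt 0 ] && sleep 1
    done
    return 1
}

# Example on the slurmctld node: block until the service reports active
# systemctl start slurmctld && wait_until 30 systemctl is-active --quiet slurmctld
```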
Upgrading Slurm basically involves two steps: (1) remove all existing Slurm components and (2) install the new Slurm version.
# get slurm version
slurmctld -V
slurm-20.02.7
# remove all slurm components
dpkg -P slurm-20.02.7
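The package name passed to dpkg -P follows the `-n slurm-<version>` name given to fpm during installation, so it can be derived from the reported version. A sketch, assuming the version line reads either "slurm 20.02.7" or "slurm-20.02.7"; `slurm_pkg_name` is a hypothetical helper.

```shell
# Turn the version line on stdin into the package name used with dpkg -P.
# Accepts "slurm 20.02.7" or "slurm-20.02.7"; prints nothing for other input.
# slurm_pkg_name is a hypothetical helper.
slurm_pkg_name() {
    sed -n 's/^slurm[ -]\([0-9][0-9.]*\).*$/slurm-\1/p' | head -n 1
}

# Example: print the package name, then review it before purging
# slurmctld -V | slurm_pkg_name
```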
If a compute (slurmd) node remains in down mode, manually reset it to idle state as follows:
scontrol update nodename=<node name> state=idle
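When several nodes are down, the node names can be collected from sinfo rather than typed by hand. This is a sketch: `down_nodes` is a hypothetical helper, and the sinfo format string in the example is an assumption about how the listing is requested.

```shell
# Print the names of nodes whose state starts with "down", reading
# "name state" lines from stdin. down_nodes is a hypothetical helper.
down_nodes() {
    awk 'tolower($2) ~ /^down/ { print $1 }'
}

# Example: print, rather than run, one reset command per down node
# sinfo -h -N -o '%N %T' | down_nodes | sort -u | while read -r node; do
#     echo scontrol update nodename="$node" state=idle
# done
```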
Examine /var/log/slurm/slurmd.log, which usually contains clues as to why the slurmd process fails to acknowledge pings from slurmctld. One error message I saw recently is as follows:
[2022-01-22T11:13:57.029] error: Error creating slurm stream socket: Address family not supported by protocol
[2022-01-22T11:13:57.029] error: Unable to bind listen port (6818): Address family not supported by protocol
This error message indicates a startup race condition with respect to the network interface. The fix for this was reported by David Bremner. Specifically, update the slurmd.service file by adding the following network-online.target clauses:
After=network-online.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
Wants=network-online.target
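Whether a given unit file already carries these clauses can be checked with a short script. A minimal sketch; `unit_waits_for_network` is a hypothetical helper, and the path in the example is the one used in the copy step earlier in this document.

```shell
# Succeed when a unit file both wants and orders itself after network-online.target.
# unit_waits_for_network is a hypothetical helper.
unit_waits_for_network() {
    grep -Eq '^After=.*network-online\.target' "$1" &&
        grep -Eq '^Wants=.*network-online\.target' "$1"
}

# Example:
# unit_waits_for_network /etc/systemd/system/slurmd.service || echo "clauses missing"
```

After editing the unit file, systemd needs a `systemctl daemon-reload` before a restart of slurmd picks up the change.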
slurmd: error: Unable to register: Zero Bytes were transmitted or received
slurmd: error: Munge decode failed: Invalid credential
slurmd: auth/munge: _print_cred: ENCODED: Wed Dec 31 19:00:00 1969
slurmd: auth/munge: _print_cred: DECODED: Wed Dec 31 19:00:00 1969
slurmd: error: slurm_receive_msg_and_forward: [[robinson]:47716] auth_g_verify: REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
slurmd: error: slurm_receive_msg_and_forward: [[robinson]:47716] failed: Protocol authentication error
slurmd: error: service_connection: slurm_receive_msg: Protocol authentication error
slurmd: error: Unable to register: Zero Bytes were transmitted or received
This means one of the following is true:
- munge.key does not match the key on the Slurm controller node
- the munge user is not the owner of the munge.key file
- munge.key permissions are not restricted to the owner (400, as set during installation, or 600)
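All three causes can be checked at once by printing the owner, mode, and a digest of the key on each host and comparing the lines. A sketch only: `munge_key_summary` is a hypothetical helper, and `stat -c` and `sha256sum` assume GNU coreutils.

```shell
# Print "owner mode sha256" for a key file so the line can be compared across hosts.
# The digest is printed instead of the key so the secret itself is never shown.
# munge_key_summary is a hypothetical helper.
munge_key_summary() {
    # $1 = key file (defaults to /etc/munge/munge.key)
    key_file=${1:-/etc/munge/munge.key}
    key_stat=$(stat -c '%U %a' "$key_file") || return 1
    key_sum=$(sha256sum "$key_file" | cut -d' ' -f1)
    echo "$key_stat $key_sum"
}

# Example: run as root on the controller and on each worker. Every host should
# print the same line, with owner "munge" and an owner-only mode such as 400.
# munge_key_summary
```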
slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
To fix, add a cgroup.conf file with the following content to the /etc/slurm directory:
CgroupPlugin=cgroup/v1
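On a cluster with many workers, the same fix can be applied from a script that is safe to run more than once. A minimal sketch; `ensure_cgroup_v1` is a hypothetical helper, and it deliberately leaves any existing CgroupPlugin line untouched.

```shell
# Append CgroupPlugin=cgroup/v1 to cgroup.conf in the given directory unless a
# CgroupPlugin line is already present. ensure_cgroup_v1 is a hypothetical helper.
ensure_cgroup_v1() {
    # $1 = Slurm config directory (defaults to /etc/slurm)
    cgroup_conf="${1:-/etc/slurm}/cgroup.conf"
    if [ -f "$cgroup_conf" ] && grep -q '^CgroupPlugin=' "$cgroup_conf"; then
        return 0
    fi
    echo 'CgroupPlugin=cgroup/v1' >> "$cgroup_conf"
}

# Example: apply on a worker, then restart slurmd so it rereads the configuration
# ensure_cgroup_v1 /etc/slurm && systemctl restart slurmd
```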