Support compute node rebuild/reboot via Slurm RebootProgram (#553)
* add rebuild role to appliance and modify group_vars

* improve readability of group_vars

* Define login nodes using an opentofu module (#547)

* define login nodes using tf module

* Apply suggestions from code review

Co-authored-by: Matt Anson <[email protected]>

* tweak README to explain compute groups

* try to clarify login/compute groups

---------

Co-authored-by: Matt Anson <[email protected]>

* Change docs/ references from Terraform to OpenTofu (#544)

* change terraform references to opentofu in docs

* remove wider reference to terraform

* Update environments/README.md

Co-authored-by: Steve Brasier <[email protected]>

* Update environments/common/README.md

Co-authored-by: Steve Brasier <[email protected]>

---------

Co-authored-by: Steve Brasier <[email protected]>

* fix instance_id in compute inventory to be target image, not deployed image

* review all roles for compute_init_enable

* fix permissions to /exports/cluster

* make openhpc_config more greppable

* Set ResumeTimeout and ReturnToService overrides in group_vars

* CI tests for reboot via slurm (without rebuild)

* pin rocky 8 pytools venv version

* refining comments and task names

* rebuild role readme

---------

Co-authored-by: Steve Brasier <[email protected]>
Co-authored-by: Matt Anson <[email protected]>
Co-authored-by: Steve Brasier <[email protected]>
4 people authored Feb 11, 2025
1 parent 7c831c7 commit 112aa6e
Showing 20 changed files with 233 additions and 72 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/stackhpc.yml
@@ -178,12 +178,13 @@ jobs:
       ansible-playbook -v ansible/site.yml
       ansible-playbook -v ansible/ci/check_slurm.yml
-  - name: Test reimage of compute nodes and compute-init (via rebuild adhoc)
+  - name: Test compute node reboot and compute-init
     run: |
       . venv/bin/activate
       . environments/.stackhpc/activate
       ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml
       ansible-playbook -v ansible/ci/check_slurm.yml
+      ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
   - name: Check sacct state survived reimage
     run: |
2 changes: 2 additions & 0 deletions ansible/.gitignore
@@ -80,3 +80,5 @@ roles/*
 !roles/slurm_stats/**
 !roles/pytools/
 !roles/pytools/**
+!roles/rebuild/
+!roles/rebuild/**
24 changes: 24 additions & 0 deletions ansible/adhoc/reboot_via_slurm.yml
@@ -0,0 +1,24 @@
# Reboot compute nodes via slurm. Nodes will be rebuilt if `image_id` in inventory is different to the currently-provisioned image.
# Example:
# ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml

- hosts: login
  run_once: true
  become: yes
  gather_facts: no
  tasks:
    - name: Submit a Slurm job to reboot compute nodes
      ansible.builtin.shell: |
        set -e
        srun --reboot -N 2 uptime
      become_user: root
      register: slurm_result
      failed_when: slurm_result.rc != 0

    - name: Fetch Slurm controller logs if reboot fails
      ansible.builtin.shell: |
        journalctl -u slurmctld --since "10 minutes ago" | tail -n 50
      become_user: root
      register: slurm_logs
      when: slurm_result.rc != 0
      delegate_to: "{{ groups['control'] | first }}"
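
For context, `srun --reboot` asks `slurmctld` to run its configured `RebootProgram` on the allocated nodes, which is what lets the rebuild role turn a reboot request into an OpenStack rebuild. A minimal sketch of how the `ResumeTimeout`/`ReturnToService` overrides mentioned in the commit message could be expressed via the `stackhpc.openhpc` role's `openhpc_config` mechanism — the `RebootProgram` path and values below are illustrative assumptions, not the appliance's actual settings:

```yaml
# Hypothetical group_vars sketch -- values and the RebootProgram path are assumptions:
openhpc_config:
  RebootProgram: /opt/slurm-tools/bin/slurm-openstack-rebuild  # assumed tool path
  ResumeTimeout: 300   # seconds slurmctld waits for a rebooted/rebuilt node to register
  ReturnToService: 2   # return a DOWN node to service once it registers with a valid config
```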
127 changes: 103 additions & 24 deletions ansible/roles/compute_init/README.md
@@ -1,11 +1,104 @@
-# EXPERIMENTAL: compute-init
-
-Experimental / in-progress functionality to allow compute nodes to rejoin the
-cluster after a reboot.
-
-To enable this add compute nodes (or a subset of them into) the `compute_init`
-group.
-
# EXPERIMENTAL: compute_init

Experimental functionality to allow compute nodes to rejoin the cluster after
a reboot without running the `ansible/site.yml` playbook.

To enable this:
1. Add the `compute` group (or a subset) into the `compute_init` group. This is
the default when using cookiecutter to create an environment, via the
"everything" template.
2. Build an image which includes the `compute_init` group. This is the case
for StackHPC-built release images.
3. Enable the required functionalities during boot, by setting the
`compute_init_enable` property for a compute group in the
OpenTofu `compute` variable to a list which includes "compute", plus the
other roles/functionalities required, e.g.:

```terraform
...
compute = {
general = {
nodes = ["general-0", "general-1"]
compute_init_enable = ["compute", ... ] # see below
}
}
...
```

## Supported appliance functionalities

The string "compute" must be present in the `compute_init_enable` list to enable
this functionality. The table below shows which other appliance functionalities
are currently supported - use the name in the "Role (or functionality)" column
to enable them.

| Playbook | Role (or functionality) | Support |
| -------------------------|-------------------------|-----------------|
| hooks/pre.yml | ? | None at present |
| validate.yml | n/a | Not relevant during boot |
| bootstrap.yml | (wait for ansible-init) | Not relevant during boot |
| bootstrap.yml | resolv_conf | Fully supported |
| bootstrap.yml | etc_hosts | Fully supported |
| bootstrap.yml | proxy | None at present |
| bootstrap.yml | (/etc permissions) | None required - use image build |
| bootstrap.yml | (ssh /home fix) | None required - use image build |
| bootstrap.yml | (system users) | None required - use image build |
| bootstrap.yml | systemd | None required - use image build |
| bootstrap.yml | selinux | None required - use image build |
| bootstrap.yml | sshd | None at present |
| bootstrap.yml | dnf_repos | None at present (requirement TBD) |
| bootstrap.yml | squid | Not relevant for compute nodes |
| bootstrap.yml | tuned | None |
| bootstrap.yml | freeipa_server | Not relevant for compute nodes |
| bootstrap.yml | cockpit | None required - use image build |
| bootstrap.yml | firewalld | Not relevant for compute nodes |
| bootstrap.yml | fail2ban | Not relevant for compute nodes |
| bootstrap.yml | podman | Not relevant for compute nodes |
| bootstrap.yml | update | Not relevant during boot |
| bootstrap.yml | reboot | Not relevant for compute nodes |
| bootstrap.yml | ofed | Not relevant during boot |
| bootstrap.yml | ansible_init (install) | Not relevant during boot |
| bootstrap.yml | k3s (install) | Not relevant during boot |
| hooks/post-bootstrap.yml | ? | None at present |
| iam.yml | freeipa_client | None at present [1] |
| iam.yml | freeipa_server | Not relevant for compute nodes |
| iam.yml | sssd | None at present |
| filesystems.yml | block_devices | None required - role deprecated |
| filesystems.yml | nfs | All client functionality |
| filesystems.yml | manila | All functionality |
| filesystems.yml | lustre | None at present |
| extras.yml | basic_users | All functionality [2] |
| extras.yml | eessi | All functionality [3] |
| extras.yml | cuda | None required - use image build [4] |
| extras.yml | persist_hostkeys | Not expected to be required for compute nodes |
| extras.yml | compute_init (export) | Not relevant for compute nodes |
| extras.yml | k9s (install) | Not relevant during boot |
| extras.yml | extra_packages | None at present. Would require dnf_repos |
| slurm.yml | mysql | Not relevant for compute nodes |
| slurm.yml | rebuild | Not relevant for compute nodes |
| slurm.yml | openhpc [5] | All slurmd-related functionality |
| slurm.yml | (set memory limits) | None at present |
| slurm.yml | (block ssh) | None at present |
| portal.yml | (openondemand server) | Not relevant for compute nodes |
| portal.yml | (openondemand vnc desktop) | None required - use image build |
| portal.yml | (openondemand jupyter server) | None required - use image build |
| monitoring.yml | (all monitoring) | None at present [6] |
| disable-repos.yml | dnf_repos | None at present (requirement TBD) |
| hooks/post.yml | ? | None at present |


Notes:
1. FreeIPA client functionality would be better provided using a client fork
   which uses pkinit keys rather than OTP to re-enrol nodes.
2. Assumes the home directory already exists on shared storage.
3. Assumes `cvmfs_config` is the same on the control node and all compute nodes.
4. If the `cuda` role was run during build, the nvidia-persistenced service is
   enabled and will start during boot.
5. `openhpc` does not need to be added to `compute_init_enable`; it is
   automatically enabled by adding `compute`.
6. Only node-exporter tasks are relevant; these will be done via k3s in a
   future release.


## Approach
This works as follows:
1. During image build, an ansible-init playbook and supporting files
(e.g. templates, filters, etc) are installed.
@@ -31,21 +124,7 @@
The check in 4b. above is what prevents the compute-init script from trying
to configure the node before the services on the control node are available
(which requires running the site.yml playbook).
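
To make that gating concrete, here is a minimal sketch of what such a guard
could look like in the ansible-init playbook — the mount path, file name and
task details are assumptions for illustration, not the shipped playbook:

```yaml
# Sketch only: skip compute-init until the control node's export is usable.
# The /mnt/cluster path and hostvars file name are assumed for illustration.
- name: Check this node's hostvars have been exported by the control node
  ansible.builtin.stat:
    path: /mnt/cluster/hostvars/{{ inventory_hostname }}/hostvars.yml
  register: _hostvars_file

- name: End play if cluster services are not yet up
  ansible.builtin.meta: end_play
  when: not _hostvars_file.stat.exists
```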

-The following roles/groups are currently fully functional:
-- `resolv_conf`: all functionality
-- `etc_hosts`: all functionality
-- `nfs`: client functionality only
-- `manila`: all functionality
-- `basic_users`: all functionality, assumes home directory already exists on
-  shared storage
-- `eessi`: all functionality, assumes `cvmfs_config` is the same on control
-  node and all compute nodes.
-- `openhpc`: all functionality
-
-The above may be enabled by setting the compute_init_enable property on the
-tofu compute variable.
-
-# Development/debugging
+## Development/debugging

To develop/debug changes to the compute script without actually having to build
a new image:
@@ -83,7 +162,7 @@
reimage the compute node(s) first as in step 2 and/or add additional metadata
as in step 3.


-# Design notes
+## Design notes
- Duplicating code in roles into the `compute-init` script is unfortunate, but
does allow developing this functionality without wider changes to the
appliance.
16 changes: 11 additions & 5 deletions ansible/roles/compute_init/tasks/export.yml
@@ -2,9 +2,9 @@
   file:
     path: /exports/cluster
     state: directory
-    owner: root
+    owner: slurm
     group: root
-    mode: u=rwX,go=
+    mode: u=rX,g=rwX,o=
   run_once: true
   delegate_to: "{{ groups['control'] | first }}"

@@ -23,21 +23,27 @@
   file:
     path: /exports/cluster/hostvars/{{ inventory_hostname }}/
     state: directory
-    mode: u=rwX,go=
-    # TODO: owner,mode,etc
+    owner: slurm
+    group: root
+    mode: u=rX,g=rwX,o=
   delegate_to: "{{ groups['control'] | first }}"

 - name: Template out hostvars
   template:
     src: hostvars.yml.j2
     dest: /exports/cluster/hostvars/{{ inventory_hostname }}/hostvars.yml
-    mode: u=rw,go=
+    owner: slurm
+    group: root
+    mode: u=r,g=rw,o=
   delegate_to: "{{ groups['control'] | first }}"

 - name: Copy manila share info to /exports/cluster
   copy:
     content: "{{ os_manila_mount_share_info_var | to_nice_yaml }}"
     dest: /exports/cluster/manila_share_info.yml
+    owner: root
+    group: root
+    mode: u=rw,g=r
   run_once: true
   delegate_to: "{{ groups['control'] | first }}"
   when: os_manila_mount_share_info is defined
30 changes: 30 additions & 0 deletions ansible/roles/rebuild/README.md
@@ -0,0 +1,30 @@
rebuild
=========

Enables the reboot tool from https://github.com/stackhpc/slurm-openstack-tools.git
to be run from the control node.

Requirements
------------

A clouds.yaml file providing OpenStack credentials.

Role Variables
--------------

- `openhpc_rebuild_clouds`: Path to a clouds.yaml file to copy to the control
  node. Default `~/.config/openstack/clouds.yaml`.

Example Playbook
----------------

```yaml
- hosts: control
  become: yes
  tasks:
    - import_role:
        name: rebuild
```

License
-------

Apache-2.0

2 changes: 2 additions & 0 deletions ansible/roles/rebuild/defaults/main.yml
@@ -0,0 +1,2 @@
---
openhpc_rebuild_clouds: ~/.config/openstack/clouds.yaml
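
As a usage sketch, an environment could override this default in its
`group_vars` so the credentials file is taken from the environment directory —
the path below is illustrative only:

```yaml
# Hypothetical override in an environment's group_vars:
openhpc_rebuild_clouds: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/clouds.yaml"
```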
21 changes: 21 additions & 0 deletions ansible/roles/rebuild/tasks/main.yml
@@ -0,0 +1,21 @@
---

- name: Create /etc/openstack
  file:
    path: /etc/openstack
    state: directory
    owner: slurm
    group: root
    mode: u=rX,g=rwX

- name: Copy out clouds.yaml
  copy:
    src: "{{ openhpc_rebuild_clouds }}"
    dest: /etc/openstack/clouds.yaml
    owner: slurm
    group: root
    mode: u=r,g=rw

- name: Setup slurm tools
  include_role:
    name: slurm_tools
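
Once this role has configured the control node, Slurm can also be asked to
reboot/rebuild nodes directly with `scontrol reboot` rather than via the
`srun --reboot` job used by the adhoc playbook; a hedged sketch follows (node
names and options are examples only):

```yaml
# Sketch: request reboot/rebuild of two nodes once they become idle.
# Node list and nextstate are illustrative.
- name: Reboot compute nodes via scontrol
  ansible.builtin.command:
    cmd: scontrol reboot ASAP nextstate=RESUME compute-[0-1]
  delegate_to: "{{ groups['control'] | first }}"
  become: yes
```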
2 changes: 1 addition & 1 deletion ansible/roles/slurm_stats/README.md
@@ -21,7 +21,7 @@ Example Playbook
 - hosts: compute
   tasks:
     - import_role:
-        name: stackhpc.slurm_openstack_tools.slurm-stats
+        name: slurm_stats


License
29 changes: 0 additions & 29 deletions ansible/roles/slurm_tools/.travis.yml

This file was deleted.

2 changes: 1 addition & 1 deletion ansible/roles/slurm_tools/tasks/main.yml
@@ -27,7 +27,7 @@
   module_defaults:
     ansible.builtin.pip:
       virtualenv: /opt/slurm-tools
-      virtualenv_command: python3 -m venv
+      virtualenv_command: "{{ 'python3.9 -m venv' if ansible_distribution_major_version == '8' else 'python3 -m venv' }}"
       state: latest
   become: true
   become_user: "{{ pytools_user }}"
10 changes: 10 additions & 0 deletions ansible/slurm.yml
@@ -9,6 +9,16 @@
     - include_role:
         name: mysql

+- name: Setup slurm-driven rebuild
+  hosts: rebuild:!builder
+  become: yes
+  tags:
+    - rebuild
+    - openhpc
+  tasks:
+    - import_role:
+        name: rebuild
+
 - name: Setup slurm
   hosts: openhpc
   become: yes
7 changes: 3 additions & 4 deletions environments/.stackhpc/inventory/extra_groups
@@ -1,10 +1,6 @@
 [basic_users:children]
 cluster

-[rebuild:children]
-control
-compute
-
 [etc_hosts:children]
 cluster

@@ -35,3 +31,6 @@ builder
 [sssd:children]
 # Install sssd into fat image
 builder
+
+[rebuild:children]
+control
8 changes: 6 additions & 2 deletions environments/.stackhpc/tofu/SMS.tfvars
@@ -1,4 +1,8 @@
-cluster_net = "stackhpc-ipv4-geneve"
-cluster_subnet = "stackhpc-ipv4-geneve-subnet"
+cluster_networks = [
+  {
+    network = "stackhpc-ipv4-geneve"
+    subnet  = "stackhpc-ipv4-geneve-subnet"
+  }
+]
control_node_flavor = "general.v1.small"
other_node_flavor = "general.v1.small"
2 changes: 1 addition & 1 deletion environments/.stackhpc/tofu/main.tf
@@ -81,7 +81,7 @@ module "cluster" {
       nodes: ["compute-0", "compute-1"]
       flavor: var.other_node_flavor
       compute_init_enable: ["compute", "etc_hosts", "nfs", "basic_users", "eessi"]
-      # ignore_image_changes: true
+      ignore_image_changes: true
     }
     # Example of how to add another partition:
     # extra: {
