
Commit 7947368
default to unlock action and simplify things
1 parent c0f29f3

8 files changed: +53 −50 lines


.github/workflows/stackhpc.yml

Lines changed: 2 additions & 2 deletions
@@ -166,7 +166,7 @@ jobs:
       run: |
         . venv/bin/activate
         . environments/.stackhpc/activate
-        ansible-playbook --limit login,control ansible/adhoc/lock-unlock-instances.yml -e "lock_unlock_action=unlock"
+        ansible-playbook ansible/adhoc/unlock.yml
         cd "$STACKHPC_TF_DIR"
         tofu init
         tofu apply -auto-approve -var-file="${{ env.CI_CLOUD }}.tfvars"
@@ -257,7 +257,7 @@
       run: |
         . venv/bin/activate
         . environments/.stackhpc/activate
-        ansible-playbook ansible/adhoc/lock-unlock-instances.yml -e "lock_unlock_action=unlock"
+        ansible-playbook ansible/adhoc/unlock.yml
         cd "$STACKHPC_TF_DIR"
         tofu destroy -auto-approve -var-file="${{ env.CI_CLOUD }}.tfvars" || echo "tofu failed in $STACKHPC_TF_DIR"
       if: ${{ success() || cancelled() }}

ansible/adhoc/lock-unlock-instances.yml

Lines changed: 0 additions & 27 deletions
This file was deleted.

ansible/adhoc/rebuild-via-slurm.yml

Lines changed: 6 additions & 5 deletions
@@ -8,17 +8,18 @@

 # See docs/slurm-controlled-rebuild.md.

-- name: Unlock compute instances for rebuild
+- name: Unlock compute instances
   vars:
-    lock_unlock_action: unlock
-    lock_unlock_hosts: compute
-  ansible.builtin.import_playbook: lock-unlock-instances.yml
+    unlock_hosts: compute
+  ansible.builtin.import_playbook: unlock.yml

 - hosts: login
   run_once: true
   gather_facts: false
   tasks:
-    - name: Run slurm-controlled rebuild
+    - name: Start slurm-controlled rebuild
       ansible.builtin.import_role:
         name: rebuild
         tasks_from: rebuild.yml
+
+# TODO: how do we lock the compute nodes again??

ansible/adhoc/rebuild.yml

Lines changed: 6 additions & 6 deletions
@@ -1,19 +1,19 @@
 ---
 # Rebuild hosts with a specified image from OpenStack.
 #
+# NB: This is provided for development use only, and is not normally required
+# NB: Run the unlock playbook as shown below first
+#
 # Use ansible's -v output to see output.
 # Use --limit to control which hosts to rebuild (either specific hosts or the <cluster_name>_<partition_name> groups defining partitions).
 # Optionally, supply `-e rebuild_image=<image_name_or_id>` to define a specific image, otherwise the current image is reused.
 #
-# After running site.yml, all instances are locked, so to run the rebuild.yml, the unlock playbook must be run:
-#   ansible-playbook ansible/adhoc/lock-unlock-instances.yml -e "lock_unlock_action=unlock"
-# Similarly to rebuild, --limit can be used to control which hosts to unlock.
-#
 # NOTE: If a hostvar `instance_id` is defined this is used to select hosts.
 # Otherwise the hostname is used and this must be unique, which may not be the case e.g. if using identically-named staging and production hosts.
 #
-# Example:
-#   ansible-playbook -v --limit ohpc_compute ansible/adhoc/rebuild.yml -e rebuild_image=openhpc_v2.3
+# Example of just rebuilding login nodes back to current image:
+#   ansible-playbook -v --limit login ansible/adhoc/unlock.yml
+#   ansible-playbook -v --limit login ansible/adhoc/rebuild.yml

 - hosts: cluster
   become: false

ansible/adhoc/unlock.yml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+---
+# Lock or unlock cluster instances
+
+# Variable `unlock_action` may be 'unlock' (default) or 'lock'
+
+# A manual run is required before any action which will modify instances:
+# - ansible-playbook ansible/adhoc/rebuild.yml
+# - tofu apply
+# - tofu destroy
+# e.g.:
+#   ansible-playbook ansible/adhoc/unlock.yml
+#
+# Instances are automatically locked by site.yml and unlocked by ansible/adhoc/rebuild-via-slurm.yml
+
+- hosts: "{{ unlock_hosts | default('cluster') }}"
+  gather_facts: false
+  become: false
+  tasks:
+    - name: "{{ unlock_action | default('unlock') | capitalize }} instances"
+      openstack.cloud.server_action:
+        action: "{{ unlock_action | default('unlock') }}"
+        server: "{{ instance_id }}"
+      delegate_to: localhost
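The defaulting behaviour of the new playbook can be illustrated with a small Python sketch (a hypothetical helper, not part of the repo): both the task name and the OpenStack server action are derived from `unlock_action`, falling back to `unlock` when it is unset, mirroring the Jinja2 `default('unlock')` and `capitalize` filters in the play.

```python
# Hypothetical sketch of how unlock.yml resolves its variables; the function
# name and structure are illustrative only.
def resolve_action(unlock_action=None):
    # Mirrors `unlock_action | default('unlock')`
    action = unlock_action if unlock_action is not None else "unlock"
    # Mirrors `... | capitalize` in the task name
    task_name = f"{action.capitalize()} instances"
    return action, task_name

# Default invocation unlocks; passing -e unlock_action=lock locks instead.
print(resolve_action())        # ('unlock', 'Unlock instances')
print(resolve_action("lock"))  # ('lock', 'Lock instances')
```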

ansible/site.yml

Lines changed: 3 additions & 2 deletions
@@ -2,8 +2,9 @@

 - ansible.builtin.import_playbook: check-production.yml

-- name: Lock cluster instances
-  ansible.builtin.import_playbook: adhoc/lock-unlock-instances.yml
+- ansible.builtin.import_playbook: adhoc/unlock.yml
+  vars:
+    unlock_action: lock

 - name: Run pre.yml hook
   vars:

docs/experimental/slurm-controlled-rebuild.md

Lines changed: 11 additions & 6 deletions
@@ -12,8 +12,8 @@ In summary, the way this functionality works is as follows:

 1. The image reference(s) are manually updated in the OpenTofu configuration
    in the normal way.
-2. The adhoc playbook `lock-unlock-instances.yml` is run limited to control and login
-   nodes, with `lock_unlock_action=unlock` to allow the nodes to be rebuilt.
+2. The adhoc playbook `unlock.yml` is run to allow the login and control node
+   instances to be modified.
 3. `tofu apply` is run which rebuilds the login and control nodes to the new
    image(s). The new image reference for compute nodes is ignored, but is
    written into the hosts inventory file (and is therefore available as an
@@ -27,10 +27,11 @@ In summary, the way this functionality works is as follows:
    - Configures an application credential and helper programs on the control
      node, using the [rebuild](../../ansible/roles/rebuild/README.md) role.
 5. An admin submits Slurm jobs, one for each node, to a special "rebuild"
-   partition using the adhoc playbook `rebuild-via-slurm.yml`. Because this partition
-   has higher priority than the partitions normal users can use, these rebuild jobs
-   become the next job in the queue for every node (although any jobs currently running
-   will complete as normal).
+   partition using the adhoc playbook `rebuild-via-slurm.yml`, which also unlocks
+   the compute instances. Because this partition has higher priority than the
+   partitions normal users can use, these rebuild jobs become the next job in
+   the queue for every node (although any jobs currently running will complete
+   as normal).
 6. Because these rebuild jobs have the `--reboot` flag set, before launching them
    the Slurm control node runs a [RebootProgram](https://slurm.schedmd.com/slurm.conf.html#OPT_RebootProgram)
    which compares the current image for the node to the one in the cluster
@@ -45,6 +46,10 @@ In summary, the way this functionality works is as follows:
    registers the node as having finished rebooting. It then launches the actual
    job, which does not do anything.

+TODO: need to relock the instances afterwards!
+TODO: mention the playbook to check rebuild state?
+TODO: maybe we should default to only unlocking the compute/login?
+
 ## Prerequisites

 To enable a compute node to rejoin the cluster after a rebuild, functionality
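The RebootProgram decision described in step 6 of the document above can be sketched as follows (a simplified illustration under stated assumptions, not the actual script; function and argument names are hypothetical):

```python
# Simplified sketch of the RebootProgram check described in the docs: a node
# is rebuilt only when its current image differs from the image recorded in
# the cluster configuration. The real program also has to look up both values
# from OpenStack and the inventory, which is omitted here.
def needs_rebuild(current_image: str, target_image: str) -> bool:
    return current_image != target_image

# A node already running the target image is left alone; otherwise the
# rebuild is triggered before the Slurm job launches.
print(needs_rebuild("openhpc-v1", "openhpc-v2"))  # True
print(needs_rebuild("openhpc-v2", "openhpc-v2"))  # False
```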

docs/operations.md

Lines changed: 2 additions & 2 deletions
@@ -212,9 +212,9 @@ ansible-playbook ansible/adhoc/$PLAYBOOK
 Currently they include the following (see each playbook for links to documentation):

 - `hpctests.yml`: MPI-based cluster tests for latency, bandwidth and floating point performance.
-- `lock-unlock-instances.yml`: Lock cluster instances for preventing tofu changes, or unlock to allow changes.
+- `unlock.yml`: Unlock or lock nodes to allow/prevent changes to instances.
 - `rebuild.yml`: Rebuild nodes with existing or new images (NB: this is intended for development not for re-imaging nodes on an in-production cluster).
-  Requires `lock-unlock-instances.yml` be run first.
+  Requires `unlock.yml` be run first.
 - `restart-slurm.yml`: Restart all Slurm daemons in the correct order.
 - `update-packages.yml`: Update specified packages on cluster nodes (NB: not recommended for routine use).
