Commit
Support compute node rebuild/reboot via Slurm RebootProgram (#553)
* add rebuild role to appliance and modify groupvars
* improve readability of group_vars
* Define login nodes using an opentofu module (#547)
* define login nodes using tf module
* Apply suggestions from code review
Co-authored-by: Matt Anson <[email protected]>
* tweak README to explain compute groups
* try to clarify login/compute groups
---------
Co-authored-by: Matt Anson <[email protected]>
* Change docs/ references from Terraform to OpenTofu (#544)
* change terraform references to opentofu in docs
* remove wider reference to terraform
* Update environments/README.md
Co-authored-by: Steve Brasier <[email protected]>
* Update environments/common/README.md
Co-authored-by: Steve Brasier <[email protected]>
---------
Co-authored-by: Steve Brasier <[email protected]>
* fix instance_id in compute inventory to be target image, not deployed image
* review all roles for compute_init_enable
* fix permissions to /exports/cluster
* make openhpc_config more greppable
* Set ResumeTimeout and ReturnToService overrides in group_vars
* CI tests for reboot via slurm (without rebuild)
* fpinrocky 8 pythoolsvenv version
* refining comments and task names
* rebuild role readme
---------
Co-authored-by: Steve Brasier <[email protected]>
Co-authored-by: Matt Anson <[email protected]>
Co-authored-by: Steve Brasier <[email protected]>
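
The commit message mentions `ResumeTimeout` and `ReturnToService` overrides in group_vars alongside Slurm's `RebootProgram`. As a rough sketch only, such overrides might look like the fragment below, assuming they are passed through the `openhpc_config` mapping named in the message. The program path and the values are hypothetical placeholders, not the settings made by this commit.

```yaml
# Hypothetical group_vars fragment: the path and values are illustrative
# assumptions, not taken from this commit.
openhpc_config:
  # Program Slurm invokes to reboot a node (placeholder path)
  RebootProgram: /usr/local/bin/example-reboot-tool
  # Seconds a node may take to become available again before it is marked DOWN
  ResumeTimeout: 600
  # 2 = a DOWN node returns to service once it registers with a valid configuration
  ReturnToService: 2
```

A rebuild generally takes longer than a plain reboot, so a larger `ResumeTimeout` than Slurm's default is the kind of override one would expect here.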
1 parent 7c831c7, commit 112aa6e
Showing 20 changed files with 233 additions and 72 deletions.
@@ -80,3 +80,5 @@ roles/*
 !roles/slurm_stats/**
 !roles/pytools/
 !roles/pytools/**
+!roles/rebuild/
+!roles/rebuild/**
@@ -0,0 +1,24 @@
+# Reboot compute nodes via slurm. Nodes will be rebuilt if `image_id` in inventory is different to the currently-provisioned image.
+# Example:
+#   ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
+
+- hosts: login
+  run_once: true
+  become: yes
+  gather_facts: no
+  tasks:
+    - name: Submit a Slurm job to reboot compute nodes
+      ansible.builtin.shell: |
+        set -e
+        srun --reboot -N 2 uptime
+      become_user: root
+      register: slurm_result
+      failed_when: slurm_result.rc != 0
+
+    - name: Fetch Slurm controller logs if reboot fails
+      ansible.builtin.shell: |
+        journalctl -u slurmctld --since "10 minutes ago" | tail -n 50
+      become_user: root
+      register: slurm_logs
+      when: slurm_result.rc != 0
+      delegate_to: "{{ groups['control'] | first }}"
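
As written, a failed `srun` task appears to stop the play for that host, so the log-fetching task would not run in exactly the case it is meant for. One possible alternative is Ansible's `block`/`rescue`. The following is a sketch under that assumption, not the approach taken in this commit; the task names and the final `fail` step are illustrative.

```yaml
- hosts: login
  run_once: true
  become: yes
  gather_facts: no
  tasks:
    - name: Reboot compute nodes via Slurm, collecting logs on failure
      block:
        - name: Submit a Slurm job to reboot compute nodes
          ansible.builtin.command: srun --reboot -N 2 uptime
          register: slurm_result

      rescue:
        # Tasks here run only if a task in the block above failed
        - name: Fetch recent Slurm controller logs
          ansible.builtin.shell: |
            journalctl -u slurmctld --since "10 minutes ago" | tail -n 50
          register: slurm_logs
          delegate_to: "{{ groups['control'] | first }}"

        - name: Show the controller logs
          ansible.builtin.debug:
            var: slurm_logs.stdout_lines

        - name: Re-raise the original failure
          ansible.builtin.fail:
            msg: "srun --reboot failed with rc={{ ansible_failed_result.rc }}"
```

The explicit `fail` at the end keeps the play's exit status non-zero, which matters if the playbook is used in CI.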
@@ -0,0 +1,30 @@
+rebuild
+=========
+
+Enables reboot tool from https://github.com/stackhpc/slurm-openstack-tools.git to be run from control node.
+
+Requirements
+------------
+
+clouds.yaml file
+
+Role Variables
+--------------
+
+- `openhpc_rebuild_clouds`: Directory. Path to clouds.yaml file.
+
+
+Example Playbook
+----------------
+
+    - hosts: control
+      become: yes
+      tasks:
+        - import_role:
+            name: rebuild
+
+License
+-------
+
+Apache-2.0
+
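
The README describes `openhpc_rebuild_clouds` as a directory, while the role default in this commit is a file path under `~/.config/openstack/`. As an illustration of overriding it, the sketch below passes a file path via `vars`; the path shown is a hypothetical example.

```yaml
- hosts: control
  become: yes
  tasks:
    - import_role:
        name: rebuild
      vars:
        # Hypothetical path on the host running ansible-playbook,
        # since the role's copy task reads `src` from there
        openhpc_rebuild_clouds: /path/to/my-project/clouds.yaml
```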
@@ -0,0 +1,2 @@
+---
+openhpc_rebuild_clouds: ~/.config/openstack/clouds.yaml
@@ -0,0 +1,21 @@
+---
+
+- name: Create /etc/openstack
+  file:
+    path: /etc/openstack
+    state: directory
+    owner: slurm
+    group: root
+    mode: u=rX,g=rwX
+
+- name: Copy out clouds.yaml
+  copy:
+    src: "{{ openhpc_rebuild_clouds }}"
+    dest: /etc/openstack/clouds.yaml
+    owner: slurm
+    group: root
+    mode: u=r,g=rw
+
+- name: Setup slurm tools
+  include_role:
+    name: slurm_tools
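
For orientation, the file copied to `/etc/openstack/clouds.yaml` follows the standard OpenStack client configuration format. A minimal sketch using an application credential is shown below; the cloud name, URL, region and credential values are placeholders and not part of this commit.

```yaml
# Placeholder values throughout: substitute those for your own cloud.
clouds:
  openstack:
    auth:
      auth_url: https://keystone.example.com:5000
      application_credential_id: "<credential-id>"
      application_credential_secret: "<credential-secret>"
    region_name: "RegionOne"
    interface: "public"
    identity_api_version: 3
    auth_type: "v3applicationcredential"
```

The tasks above install the file with owner `slurm` and mode `u=r,g=rw`, so it is readable by the `slurm` user and writable only by the `root` group.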
This file was deleted.
@@ -1,4 +1,8 @@
-cluster_net = "stackhpc-ipv4-geneve"
-cluster_subnet = "stackhpc-ipv4-geneve-subnet"
+cluster_networks = [
+    {
+        network = "stackhpc-ipv4-geneve"
+        subnet = "stackhpc-ipv4-geneve-subnet"
+    }
+]
 control_node_flavor = "general.v1.small"
 other_node_flavor = "general.v1.small"