Skip to content

Commit

Permalink
Merge pull request #554 from stackhpc/docs/compute-init-roles-v2
Browse files Browse the repository at this point in the history
Document all roles for compute_init_enable
  • Loading branch information
bertiethorpe authored Feb 6, 2025
2 parents 2a0f3b7 + c392db0 commit b6ba905
Showing 1 changed file with 103 additions and 24 deletions.
127 changes: 103 additions & 24 deletions ansible/roles/compute_init/README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,104 @@
# EXPERIMENTAL: compute-init

Experimental / in-progress functionality to allow compute nodes to rejoin the
cluster after a reboot.

To enable this add compute nodes (or a subset of them into) the `compute_init`
group.

# EXPERIMENTAL: compute_init

Experimental functionality to allow compute nodes to rejoin the cluster after
a reboot without running the `ansible/site.yml` playbook.

To enable this:
1. Add the `compute` group (or a subset) into the `compute_init` group. This is
the default when using cookiecutter to create an environment, via the
"everything" template.
2. Build an image which includes the `compute_init` group. This is the case
for StackHPC-built release images.
3. Enable the required functionalities during boot, by setting the
`compute_init_enable` property for a compute group in the
OpenTofu `compute` variable to a list which includes "compute", plus the
other roles/functionalities required, e.g.:

```terraform
...
compute = {
general = {
nodes = ["general-0", "general-1"]
compute_init_enable = ["compute", ... ] # see below
}
}
...
```

## Supported appliance functionalities

The string "compute" must be present in the `compute_init_enable` flag to enable
this functionality. The table below shows which other appliance functionalities
are currently supported - use the name in the role column to enable these.

| Playbook | Role (or functionality) | Support |
| -------------------------|-------------------------|-----------------|
| hooks/pre.yml | ? | None at present |
| validate.yml | n/a | Not relevant during boot |
| bootstrap.yml | (wait for ansible-init) | Not relevant during boot |
| bootstrap.yml | resolv_conf | Fully supported |
| bootstrap.yml | etc_hosts | Fully supported |
| bootstrap.yml | proxy | None at present |
| bootstrap.yml | (/etc permissions) | None required - use image build |
| bootstrap.yml | (ssh /home fix) | None required - use image build |
| bootstrap.yml | (system users) | None required - use image build |
| bootstrap.yml | systemd | None required - use image build |
| bootstrap.yml | selinux | None required - use image build |
| bootstrap.yml | sshd | None at present |
| bootstrap.yml | dnf_repos | None at present (requirement TBD) |
| bootstrap.yml | squid | Not relevant for compute nodes |
| bootstrap.yml | tuned | None |
| bootstrap.yml | freeipa_server | Not relevant for compute nodes |
| bootstrap.yml | cockpit | None required - use image build |
| bootstrap.yml | firewalld | Not relevant for compute nodes |
| bootstrap.yml | fail2ban | Not relevant for compute nodes |
| bootstrap.yml | podman | Not relevant for compute nodes |
| bootstrap.yml | update | Not relevant during boot |
| bootstrap.yml | reboot | Not relevant for compute nodes |
| bootstrap.yml | ofed | Not relevant during boot |
| bootstrap.yml | ansible_init (install) | Not relevant during boot |
| bootstrap.yml | k3s (install) | Not relevant during boot |
| hooks/post-bootstrap.yml | ? | None at present |
| iam.yml | freeipa_client | None at present [1] |
| iam.yml | freeipa_server | Not relevant for compute nodes |
| iam.yml | sssd | None at present |
| filesystems.yml | block_devices | None required - role deprecated |
| filesystems.yml | nfs | All client functionality |
| filesystems.yml | manila | All functionality |
| filesystems.yml | lustre | None at present |
| extras.yml | basic_users | All functionality [2] |
| extras.yml | eessi | All functionality [3] |
| extras.yml | cuda | None required - use image build [4] |
| extras.yml | persist_hostkeys | Not expected to be required for compute nodes |
| extras.yml | compute_init (export) | Not relevant for compute nodes |
| extras.yml | k9s (install) | Not relevant during boot |
| extras.yml | extra_packages | None at present. Would require dnf_repos |
| slurm.yml | mysql | Not relevant for compute nodes |
| slurm.yml | rebuild | Not relevant for compute nodes |
| slurm.yml | openhpc [5] | All slurmd-related functionality |
| slurm.yml | (set memory limits) | None at present |
| slurm.yml | (block ssh) | None at present |
| portal.yml | (openondemand server) | Not relevant for compute nodes |
| portal.yml | (openondemand vnc desktop) | None required - use image build |
| portal.yml | (openondemand jupyter server) | None required - use image build |
| monitoring.yml | (all monitoring) | None at present [6] |
| disable-repos.yml | dnf_repos | None at present (requirement TBD) |
| hooks/post.yml | ? | None at present |


Notes:
1. FreeIPA client functionality would be better provided using a client fork
which uses pkinit keys rather than OTP to reenrol nodes.
2. Assumes home directory already exists on shared storage.
3. Assumes `cvmfs_config` is the same on control node and all compute nodes
4. If `cuda` role was run during build, the nvidia-persistenced is enabled
and will start during boot.
5. `openhpc` does not need to be added to `compute_init_enable`, this is
automatically enabled by adding `compute`.
5. Only node-exporter tasks are relevant, and will be done via k3s in a future release.


## Approach
This works as follows:
1. During image build, an ansible-init playbook and supporting files
(e.g. templates, filters, etc) are installed.
Expand All @@ -31,21 +124,7 @@ The check in 4b. above is what prevents the compute-init script from trying
to configure the node before the services on the control node are available
(which requires running the site.yml playbook).

The following roles/groups are currently fully functional:
- `resolv_conf`: all functionality
- `etc_hosts`: all functionality
- `nfs`: client functionality only
- `manila`: all functionality
- `basic_users`: all functionality, assumes home directory already exists on
shared storage
- `eessi`: all functionality, assumes `cvmfs_config` is the same on control
node and all compute nodes.
- `openhpc`: all functionality

The above may be enabled by setting the compute_init_enable property on the
tofu compute variable.

# Development/debugging
## Development/debugging

To develop/debug changes to the compute script without actually having to build
a new image:
Expand Down Expand Up @@ -83,7 +162,7 @@ reimage the compute node(s) first as in step 2 and/or add additional metadata
as in step 3.


# Design notes
## Design notes
- Duplicating code in roles into the `compute-init` script is unfortunate, but
does allow developing this functionality without wider changes to the
appliance.
Expand Down

0 comments on commit b6ba905

Please sign in to comment.