Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document all roles for compute_init_enable #554

Merged
merged 2 commits into from
Feb 6, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
127 changes: 103 additions & 24 deletions ansible/roles/compute_init/README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,104 @@
# EXPERIMENTAL: compute-init

Experimental / in-progress functionality to allow compute nodes to rejoin the
cluster after a reboot.

To enable this add compute nodes (or a subset of them into) the `compute_init`
group.

# EXPERIMENTAL: compute_init

Experimental functionality to allow compute nodes to rejoin the cluster after
a reboot without running the `ansible/site.yml` playbook.

To enable this:
1. Add the `compute` group (or a subset) into the `compute_init` group. This is
the default when using cookiecutter to create an environment, via the
"everything" template.
2. Build an image which includes the `compute_init` group. This is the case
for StackHPC-built release images.
3. Enable the required functionalities during boot, by setting the
`compute_init_enable` property for a compute group in the
OpenTofu `compute` variable to a list which includes "compute", plus the
other roles/functionalities required, e.g.:

```terraform
...
compute = {
general = {
nodes = ["general-0", "general-1"]
compute_init_enable = ["compute", ... ] # see below
}
}
...
```

## Supported appliance functionalities

The string "compute" must be present in the `compute_init_enable` flag to enable
this functionality. The table below shows which other appliance functionalities
are currently supported - use the name in the role column to enable these.

| Playbook | Role (or functionality) | Support |
| -------------------------|-------------------------|-----------------|
| hooks/pre.yml | ? | None at present |
| validate.yml | n/a | Not relevant during boot |
| bootstrap.yml | (wait for ansible-init) | Not relevant during boot |
| bootstrap.yml | resolv_conf | Fully supported |
| bootstrap.yml | etc_hosts | Fully supported |
| bootstrap.yml | proxy | None at present |
| bootstrap.yml | (/etc permissions) | None required - use image build |
| bootstrap.yml | (ssh /home fix) | None required - use image build |
| bootstrap.yml | (system users) | None required - use image build |
| bootstrap.yml | systemd | None required - use image build |
| bootstrap.yml | selinux | None required - use image build |
| bootstrap.yml | sshd | None at present |
| bootstrap.yml | dnf_repos | None at present (requirement TBD) |
| bootstrap.yml | squid | Not relevant for compute nodes |
| bootstrap.yml | tuned | None |
| bootstrap.yml | freeipa_server | Not relevant for compute nodes |
| bootstrap.yml | cockpit | None required - use image build |
| bootstrap.yml | firewalld | Not relevant for compute nodes |
| bootstrap.yml | fail2ban | Not relevant for compute nodes |
| bootstrap.yml | podman | Not relevant for compute nodes |
| bootstrap.yml | update | Not relevant during boot |
| bootstrap.yml | reboot | Not relevant for compute nodes |
| bootstrap.yml | ofed | Not relevant during boot |
| bootstrap.yml | ansible_init (install) | Not relevant during boot |
| bootstrap.yml | k3s (install) | Not relevant during boot |
| hooks/post-bootstrap.yml | ? | None at present |
| iam.yml | freeipa_client | None at present [1] |
| iam.yml | freeipa_server | Not relevant for compute nodes |
| iam.yml | sssd | None at present |
| filesystems.yml | block_devices | None required - role deprecated |
| filesystems.yml | nfs | All client functionality |
| filesystems.yml | manila | All functionality |
| filesystems.yml | lustre | None at present |
| extras.yml | basic_users | All functionality [2] |
| extras.yml | eessi | All functionality [3] |
| extras.yml | cuda | None required - use image build [4] |
| extras.yml | persist_hostkeys | Not expected to be required for compute nodes |
| extras.yml | compute_init (export) | Not relevant for compute nodes |
| extras.yml | k9s (install) | Not relevant during boot |
| extras.yml | extra_packages | None at present. Would require dnf_repos |
| slurm.yml | mysql | Not relevant for compute nodes |
| slurm.yml | rebuild | Not relevant for compute nodes |
| slurm.yml | openhpc [5] | All slurmd-related functionality |
| slurm.yml | (set memory limits) | None at present |
| slurm.yml | (block ssh) | None at present |
| portal.yml | (openondemand server) | Not relevant for compute nodes |
| portal.yml | (openondemand vnc desktop) | None required - use image build |
| portal.yml | (openondemand jupyter server) | None required - use image build |
| monitoring.yml | (all monitoring) | None at present [6] |
| disable-repos.yml | dnf_repos | None at present (requirement TBD) |
| hooks/post.yml | ? | None at present |


Notes:
1. FreeIPA client functionality would be better provided using a client fork
which uses pkinit keys rather than OTP to reenrol nodes.
2. Assumes home directory already exists on shared storage.
3. Assumes `cvmfs_config` is the same on control node and all compute nodes
4. If `cuda` role was run during build, the nvidia-persistenced is enabled
and will start during boot.
5. `openhpc` does not need to be added to `compute_init_enable`, this is
automatically enabled by adding `compute`.
5. Only node-exporter tasks are relevant, and will be done via k3s in a future release.


## Approach
This works as follows:
1. During image build, an ansible-init playbook and supporting files
(e.g. templates, filters, etc) are installed.
Expand All @@ -31,21 +124,7 @@ The check in 4b. above is what prevents the compute-init script from trying
to configure the node before the services on the control node are available
(which requires running the site.yml playbook).

The following roles/groups are currently fully functional:
- `resolv_conf`: all functionality
- `etc_hosts`: all functionality
- `nfs`: client functionality only
- `manila`: all functionality
- `basic_users`: all functionality, assumes home directory already exists on
shared storage
- `eessi`: all functionality, assumes `cvmfs_config` is the same on control
node and all compute nodes.
- `openhpc`: all functionality

The above may be enabled by setting the compute_init_enable property on the
tofu compute variable.

# Development/debugging
## Development/debugging

To develop/debug changes to the compute script without actually having to build
a new image:
Expand Down Expand Up @@ -83,7 +162,7 @@ reimage the compute node(s) first as in step 2 and/or add additional metadata
as in step 3.


# Design notes
## Design notes
- Duplicating code in roles into the `compute-init` script is unfortunate, but
does allow developing this functionality without wider changes to the
appliance.
Expand Down