Make demo deployments easier without support #211

Draft · wants to merge 4 commits into base: main
151 changes: 47 additions & 104 deletions README.md
@@ -2,145 +2,88 @@

# StackHPC Slurm Appliance

This repository contains playbooks and configuration to define a Slurm-based HPC environment including:
- A Rocky Linux 8 and OpenHPC v2-based Slurm cluster.
- Shared filesystem(s) using NFS (with servers within or external to the cluster).
- Slurm accounting using a MySQL backend.
- A monitoring backend using Prometheus and ElasticSearch.
- Grafana with dashboards for both individual nodes and Slurm jobs.
- Production-ready Slurm defaults for access and memory.
- A Packer-based build pipeline for compute and login node images.
This repository contains [Ansible](https://www.ansible.com/) playbooks and configuration to define a Slurm-based HPC environment including:
- A [Rocky Linux](https://rockylinux.org/) 8.x and [OpenHPC](https://openhpc.community/) v2-based [Slurm](https://slurm.schedmd.com/) cluster with production-ready defaults for access, memory, etc.
- Shared filesystem(s), by default using NFS (optionally over RDMA).
- Slurm accounting using a [MySQL](https://www.mysql.com/) backend.
- Integrated monitoring providing per-job and per-node dashboards, using a [Prometheus](https://prometheus.io/) + [ElasticSearch](https://www.elastic.co/) + [Grafana](https://grafana.com/grafana/) stack.
- A [Packer](https://packer.io/) build pipeline for node images.

The repository is designed to be forked for a specific use-case/HPC site but can contain multiple environments (e.g. development, staging and production). It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs back upstream to us!
This repository is expected to be forked for a specific site and can contain multiple environments (e.g. development, staging and production). It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs to us!

While it is tested on OpenStack it should work on any cloud, except for node rebuild/reimaging features which are currently OpenStack-specific.
Currently, the Slurm Appliance requires an [OpenStack](https://www.openstack.org/) cloud for full functionality, although it can be deployed on other clouds or unmanaged servers.

## Prerequisites
It is recommended to check the following before starting (a quick sanity check is sketched below):
- You have root access on the "ansible deploy host" which will be used to deploy the appliance.
- You can create instances using a Rocky 8 GenericCloud image (or an image based on that).
- SSH keys get correctly injected into instances.
- Instances have access to the internet (note proxies can be set up through the appliance if necessary).
- DNS works (if not, this can be partially worked around but additional configuration will be required).
- Created instances have accurate/synchronised time (for VM instances this is usually provided by the hypervisor; if not, or for bare metal instances, it may be necessary to configure a time service via the appliance).
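As a minimal sketch (assuming the test instance provides `curl` and `timedatectl`, and using arbitrary public hostnames), the last three points can be checked from a shell on a freshly created instance:

```bash
getent hosts github.com                  # DNS resolution works
curl -sI https://github.com | head -n 1  # outbound internet access works
timedatectl | grep -i synchronized       # clock is synchronised
```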
## Quickstart
This section demonstrates creating an Appliance with default configuration on VM instances with no floating IPs. See the full [Configuration](docs/configuration.md) guide for options.

## Installation on deployment host
Prerequisites:
- An OpenStack project with access to a Rocky Linux 8.x GenericCloud image (or an image based on that).
- A network and subnet in the project with routing for internet access.
- A Rocky Linux 8.x instance on that network to be the "deploy host", with root access.
- An SSH keypair in OpenStack, with the private part on the deploy host.
- OpenStack credentials on the deploy host.

These instructions assume the deployment host is running Rocky Linux 8:
Note that most of these can be relaxed with additional configuration; a quick way to check the OpenStack-side prerequisites is sketched below.
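Assuming the `openstack` CLI is installed on the deploy host and credentials (e.g. an `openrc` file) have been sourced, something like the following is a rough check (the `grep` pattern is only an example):

```bash
openstack token issue                 # credentials are valid
openstack image list | grep -i rocky  # the Rocky GenericCloud image is visible
openstack network list                # the network to deploy onto exists
openstack keypair list                # the keypair to inject exists
```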

sudo yum install -y git python38
git clone https://github.com/stackhpc/ansible-slurm-appliance
cd ansible-slurm-appliance
/usr/bin/python3.8 -m venv venv
. venv/bin/activate
pip install -U pip
pip install -r requirements.txt
# Install ansible dependencies ...
ansible-galaxy role install -r requirements.yml -p ansible/roles
ansible-galaxy collection install -r requirements.yml -p ansible/collections # ignore the path warning here
1. Configure a deployment host (assuming Rocky Linux 8.x):

sudo yum install -y git python38
git clone https://github.com/stackhpc/ansible-slurm-appliance # NB: consider forking this if not just a demo
cd ansible-slurm-appliance
. dev/setup-env

## Overview of directory structure
This activates a Python virtualenv containing the required software - to reactivate later use:

- `environments/`: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates.
- `ansible/`: Contains the ansible playbooks to configure the infrastructure.
- `packer/`: Contains automation to use Packer to build compute nodes for an environment - see the README in this directory for further information.
- `dev/`: Contains development tools.
source venv/bin/activate

## Environments
1. Create a new environment for your cluster:

### Overview
cd environments/
cookiecutter skeleton

An environment defines the configuration for a single instantiation of this Slurm appliance. Each environment is a directory in `environments/`, containing:
- Any deployment automation required - e.g. Terraform configuration or HEAT templates.
- An ansible `inventory/` directory.
- An `activate` script which sets environment variables to point to this configuration.
- Optionally, additional playbooks in `/hooks` to run before or after the main tasks.
And follow the prompts for the name and description.

All environments load the inventory from the `common` environment first, with the environment-specific inventory then overriding parts of this as required.
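For illustration only (the environment's `activate` script points ansible at the right configuration for you, and `production` is just an example environment name), the override semantics are the same as passing multiple inventories, where later sources win on conflicts:

```bash
ansible-playbook \
  -i environments/common/inventory \
  -i environments/production/inventory \
  ansible/site.yml
```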
1. Activate the new environment:

### Creating a new environment

This repo contains a `cookiecutter` template which can be used to create a new environment from scratch. Run the [installation on deployment host](#Installation-on-deployment-host) instructions above, then in the repo root run:

. venv/bin/activate
cd environments
cookiecutter skeleton

and follow the prompts to complete the environment name and description.

Alternatively, you could copy an existing environment directory.

Now add deployment automation if required, and then complete the environment-specific inventory as described below.

### Environment-specific inventory structure
source environments/<environment>/activate

The ansible inventory for the environment is in `environments/<environment>/inventory/`. It should generally contain:
- A `hosts` file. This defines the hosts in the appliance. Generally it should be templated out by the deployment automation, so it is also a convenient place to define variables which depend on the deployed hosts such as connection variables, IP addresses, ssh proxy arguments etc. An illustrative example is sketched after this list.
- A `groups` file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a group of the same name, e.g. `openhpc` or `grafana`. The meaning and use of each group is described in comments in `environments/common/inventory/groups`. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's `groups` file. Two template examples are provided in `environments/common/layouts/` demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality.
- Optionally, group variable files in `group_vars/<group_name>/overrides.yml`, where the group names match the functional groups described above. These can be used to override the default configuration for each functionality, which are defined in `environments/common/inventory/group_vars/all/<group_name>.yml` (the use of `all` here is due to ansible's precedence rules).
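A minimal sketch of such a `hosts` file (host names and addresses are hypothetical, and would normally be templated out by the deployment automation):

```ini
# environments/<environment>/inventory/hosts (illustrative only)
[all:vars]
openhpc_cluster_name=mycluster

[control]
mycluster-control ansible_host=10.0.0.10

[login]
mycluster-login-0 ansible_host=10.0.0.11

[compute]
mycluster-compute-0 ansible_host=10.0.0.12
mycluster-compute-1 ansible_host=10.0.0.13
```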
1. Configure Terraform for the target cloud:

Although most of the inventory uses the group convention described above there are a few special cases:
- The `control`, `login` and `compute` groups are special as they need to contain actual hosts rather than child groups, and so should generally be defined in the templated-out `hosts` file.
- The cluster name must be set on all hosts using `openhpc_cluster_name`. Using an `[all:vars]` section in the `hosts` file is usually convenient.
- `environments/common/inventory/group_vars/all/defaults.yml` contains some variables which are not associated with a specific role/feature. These are unlikely to need changing, but if necessary that could be done using a `environments/<environment>/inventory/group_vars/all/overrides.yml` file.
- The `ansible/adhoc/generate-passwords.yml` playbook sets secrets for all hosts in `environments/<environment>/inventory/group_vars/all/secrets.yml`.
- The Packer-based pipeline for building compute images creates a VM in groups `builder` and `compute`, allowing build-specific properties to be set in `environments/common/inventory/group_vars/builder/defaults.yml` or the equivalent inventory-specific path.
- Each Slurm partition must have:
  - An inventory group `<cluster_name>_<partition_name>` defining the hosts it contains - these must be homogeneous w.r.t. CPU and memory.
  - An entry in the `openhpc_slurm_partitions` mapping in `environments/<environment>/inventory/group_vars/openhpc/overrides.yml` - a minimal example is sketched after this list.
See the [openhpc role documentation](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) for more options.
- On an OpenStack cloud, rebuilding/reimaging compute nodes from Slurm can be enabled by defining a `rebuild` group containing the relevant compute hosts (e.g. in the generated `hosts` file).
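A minimal sketch of such a partition definition, assuming a cluster named `mycluster` with an inventory group `mycluster_compute` (see the openhpc role documentation linked above for the full schema and optional keys):

```yaml
# environments/<environment>/inventory/group_vars/openhpc/overrides.yml
openhpc_slurm_partitions:
  - name: compute   # uses hosts from the mycluster_compute inventory group
```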
Modify `environments/<environment>/terraform/terraform.tfvars` following instructions in that file.
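The variable names below are purely hypothetical - use the ones actually documented in your environment's `terraform.tfvars` - but a completed file might look something like:

```hcl
# environments/<environment>/terraform/terraform.tfvars (hypothetical variable names)
cluster_name   = "mycluster"
key_pair       = "my-keypair"
cluster_net    = "my-network"
cluster_subnet = "my-subnet"
```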

## Creating a Slurm appliance
1. Install Terraform following instructions [here](https://learn.hashicorp.com/tutorials/terraform/install-cli).

NB: This section describes generic instructions - check for any environment-specific instructions in `environments/<environment>/README.md` before starting.
1. Initialise Terraform:

1. Activate the environment - this **must be done** before any other commands are run:
cd environments/<environment>/terraform/
terraform init

source environments/<environment>/activate
1. Deploy instances:

2. Deploy instances - see environment-specific instructions.
terraform apply

3. Generate passwords:
1. Generate system passwords:

ansible-playbook ansible/adhoc/generate-passwords.yml

This will output a set of passwords in `environments/<environment>/inventory/group_vars/all/secrets.yml`. It is recommended that these are encrypted and then committed to git using:
This will output a set of passwords in `environments/<environment>/inventory/group_vars/all/secrets.yml`. For production use it is recommended that these are encrypted and then committed to git using:

ansible-vault encrypt inventory/group_vars/all/secrets.yml

See the [Ansible vault documentation](https://docs.ansible.com/ansible/latest/user_guide/vault.html) for more details.

4. Deploy the appliance:
1. Set a password for the demo user `testuser`:

export TEST_USER_PASSWORD='mysupersecretpassword'

1. Deploy the appliance:

ansible-playbook ansible/site.yml

or if you have encrypted secrets use:

ansible-playbook ansible/site.yml --ask-vault-password

Tags defined in the various sub-playbooks in `ansible/` may be used to run only part of the `site` tasks.
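For example (the tag names themselves are defined by the sub-playbooks, so list them first):

```bash
ansible-playbook ansible/site.yml --list-tags   # show available tags without running anything
ansible-playbook ansible/site.yml --tags <tag>  # run only the tagged plays/tasks
```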

5. "Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using:

ansible-playbook ansible/adhoc/<playbook name>

Currently they include the following (see each playbook for links to documentation):
- `hpctests.yml`: MPI-based cluster tests for latency, bandwidth and floating point performance.
- `rebuild.yml`: Rebuild nodes with existing or new images (NB: this is intended for development, not for reimaging nodes on an in-production cluster - see `ansible/roles/rebuild` for that).
- `restart-slurm.yml`: Restart all Slurm daemons in the correct order.
- `update-packages.yml`: Update specified packages on cluster nodes.

## Adding new functionality
Please contact us for specific advice, but in outline this generally involves:
- Adding a role.
- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml` (a minimal sketch of such a play follows this list).
- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`.
- Adding new default group vars into `environments/common/inventory/group_vars/all/<rolename>/`.
- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`.
- Updating READMEs.
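As a minimal sketch of the play step (the `myfeature` role/group name is hypothetical, not something shipped with the appliance):

```yaml
# ansible/myfeature.yml (illustrative only)
- hosts: myfeature
  become: yes
  tags: myfeature
  tasks:
    - import_role:
        name: myfeature
```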

## Monitoring and logging

Please see the [monitoring-and-logging.README.md](docs/monitoring-and-logging.README.md) for details.
TODO UPDATE THIS:
You can now ssh into your cluster as user `rocky` - IP addresses will be listed in `environments/<environment>/inventory/hosts`. Note this cluster has an NFS-shared `/home` but the `rocky` user's home is `/var/lib/rocky`.
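For example (IP address and user as described above; `sinfo` and `srun` assume the Slurm cluster is up):

```bash
ssh rocky@<login node IP>    # IP from environments/<environment>/inventory/hosts
sinfo                        # show partitions and node states
srun -N 1 hostname           # run a trivial single-node job
```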
2 changes: 1 addition & 1 deletion ansible/roles/openondemand/README.md
@@ -40,7 +40,7 @@ The OIDC provider should be configured to redirect to `https://{{ openondemand_s


#### Basic/PAM authentication
This option uses HTTP Basic Authentication (i.e. browser prompt) to get a username and password. This is then checked against an existing local user using PAM. Note that HTTPS is configured by default, so the password is protected in transit, although there are [other](https://security.stackexchange.com/a/990) security concerns with Basic Authentication.
This option uses HTTP Basic Authentication (i.e. browser prompt) to get a username and password. This is then checked against an existing local user using PAM. Local users could be defined using e.g. the `basic_users` role. Note that HTTPS is configured by default, so the password is protected in transit, although there are [other](https://security.stackexchange.com/a/990) security concerns with Basic Authentication.

No other authentication options are required for this method.
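For a quick trial (the username is arbitrary; in practice the appliance's `basic_users` role would normally manage such users across the cluster), a local user that PAM can authenticate could be created on the Open OnDemand host with:

```bash
sudo useradd demo
sudo passwd demo   # set the password that Basic Authentication will check
```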
