Merge pull request #275 from stackhpc/feat/partition-ucx-dev
Allow defining UCX device per partition for hpctests
sjpb authored May 12, 2023
2 parents 6b702e1 + f26cfe2 commit 999cfc8
Showing 3 changed files with 8 additions and 3 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/stackhpc.yml
@@ -66,13 +66,13 @@ jobs:
cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
terraform apply -auto-approve
- - name: Delete infrastructure if failed due to lack of hosts
+ - name: Delete infrastructure if provisioning failed
run: |
. venv/bin/activate
. environments/.stackhpc/activate
cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
terraform destroy -auto-approve
- if: ${{ steps.provision_servers.outcome == 'failure' }}
+ if: failure() && steps.provision_servers.outcome == 'failure'

- name: Configure cluster
run: |
2 changes: 1 addition & 1 deletion ansible/roles/hpctests/README.md
@@ -26,7 +26,7 @@ Role Variables
- `hpctests_rootdir`: Required. Path to root of test directory tree, which must be on a r/w filesystem shared to all cluster nodes under test. The last directory component will be created.
- `hpctests_partition`: Optional. Name of partition to use, otherwise default partition is used.
- `hpctests_nodes`: Optional. A Slurm node expression, e.g. `'compute-[0-15,19]'` defining the nodes to use. If not set all nodes in the selected partition are used.
- - `hpctests_ucx_net_devices`: Optional. Control which network device/interface to use, e.g. `mlx5_1:0`. The default of `all` (as per UCX) may not be appropriate for multi-rail nodes with different bandwidths on each device. See [here](https://openucx.readthedocs.io/en/master/faq.html#what-is-the-default-behavior-in-a-multi-rail-environment) and [here](https://github.com/openucx/ucx/wiki/UCX-environment-parameters#setting-the-devices-to-use).
+ - `hpctests_ucx_net_devices`: Optional. Control which network device/interface to use, e.g. `mlx5_1:0`. The default of `all` (as per UCX) may not be appropriate for multi-rail nodes with different bandwidths on each device. See [here](https://openucx.readthedocs.io/en/master/faq.html#what-is-the-default-behavior-in-a-multi-rail-environment) and [here](https://github.com/openucx/ucx/wiki/UCX-environment-parameters#setting-the-devices-to-use). Alternatively a mapping of partition name (as `hpctests_partition`) to device/interface can be used. For partitions not defined in the mapping the default of `all` is used.
- `hpctests_outdir`: Optional. Directory to use for test output on local host. Defaults to `$HOME/hpctests` (for local user).
- `hpctests_hpl_NB`: Optional, default 192. The HPL block size "NB" - for Intel CPUs see [here](https://software.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/intel-oneapi-math-kernel-library-benchmarks/intel-distribution-for-linpack-benchmark/configuring-parameters.html).
- `hpctests_hpl_mem_frac`: Optional, default 0.8. The HPL problem size "N" will be selected to target using this fraction of each node's memory.
5 changes: 5 additions & 0 deletions ansible/roles/hpctests/tasks/setup.yml
@@ -28,3 +28,8 @@
owner: "{{ ansible_user }}"
group: "{{ ansible_user }}"
become: true

+ - name: Set fact for UCX_NET_DEVICES
+   set_fact:
+     hpctests_ucx_net_devices: "{{ hpctests_ucx_net_devices.get(hpctests_partition, 'all') }}"
+   when: hpctests_ucx_net_devices is mapping
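
The `set_fact` task above collapses `hpctests_ucx_net_devices` to a single value when it is given as a mapping. The lookup can be sketched in Python as follows. This is an illustration only: the function name `resolve_ucx_net_devices` and the partition and device names in the usage lines are hypothetical, and the role itself performs the lookup in Jinja2, not Python.

```python
from collections.abc import Mapping


def resolve_ucx_net_devices(value, partition):
    """Mimic the set_fact task (hypothetical helper, not part of the role)."""
    if isinstance(value, Mapping):
        # Same lookup as `.get(hpctests_partition, 'all')` in the task:
        # partitions missing from the mapping fall back to 'all'.
        return value.get(partition, "all")
    # A plain string such as 'mlx5_1:0' or 'all' is left unchanged,
    # because the task is skipped when the variable is not a mapping.
    return value


# Hypothetical partition and device names, for illustration.
devices = {"hpc": "mlx5_1:0", "gpu": "mlx5_0:1"}
print(resolve_ucx_net_devices(devices, "hpc"))
print(resolve_ucx_net_devices(devices, "debug"))
print(resolve_ucx_net_devices("mlx5_1:0", "debug"))
```

Because the fallback to `all` happens inside the lookup, a mapping only needs entries for partitions that require a specific device.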
