refactor: Update nvidia-gpu-efa pattern (#1966)
iankouls-aws committed Aug 4, 2024
1 parent ff0c8ab commit 6facfa5
Showing 4 changed files with 360 additions and 45 deletions.
2 changes: 2 additions & 0 deletions patterns/nvidia-gpu-efa/.gitignore
efa-info-test.yaml
efa-nccl-test.yaml
181 changes: 136 additions & 45 deletions patterns/nvidia-gpu-efa/README.md
# EKS Cluster w/ NVIDIA GPUs and EFA for Machine Learning

This pattern demonstrates an Amazon EKS cluster with an EFA-enabled nodegroup that uses `p5.48xlarge` instances with NVIDIA H100 GPUs for distributed, multi-node machine learning.

The following components are demonstrated in this pattern:


!!! note

The desired instance type can be specified in [eks.tf](eks.tf#L36).
Values shown below will change based on the instance type selected (e.g., `p5.48xlarge` has 8 GPUs and 32 EFA interfaces).
A list of EFA-enabled instance types is available [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).
If you are using an on-demand capacity reservation (ODCR) for your instance type, uncomment the `capacity_reservation_specification` block in `eks.tf`
and specify a `capacity_reservation_id`. Ensure that the region and availability zone of your ODCR match the ones used in `main.tf`.
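You can confirm that a reservation matches before wiring it in; a quick check with the AWS CLI (the reservation ID below is a placeholder):

```sh
# placeholder reservation ID; substitute your own
aws ec2 describe-capacity-reservations \
  --capacity-reservation-ids cr-0123456789abcdef0 \
  --query 'CapacityReservations[0].[InstanceType,AvailabilityZone,State]'
```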

1. List the nodes and their instance type:

```sh
kubectl get nodes -L node.kubernetes.io/instance-type
```

```text
NAME STATUS ROLES AGE VERSION INSTANCE-TYPE
ip-10-0-1-16.us-east-2.compute.internal Ready <none> 12h v1.29.3-eks-ae9a62a p5.48xlarge
ip-10-0-12-113.us-east-2.compute.internal Ready <none> 14h v1.29.3-eks-ae9a62a m5.large
ip-10-0-12-201.us-east-2.compute.internal Ready <none> 12h v1.29.3-eks-ae9a62a p5.48xlarge
ip-10-0-46-217.us-east-2.compute.internal Ready <none> 14h v1.29.3-eks-ae9a62a m5.large
```

You should see two EFA-enabled (in this example `p5.48xlarge`) nodes in the list.
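You can also confirm the GPU and EFA resources each node advertises. The sketch below uses `kubectl` custom columns and assumes the resource names exposed by the NVIDIA and EFA device plugins deployed in this pattern:

```sh
# allocatable GPU and EFA counts per EFA-enabled node
kubectl get nodes -l node.kubernetes.io/instance-type=p5.48xlarge \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa'
```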

2. Deploy Kubeflow MPI Operator

Kubeflow MPI Operator is required for running MPIJobs on EKS. We will use an MPIJob to test EFA.
To deploy the MPI operator, execute the following:

```sh
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
```

```text
...
clusterrole.rbac.authorization.k8s.io/mpi-operator configured
```
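Before proceeding, you can verify that the operator and its CRD are in place (the namespace and object names below assume the upstream manifest defaults):

```sh
kubectl get deployment -n mpi-operator mpi-operator
kubectl get crd mpijobs.kubeflow.org
```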

3. EFA info test

This test prints a list of available EFA interfaces by using the `/opt/amazon/efa/bin/fi_info` utility.
The script [generate-efa-info-test.sh](generate-efa-info-test.sh) creates an MPIJob manifest file named `efa-info-test.yaml`. It assumes that there are two cluster nodes with 8 GPUs and 32 EFA adapters per node. If you are not using `p5.48xlarge` instances in your cluster, adjust the following settings in the script prior to running it:

- `NUM_WORKERS` - number of nodes you want to run the test on
- `GPU_PER_WORKER` - number of GPUs available on each node
- `EFA_PER_WORKER` - number of EFA interfaces available on each node
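For example, if your EFA-enabled nodes were `g5.8xlarge` (1 GPU and 1 EFA interface each), a quick way to adjust the script is sketched below; the values are illustrative, so verify them for your instance type:

```sh
# g5.8xlarge example: 1 GPU and 1 EFA interface per node (illustrative values)
sed -i 's/GPU_PER_WORKER=8/GPU_PER_WORKER=1/; s/EFA_PER_WORKER=32/EFA_PER_WORKER=1/' generate-efa-info-test.sh
```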
```sh
./generate-efa-info-test.sh
```
To start the test, apply the generated manifest to the cluster:
```sh
kubectl apply -f ./efa-info-test.yaml
```
```text
mpijob.kubeflow.org/efa-info-test created
```
Observe the pods in the current namespace. You should see a launcher pod and worker pods.
It is normal for the launcher pod to restart a few times until the worker pods are fully running.
```sh
watch kubectl get pods
```
```log
NAME READY STATUS RESTARTS AGE
efa-info-test-launcher-wm8pm 0/1 CrashLoopBackOff 1 (16s ago) 19s
efa-info-test-worker-0 1/1 Running 0 19s
efa-info-test-worker-1 1/1 Running 0 19s
```
```log
NAME READY STATUS RESTARTS AGE
efa-info-test-launcher-wm8pm 1/1 Running 2 (18s ago) 21s
efa-info-test-worker-0 1/1 Running 0 21s
efa-info-test-worker-1 1/1 Running 0 21s
```
```log
NAME READY STATUS RESTARTS AGE
efa-info-test-launcher-wm8pm 0/1 Completed 2 5m20s
```
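Instead of watching, you can block until the workers are ready. A minimal sketch, assuming the `training.kubeflow.org/job-role` label applied by recent MPI Operator releases:

```sh
kubectl wait pod -l training.kubeflow.org/job-role=worker --for=condition=Ready --timeout=300s
```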
Once the test launcher pod enters status `Running` or `Completed`,
see the test logs using the command below:
```sh
kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
```
```log
Warning: Permanently added 'efa-info-test-worker-1.efa-info-test.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'efa-info-test-worker-0.efa-info-test.default.svc' (ED25519) to the list of known hosts.
[1,1]<stdout>:provider: efa
[1,1]<stdout>: fabric: efa
[1,1]<stdout>: domain: rdmap79s0-rdm
[1,1]<stdout>: version: 120.10
[1,1]<stdout>: type: FI_EP_RDM
[1,1]<stdout>: protocol: FI_PROTO_EFA
...
[1,0]<stdout>:provider: efa
[1,0]<stdout>: fabric: efa
[1,0]<stdout>: domain: rdmap201s0-rdm
[1,0]<stdout>: version: 120.10
[1,0]<stdout>: type: FI_EP_RDM
[1,0]<stdout>: protocol: FI_PROTO_EFA
```
Finally, remove the job:
```sh
kubectl delete -f ./efa-info-test.yaml
```
4. EFA NCCL test
The EFA NCCL test is used to measure network bandwidth by running the `/opt/nccl-tests/build/all_reduce_perf` utility.
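Under the hood, the launcher invokes the benchmark via `mpirun`; a representative invocation is shown below (the flags are illustrative; the generated manifest sets the actual values):

```sh
# sweep all_reduce message sizes from 8 B to 2 GiB, doubling each step, one GPU per MPI rank
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -g 1
```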
Create an MPIJob manifest by executing the script below:
```sh
./generate-efa-nccl-test.sh
```
This script creates a file named `efa-nccl-test.yaml`. Apply the manifest to start the EFA NCCL test.
```sh
kubectl apply -f ./efa-nccl-test.yaml
```
```text
mpijob.kubeflow.org/efa-nccl-test created
```
Similarly to the EFA info test, a launcher pod and worker pods will be created. The launcher pod will be in `CrashLoopBackOff` until the worker pods enter the `Running` state.
As soon as the launcher pod enters the `Running` state as well, execute the following command to see the test logs:
```sh
kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
```
```text
...
[1,0]<stdout>:# out-of-place in-place
[1,0]<stdout>:# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[1,0]<stdout>: 0 0 float sum -1 0.13 0.00 0.00 0 0.12 0.00 0.00 0
[1,0]<stdout>: 0 0 float sum -1 0.12 0.00 0.00 0 0.12 0.00 0.00 0
[1,0]<stdout>: 4 1 float sum -1 65.43 0.00 0.00 0 65.82 0.00 0.00 0
[1,0]<stdout>: 8 2 float sum -1 64.86 0.00 0.00 0 65.67 0.00 0.00 0
[1,0]<stdout>: 16 4 float sum -1 64.72 0.00 0.00 0 64.83 0.00 0.00 0
[1,0]<stdout>: 32 8 float sum -1 65.47 0.00 0.00 0 65.16 0.00 0.00 0
[1,0]<stdout>: 64 16 float sum -1 65.34 0.00 0.00 0 65.58 0.00 0.00 0
[1,0]<stdout>: 128 32 float sum -1 65.99 0.00 0.00 0 66.28 0.00 0.00 0
[1,0]<stdout>: 256 64 float sum -1 75.81 0.00 0.01 0 66.76 0.00 0.01 0
[1,0]<stdout>: 512 128 float sum -1 69.43 0.01 0.01 0 67.18 0.01 0.01 0
[1,0]<stdout>: 1024 256 float sum -1 82.35 0.01 0.02 0 69.03 0.01 0.03 0
[1,0]<stdout>: 2048 512 float sum -1 72.49 0.03 0.05 0 71.37 0.03 0.05 0
[1,0]<stdout>: 4096 1024 float sum -1 77.47 0.05 0.10 0 77.42 0.05 0.10 0
[1,0]<stdout>: 8192 2048 float sum -1 78.10 0.10 0.20 0 78.01 0.11 0.20 0
[1,0]<stdout>: 16384 4096 float sum -1 93.35 0.18 0.33 0 80.11 0.20 0.38 0
[1,0]<stdout>: 32768 8192 float sum -1 106.6 0.31 0.58 0 96.22 0.34 0.64 0
[1,0]<stdout>: 65536 16384 float sum -1 120.6 0.54 1.02 0 89.06 0.74 1.38 0
[1,0]<stdout>: 131072 32768 float sum -1 93.62 1.40 2.62 0 106.3 1.23 2.31 0
[1,0]<stdout>: 262144 65536 float sum -1 111.5 2.35 4.41 0 111.6 2.35 4.41 0
[1,0]<stdout>: 524288 131072 float sum -1 121.2 4.33 8.11 0 109.9 4.77 8.94 0
[1,0]<stdout>: 1048576 262144 float sum -1 119.7 8.76 16.43 0 118.7 8.83 16.56 0
[1,0]<stdout>: 2097152 524288 float sum -1 143.9 14.58 27.33 0 144.2 14.55 27.28 0
[1,0]<stdout>: 4194304 1048576 float sum -1 163.7 25.62 48.03 0 163.6 25.64 48.08 0
[1,0]<stdout>: 8388608 2097152 float sum -1 195.3 42.95 80.54 0 194.9 43.03 80.69 0
[1,0]<stdout>: 16777216 4194304 float sum -1 278.6 60.22 112.91 0 279.9 59.94 112.38 0
[1,0]<stdout>: 33554432 8388608 float sum -1 459.7 73.00 136.87 0 433.9 77.34 145.01 0
[1,0]<stdout>: 67108864 16777216 float sum -1 587.2 114.29 214.29 0 587.1 114.31 214.34 0
[1,0]<stdout>: 134217728 33554432 float sum -1 926.6 144.85 271.60 0 851.5 157.63 295.55 0
[1,0]<stdout>: 268435456 67108864 float sum -1 1497.8 179.22 336.03 0 1496.0 179.44 336.45 0
[1,0]<stdout>: 536870912 134217728 float sum -1 2558.6 209.83 393.42 0 2560.8 209.65 393.10 0
[1,0]<stdout>: 1073741824 268435456 float sum -1 4553.6 235.80 442.13 0 4553.0 235.83 442.19 0
[1,0]<stdout>: 2147483648 536870912 float sum -1 9062.5 236.96 444.31 0 9060.4 237.02 444.41 0
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth : 79.9352
[1,0]<stdout>:#
```
Columns 9 and 13 in the output table show the out-of-place and in-place bus bandwidth calculated for the data size listed in column 2.
In this case the maximum values are 444.31 and 444.41 GB/s, respectively.
Your actual results may be slightly different. The calculated average bus bandwidth is displayed at the end of the log.
In this test run the average bus bandwidth was 79.9352 GB/s.
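To pull just the message size and the two bus-bandwidth columns out of the log, a small `awk` filter works (the field numbers match the column numbering above):

```sh
kubectl logs $(kubectl get pods | grep launcher | cut -d ' ' -f 1) \
  | awk '$1 ~ /<stdout>:$/ && $2 ~ /^[0-9]+$/ {print $2, $9, $13}'
```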
Lastly, delete the MPIJob:
```sh
kubectl delete -f ./efa-nccl-test.yaml
```
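The generated manifests are ignored by git (see the `.gitignore` above); you can remove them locally as well:

```sh
rm -f efa-info-test.yaml efa-nccl-test.yaml
```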
## Destroy
93 changes: 93 additions & 0 deletions patterns/nvidia-gpu-efa/generate-efa-info-test.sh
#!/bin/bash

# Test configuration; adjust these values to match your instance type
export MPI_JOB_NAME=efa-info-test
export IMAGE_URI=public.ecr.aws/hpc-cloud/nccl-tests:latest
export NUM_WORKERS=2      # number of worker nodes
export GPU_PER_WORKER=8   # GPUs per node
export EFA_PER_WORKER=32  # EFA interfaces per node
export TOTAL_GPUS=$((${NUM_WORKERS}*${GPU_PER_WORKER}))

# Write the MPIJob manifest (overwrite any previously generated copy)
cat <<EOF > efa-info-test.yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: ${MPI_JOB_NAME}
spec:
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 20
  slotsPerWorker: ${GPU_PER_WORKER}
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
          containers:
            - image: ${IMAGE_URI}
              name: ${MPI_JOB_NAME}-launcher
              imagePullPolicy: IfNotPresent
              env:
                - name: LD_LIBRARY_PATH
                  value: "/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib"
                - name: PATH
                  value: "/opt/amazon/efa/bin:/usr/bin"
                - name: XLA_FLAGS
                  value: "--xla_gpu_cuda_data_dir=/usr/local/cuda"
                - name: TF_XLA_FLAGS
                  value: "--tf_xla_cpu_global_jit"
                - name: NCCL_DEBUG
                  value: INFO
              command:
                - /opt/amazon/openmpi/bin/mpirun
                - --allow-run-as-root
                - --tag-output
                - -np
                - "${TOTAL_GPUS}"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - PATH
                - -x
                - LD_LIBRARY_PATH
                - -x
                - XLA_FLAGS
                - -x
                - TF_XLA_FLAGS
                - -x
                - NCCL_DEBUG=INFO
                - --mca
                - pml
                - ^cm
                - --mca
                - pml_rsh_agent=ssh
                - --oversubscribe
                - /opt/amazon/efa/bin/fi_info
                - -p
                - "efa"
                - -t
                - "FI_EP_RDM"
    Worker:
      replicas: ${NUM_WORKERS}
      template:
        spec:
          containers:
            - image: ${IMAGE_URI}
              name: ${MPI_JOB_NAME}-worker
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  nvidia.com/gpu: ${GPU_PER_WORKER}
                  vpc.amazonaws.com/efa: ${EFA_PER_WORKER}
                requests:
                  nvidia.com/gpu: ${GPU_PER_WORKER}
                  vpc.amazonaws.com/efa: ${EFA_PER_WORKER}
EOF
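A quick way to sanity-check the generated manifest before submitting it, assuming your kubectl supports client-side dry-run:

```sh
./generate-efa-info-test.sh
kubectl apply --dry-run=client -f efa-info-test.yaml
```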

