refactor: Update EFA pattern name for discoverability; add info on whats provided and render code of significance in doc site (#1939)
bryantbiggs authored May 3, 2024
1 parent 5793945 commit 9510cec
Showing 11 changed files with 258 additions and 279 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -19,7 +19,7 @@ repos:
- id: detect-aws-credentials
args: [--allow-missing-credentials]
- repo: https://github.com/antonbabenko/pre-commit-terraform
rev: v1.89.0
rev: v1.89.1
hooks:
- id: terraform_fmt
- id: terraform_docs
1 change: 1 addition & 0 deletions docs/cSpell_dict.txt
@@ -52,6 +52,7 @@ crds
curlimages
cwlogs
daemonset
datasource
dcgm
distro
ecrpublic
7 changes: 0 additions & 7 deletions docs/patterns/elastic-fabric-adapter.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/patterns/nvidia-gpu-efa.md
@@ -0,0 +1,7 @@
---
title: NVIDIA GPUs with EFA
---

{%
include-markdown "../../patterns/nvidia-gpu-efa/README.md"
%}
244 changes: 0 additions & 244 deletions patterns/elastic-fabric-adapter/main.tf

This file was deleted.

4 changes: 0 additions & 4 deletions patterns/elastic-fabric-adapter/outputs.tf

This file was deleted.

Empty file.
21 changes: 0 additions & 21 deletions patterns/elastic-fabric-adapter/versions.tf

This file was deleted.

@@ -1,6 +1,25 @@
# EKS Cluster w/ Elastic Fabric Adapter
# EKS Cluster w/ NVIDIA GPUs and EFA for Machine Learning

This pattern demonstrates an Amazon EKS Cluster with an EFA-enabled nodegroup.
This pattern demonstrates an Amazon EKS Cluster with an EFA-enabled nodegroup of `p5.48xlarge` instances, whose NVIDIA H100 GPUs are intended for distributed, multi-node machine learning workloads.

The following components are demonstrated in this pattern:

- A "default" node group that supports addons and components that require neither GPUs nor EFA devices. Any pods that do not tolerate the taints of the GPU node group will be scheduled on instances within this node group.
- A node group of `p5.48xlarge` instances with:
    - all 32 [EFA network interfaces](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) enabled
    - a placement group, so that the instances are provisioned close to one another in a single availability zone that supports the instance type
    - a common NVIDIA taint of `"nvidia.com/gpu:NoSchedule"` to ensure only the intended applications are allowed to run on the nodes created
    - two labels identifying that this nodegroup supports NVIDIA GPUs and EFA devices, which pods can target with node selectors
    - the NVMe instance store volumes mounted in a RAID-0 array to provide a single, large, high-performance storage volume for the GPU workloads
    - kubelet and containerd configured to use the RAID-0 volume, allowing kubelet to discover the additional storage as ephemeral storage that pods can use
- A Helm chart deployment for the [NVIDIA device plugin](https://github.com/NVIDIA/k8s-device-plugin) to expose and mount the GPUs provided by the instances to the pods that request them
- A Helm chart deployment for the EFA device plugin to expose and mount the EFA network interfaces provided by the instances to the pods that request them. Since the EFA network interfaces are only found on the instances that provide NVIDIA GPUs in this pattern, we do not apply an additional taint for the EFA network interfaces to avoid over-constraining.
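
The node group settings listed above can be sketched roughly as follows. This is an illustrative sketch, not the contents of the pattern's `eks.tf`: it assumes the `terraform-aws-modules/eks/aws` module at v20.x with its `enable_efa_support` input, and the `/bin/setup-local-disks` helper shipped in the EKS AL2 AMIs. The variable names, cluster name, version pins, node group sizes, default instance type, and both label keys are assumptions made for the example.

```terraform
# Sketch only. Assumes terraform-aws-modules/eks/aws v20.x and its
# `enable_efa_support` input. Variables, names, sizes, and label keys
# below are hypothetical and not taken from the pattern itself.

variable "vpc_id" {
  description = "ID of an existing VPC (hypothetical input)"
  type        = string
}

variable "private_subnet_ids" {
  description = "Private subnet IDs for the cluster and nodes (hypothetical input)"
  type        = list(string)
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.9" # assumed pin

  cluster_name    = "nvidia-gpu-efa" # hypothetical name
  cluster_version = "1.29"           # assumed version

  # Cluster level: add the node security group rules that EFA traffic needs
  enable_efa_support = true

  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  eks_managed_node_groups = {
    # Hosts addons and any pod that does not tolerate the GPU taint
    default = {
      instance_types = ["m5.large"] # assumed instance type
      min_size       = 1
      max_size       = 2
      desired_size   = 2
    }

    nvidia-efa = {
      # GPU-enabled EKS AL2 AMI
      ami_type       = "AL2_x86_64_GPU"
      instance_types = ["p5.48xlarge"]

      min_size     = 2
      max_size     = 2
      desired_size = 2

      # Node group level: creates a placement group and exposes the
      # EFA interfaces on the launch template
      enable_efa_support = true

      # Mount the NVMe instance store volumes as RAID-0 and point
      # kubelet and containerd at the resulting volume
      pre_bootstrap_user_data = <<-EOT
        /bin/setup-local-disks raid0
      EOT

      # Label keys are assumptions; pods select on them via nodeSelector
      labels = {
        "nvidia.com/gpu.present"        = "true"
        "vpc.amazonaws.com/efa.present" = "true"
      }

      # Keeps pods without a matching toleration off the GPU nodes
      taints = {
        gpu = {
          key    = "nvidia.com/gpu"
          effect = "NO_SCHEDULE"
        }
      }
    }
  }
}
```

Setting the labels and taint through the module's `labels` and `taints` inputs is one option; passing them to kubelet through bootstrap arguments in user data is another. Which one the pattern uses should be checked against the rendered `eks.tf`.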

## Code

```terraform hl_lines="23-25 31-68"
{% include "../../patterns/nvidia-gpu-efa/eks.tf" %}
```
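
The two device plugin deployments described in the component list are not part of the cluster definition itself. A minimal sketch of how they might be declared with the Terraform Helm provider is shown here; it assumes a `helm` provider already configured against the cluster, the publicly hosted chart repositories for each plugin, and an EFA node label key of `vpc.amazonaws.com/efa.present`. The namespaces, the omitted chart versions, and the chart values are assumptions for illustration rather than the pattern's actual settings.

```terraform
# Sketch only. Assumes a configured `helm` provider. Namespaces, values,
# and the EFA label key are hypothetical; pin chart versions in real use.

resource "helm_release" "nvidia_device_plugin" {
  name             = "nvidia-device-plugin"
  repository       = "https://nvidia.github.io/k8s-device-plugin"
  chart            = "nvidia-device-plugin"
  namespace        = "nvidia-device-plugin" # assumed namespace
  create_namespace = true
}

resource "helm_release" "aws_efa_device_plugin" {
  name       = "aws-efa-k8s-device-plugin"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-efa-k8s-device-plugin"
  namespace  = "kube-system"

  # Run only on EFA-labeled nodes and tolerate the GPU taint, since the
  # EFA interfaces exist only on the GPU instances in this pattern
  values = [
    yamlencode({
      nodeSelector = {
        "vpc.amazonaws.com/efa.present" = "true" # assumed label key
      }
      tolerations = [
        {
          key      = "nvidia.com/gpu"
          operator = "Exists"
          effect   = "NoSchedule"
        }
      ]
    })
  ]
}
```

Pods would then request the devices through resource limits such as `nvidia.com/gpu` and `vpc.amazonaws.com/efa`, alongside a toleration for the GPU taint.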

## Deploy
