refactor: Update EFA pattern name for discoverability; add info on whats provided and render code of significance in doc site (#1939)
aws-ia · May 3, 2024 · 9510cec
1 parent 5793945
commit 9510cec
Showing 11 changed files with 258 additions and 279 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -19,7 +19,7 @@ repos:
       - id: detect-aws-credentials
         args: [--allow-missing-credentials]
   - repo: https://github.com/antonbabenko/pre-commit-terraform
-    rev: v1.89.0
+    rev: v1.89.1
     hooks:
       - id: terraform_fmt
       - id: terraform_docs

diff --git a/docs/cSpell_dict.txt b/docs/cSpell_dict.txt
@@ -52,6 +52,7 @@ crds
 curlimages
 cwlogs
 daemonset
+datasource
 dcgm
 distro
 ecrpublic

diff --git a/docs/patterns/elastic-fabric-adapter.md b/docs/patterns/elastic-fabric-adapter.md
diff --git a/docs/patterns/nvidia-gpu-efa.md b/docs/patterns/nvidia-gpu-efa.md
@@ -0,0 +1,7 @@
+---
+title: NVIDIA GPUs with EFA
+---
+
+{%
+   include-markdown "../../patterns/nvidia-gpu-efa/README.md"
+%}
diff --git a/patterns/elastic-fabric-adapter/main.tf b/patterns/elastic-fabric-adapter/main.tf
diff --git a/patterns/elastic-fabric-adapter/outputs.tf b/patterns/elastic-fabric-adapter/outputs.tf
diff --git a/patterns/elastic-fabric-adapter/variables.tf b/patterns/elastic-fabric-adapter/variables.tf
diff --git a/patterns/elastic-fabric-adapter/versions.tf b/patterns/elastic-fabric-adapter/versions.tf
diff --git a/patterns/elastic-fabric-adapter/README.md → patterns/nvidia-gpu-efa/README.md b/patterns/elastic-fabric-adapter/README.md → patterns/nvidia-gpu-efa/README.md
@@ -1,6 +1,25 @@
-# EKS Cluster w/ Elastic Fabric Adapter
+# EKS Cluster w/ NVIDIA GPUs and EFA for Machine Learning
 
-This pattern demonstrates an Amazon EKS Cluster with an EFA-enabled nodegroup.
+This pattern demonstrates an Amazon EKS Cluster with an EFA-enabled nodegroup that utilizes `p5.48xlarge` instances with H100 NVIDIA GPUs used in distributed, multi-node machine learning workloads.
+
+The following components are demonstrated in this pattern:
+
+- A "default" node group that supports addons and components that require neither GPUs nor EFA devices. Any pods that do not tolerate the taints of the GPU node group will be scheduled on instances within this node group.
+- A node group of `p5.48xlarge` instances with
+    - all 32 [EFA network interfaces](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) enabled
+    - provisioned within a placement group so that the instances are located close to one another in a single availability zone that supports the instance type
+    - a common NVIDIA taint of `"nvidia.com/gpu:NoSchedule"` to ensure only the intended applications are allowed to run on the nodes created
+    - two labels to identify that this nodegroup supports NVIDIA GPUs and EFA devices and allow pods to use node selectors with these labels
+    - the NVMe instance store volumes are mounted in a RAID-0 array to provide a single, large, high-performance storage volume for the GPU workloads
+    - kubelet and containerd are configured to utilize the RAID-0 volume, allowing kubelet to discover the additional storage as ephemeral storage that can be utilized by pods
+- A Helm chart deployment for the [NVIDIA device plugin](https://github.com/NVIDIA/k8s-device-plugin) to expose and mount the GPUs provided by the instances to the pods that request them
+- A Helm chart deployment for the EFA device plugin to expose and mount the EFA network interfaces provided by the instances to the pods that request them. Since the EFA network interfaces are only found on the instances that provide NVIDIA GPUs in this pattern, we do not apply an additional taint for the EFA network interfaces to avoid over-constraining.
+
+## Code
+
+```terraform hl_lines="23-25 31-68"
+{% include  "../../patterns/nvidia-gpu-efa/eks.tf" %}
+```
 
 ## Deploy