This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Terraform is the canonical path for everything under "infrastructure": VPC endpoints, EKS control plane, system/GPU nodegroups, core addons, CSI drivers, Karpenter, GPU stack. All net-new features land in terraform/ modules.
Bash deployment scripts are maintenance-only. They will be archived to scripts/legacy/ once the structural reshuffle lands. Do not add new capabilities to the bash deployment pipeline. When fixing a bug that exists in both, fix it in terraform; only patch bash if a downstream user explicitly cannot migrate yet.
Bash kept permanently for runtime / ops tools that don't fit a declarative model:
option_inspect_eks.sh(post-apply 9-check health audit)option_verify_gpu_efa.sh(NCCL benchmark)option_show_nodegroup_topology.sh(readstopology.k8s.aws/...labels at runtime)option_create_bastion.sh(operator-driven bastion lifecycle)
Automated deployment system for production-grade AWS EKS clusters with advanced features including LVM-configured system nodes, Pod Identity authentication, and optional components (Karpenter, CSI drivers for EBS/EFS/FSx/S3, GPU nodes).
Key Characteristics:
- Deployment Environment: Must run from bastion host inside VPC (private API endpoint)
- Authentication: Pod Identity (not IRSA/OIDC) for all AWS service integrations
- Configuration: terraform via
terraform.tfvars; legacy bash via.env(sourced from0_setup_env.sh) - Architecture: terraform modules under
terraform/modules/; legacy bash scripts use sequential numbered files for core setup andoption_*for optional features
cd terraform
# One-time per account/region: bootstrap state backend
terraform -chdir=bootstrap apply -var="bucket_name=my-eks-tfstate" -var="region=us-west-2"
# Configure variables
cp terraform.tfvars.example terraform.tfvars
$EDITOR terraform.tfvars
# Apply (must run from inside the cluster's VPC for private mode)
terraform init \
-backend-config="bucket=my-eks-tfstate" \
-backend-config="key=eks-cluster-deployment/dev/terraform.tfstate" \
-backend-config="region=us-west-2" \
-backend-config="dynamodb_table=my-eks-tfstate-lock"
terraform plan
terraform applyModule-level toggles (install_karpenter, install_efs_csi, install_fsx_csi, install_s3_csi, install_gpu_nodegroups, install_gpu_stack, gpu_stack_mode, etc.) live in terraform.tfvars. See terraform/README.md for the full layout, three-stack split for private deploys, and safe-destroy.sh teardown procedure.
# 9-check cluster health audit
./scripts/option_inspect_eks.sh
# Cross-node NCCL benchmark (validates EFA + GPUDirect)
./scripts/option_verify_gpu_efa.sh
# Print AWS-native topology labels per nodegroup
./scripts/option_show_nodegroup_topology.sh
# Create SSM-only bastion (alternative to terraform/bootstrap-bastion/)
./scripts/option_create_bastion.sh
# Test Karpenter node provisioning (workload sample)
./examples/option_test_karpenter_pools.sh
# Test pod scheduling on system nodes (workload sample)
./examples/option_test_pod_scheduling.shThe old 1_* ~ 7_* and option_install_* scripts have moved to scripts/legacy/ and are maintenance-only. See scripts/legacy/README.md and docs/MIGRATION_FROM_BASH.md.
# Check cluster status
aws eks describe-cluster --name ${CLUSTER_NAME} --region ${AWS_REGION}
# Verify nodes
kubectl get nodes -o wide
# Check system pods
kubectl get pods -n kube-system
# Verify addons
aws eks list-addons --cluster-name ${CLUSTER_NAME} --region ${AWS_REGION}
# Test storage classes
kubectl get storageclass
# View metrics
kubectl top nodes
kubectl top pods -ANumbered Scripts (0-7): Core deployment sequence that must run in order
0_setup_env.sh: Environment configuration loader and validation functions (always source this first)1-3: Network infrastructure setup (VPC DNS, validation, endpoints)4: EKS cluster control plane creation5: Local environment check (optional; alternative to bastion)6: System nodegroup creation (with LVM-backed containerd storage)7: Core addon installation (CoreDNS, Cluster Autoscaler, ALB Controller, EBS CSI, Metrics Server)
option_ Scripts*: Optional features that can be installed after core deployment
- Can run independently after core setup completes
- Idempotent and safe to re-run
All AWS integrations use Pod Identity (not IRSA/OIDC). Helper functions in pod_identity_helpers.sh:
# Key helper functions
create_pod_identity_role <role_name> # Create IAM role with Pod Identity trust policy
attach_managed_policy <role_name> <policy_arn> # Attach AWS managed policy
attach_custom_policy <role_name> <policy_name> <policy_document> # Attach custom policy
create_pod_identity_association <namespace> <sa> <role_arn> # Associate role with K8s SAPattern for adding new components:
- Source
0_setup_env.shandpod_identity_helpers.sh - Create IAM role with
create_pod_identity_role - Attach necessary policies
- Create Pod Identity association
- Deploy K8s manifests with ServiceAccount
terraform/ # Canonical infra-as-code (use this)
├── main.tf / variables.tf / outputs.tf / providers.tf / versions.tf
├── terraform.tfvars.example
├── bootstrap/ # state backend (S3 + DynamoDB)
├── bootstrap-vpc/ # standalone 3-AZ VPC for testing
├── bootstrap-bastion/ # SSM-only bastion for private deploys
├── modules/ # vpc-endpoints / eks-cluster / eks-system-nodegroup /
│ # eks-addons / eks-csi-drivers / eks-karpenter /
│ # eks-gpu-nodegroup / eks-gpu-stack
├── assets/ # static files referenced by modules
│ ├── iam/ # alb-controller / fsx-csi IAM policy JSON
│ └── karpenter/ # EC2NodeClass + NodePool YAML templates
└── scripts/safe-destroy.sh # helm uninstall → terraform destroy wrapper
scripts/ # Operational tools (kept as bash)
├── 0_setup_env.sh # shared by ops tools and legacy
├── topology_inventory_lib.sh # lib for topology / GPU verify
├── option_inspect_eks.sh # 9-check post-apply health audit
├── option_verify_gpu_efa.sh # cross-node NCCL benchmark
├── option_show_nodegroup_topology.sh # print AWS-native topology labels
├── option_create_bastion.sh # operator-driven bastion lifecycle
└── legacy/ # deprecated bash deployment pipeline
├── 1_*.sh ... 7_*.sh # old numbered sequence
├── option_install_*.sh # csi / karpenter / gpu-nodegroups / gpu-stack
├── pod_identity_helpers.sh
├── disk_detection_lib.sh
├── instance_arch_lib.sh
└── manifests/ # bash-only YAMLs (autoscaler, storageclass)
examples/ # Workload samples + sanity test scripts
├── option_test_pod_scheduling.sh
├── option_test_karpenter_pools.sh
└── *.yaml # ebs / efs / fsx / s3 / nlb test apps
docs/ # See MIGRATION_FROM_BASH.md and DEPLOYMENT_SOP.md
Terraform (canonical): terraform/terraform.tfvars (copy from terraform.tfvars.example)
- Required:
cluster_name,vpc_id,private_subnet_ids,public_subnet_ids - Auto-detected via provider: account ID, region (from AWS CLI / env)
- Toggles:
install_karpenter,install_efs_csi,install_fsx_csi,install_s3_csi,install_gpu_nodegroups,install_gpu_stack,gpu_stack_mode - Multi-AZ:
private_subnet_ids/public_subnet_idsare lists, supporting 2-4 AZs - See
terraform/variables.tffor the full list and defaults
Legacy .env: kept for scripts/legacy/* callers; full variable list in .env.example. Variable mapping to terraform.tfvars is documented in docs/MIGRATION_FROM_BASH.md.
System nodes (app=eks-utils label) run cluster infrastructure:
- CoreDNS, Cluster Autoscaler, AWS Load Balancer Controller
- EBS CSI Driver, Metrics Server
- Uses LVM configuration: 50GB root + 100GB data volume
- containerd data directory on separate LVM volume for performance
Important: All addon manifests use node selectors to schedule on system nodes.
All CSI drivers are optional (via option_install_csi_drivers.sh):
- EBS: Block storage with gp3 (default) and io2 StorageClasses
- EFS: Shared filesystem across pods/nodes
- FSx: Lustre for HPC/ML workloads (requires PERSISTENT_2 for AL2023 lustre-client 2.15 compatibility)
- S3: Object storage mounting (Standard S3 and S3 Express One Zone, single replica - no HA needed)
New components land in terraform as a module under terraform/modules/<component>/. Follow the existing modules for the shape:
main.tf— IAM role +aws_eks_pod_identity_associationfor AWS permissions;helm_releaseoraws_eks_addonfor the workloadvariables.tf/outputs.tf/versions.tf- Wire into
terraform/main.tfwith acount = var.install_<component> ? 1 : 0toggle - Static assets (YAML templates, IAM JSON) live under
terraform/assets/<component>/ - Add the toggle and any tunables to
terraform/variables.tfandterraform.tfvars.example - Use
node_selector = { "${var.system_node_label_key}" = var.system_node_label_value }if the component should run on system nodes
Do not add new bash scripts under scripts/legacy/ for this purpose.
- Resource existence: TF
datablocks fail at plan time - IAM propagation retries: AWS provider handles internally
- Pre-deploy CLI tooling check: not done by TF; operator concern (use
option_inspect_eks.shafter apply)
The legacy bash helpers (verify_kubectl_context, validate_vpc_exists, etc.) still exist in scripts/0_setup_env.sh and are used by the kept ops tools and examples/option_test_*.sh.
CPU Nodes (Graviton/x86): terraform module terraform/modules/eks-karpenter/ (legacy: scripts/legacy/option_install_karpenter.sh)
- EC2NodeClass:
terraform/assets/karpenter/ec2nodeclass-graviton.yaml,ec2nodeclass-x86.yaml - Graviton (arm64): r/c/m Graviton3+Graviton4 family, 4-16 vCPU, on-demand (example defaults — see manifest header)
- x86 (amd64): r/c/m Intel 6th+7th gen family, 4-16 vCPU, on-demand
- LVM configuration for containerd data volume
GPU support is split across two scripts (and two terraform modules) by responsibility:
Layer 1 — terraform/modules/eks-gpu-nodegroup/ (AWS infra; legacy: scripts/legacy/option_install_gpu_nodegroups.sh)
- Uses AWS Managed Node Groups (not Karpenter) for EFA multi-NIC support
- IAM role + GPU SG (with EFA self-egress) + Launch Template + NodeGroup
- EFA interface counts:
- p5.48xlarge: 32 ENIs (1 primary + 31 EFA-only)
- p5en.48xlarge: 16 ENIs (1 primary + 15 EFA-only)
- p6-b200.48xlarge: 8 ENIs (1 primary + 7 EFA-only)
- p6-b300.48xlarge: 17 ENIs (1 primary + 16 EFA-only)
- g7e.48xlarge: 4 ENIs (1 primary + 3 EFA-only)
- Pricing options (mutually exclusive — choose ONE): OD / Spot / ODCR / CB
- LVM configuration for containerd data volume + Instance Store scratch
- Node labels:
workload-type=gpu,gpu-instance-type=<type>,purchase-option=<od|spot|odcr|cb> - Taints:
nvidia.com/gpu:NoSchedule
Layer 2 — terraform/modules/eks-gpu-stack/ (K8s workloads; legacy: scripts/legacy/option_install_gpu_stack.sh)
- Two mutually-exclusive modes via
gpu_stack_mode(terraform) /GPU_STACK_MODE(legacy):standard(default): nvidia-device-plugin + EFA plugin + dcgm-exporter + node-problem-detector + gpu-health-check DSoperator: NVIDIA GPU Operator (driver/toolkit/mofed disabled) + EFA plugin
- Terraform: switching mode triggers
terraform applyto retire stale resources from the previous mode - Legacy: mode-switch protected by
GPU_STACK_FORCE_SWITCH=trueto auto-uninstall conflicting releases; auto-invoked from layer 1 by default; skip withSKIP_GPU_STACK_AUTO_INSTALL=true
Test manifests in examples/:
- Test pod scheduling (Graviton/x86)
- Various storage test pods (EBS, EFS, S3)
- Karpenter scaling tests
- Always run from bastion: Cluster uses private API endpoint, requires VPC internal access
- System node labels:
app=eks-utilslabel critical for addon scheduling - Multi-AZ support: Supports 2-4 availability zones (minimum: A/B, optional: C, D)
- Terraform is canonical: full deployment lives in
terraform/; legacy bash underscripts/legacy/is maintenance-only - Documentation: See
terraform/README.mdfor the terraform path,docs/MIGRATION_FROM_BASH.mdfor bash↔terraform mapping,docs/DEPLOYMENT_SOP.mdfor the legacy step-by-step (still applicable to bash-deployed clusters),docs/DESIGN.mdfor future features - Bash quirks (when working in
scripts/legacy/):- Source
scripts/0_setup_env.shfirst; legacy scripts reference it via${SCRIPT_DIR}/../0_setup_env.sh - Use
verify_kubectl_contextbefore kubectl operations
- Source