AZ mapping Code changes #421

Merged
51 commits merged on Jan 31, 2024
Commits (51)
1457d05
MPI operator code for distributed training
sanjeevrg89 Oct 31, 2023
0a46f3a
Making MPI operator optional for users
sanjeevrg89 Nov 6, 2023
fbd3ba9
added type string to mpi operator variable version
sanjeevrg89 Nov 6, 2023
ed0d72a
Merge branch 'awslabs:main' into main
sanjeevrg89 Nov 10, 2023
017a0ad
Merge branch 'awslabs:main' into main
sanjeevrg89 Dec 13, 2023
c415097
llama2 examples
sanjeevrg89 Dec 13, 2023
08634ed
llama2 pretraining updates
5cp Dec 14, 2023
fe66508
fix typo
5cp Dec 14, 2023
ff51584
Merge pull request #1 from 5cp/llama_updates
sanjeevrg89 Dec 14, 2023
74cbfd2
install pre-req script
sanjeevrg89 Dec 14, 2023
de1d173
more tools to prereq shell script
sanjeevrg89 Dec 14, 2023
92a8873
additional tooling
sanjeevrg89 Dec 14, 2023
3cde526
additional tooling python
sanjeevrg89 Dec 14, 2023
79b2b6d
AZ fix
sanjeevrg89 Dec 14, 2023
b1f8343
added jq
sanjeevrg89 Dec 14, 2023
a8cdf82
added tool checks
sanjeevrg89 Dec 14, 2023
30b259b
get az script update
sanjeevrg89 Dec 14, 2023
f5dbc09
az code fix
sanjeevrg89 Dec 14, 2023
407fa49
az code fix
sanjeevrg89 Dec 14, 2023
0c35b43
fix az script
sanjeevrg89 Dec 15, 2023
289bbfd
fix az script json output
sanjeevrg89 Dec 15, 2023
53cb92e
bug fix - always store ecr repo uri
5cp Dec 15, 2023
feaf9db
Merge pull request #2 from 5cp/llama_updates
sanjeevrg89 Dec 15, 2023
080abb5
eks and main code changes
sanjeevrg89 Dec 15, 2023
401e177
llama2 trainium doc
sanjeevrg89 Dec 15, 2023
e57cb42
initial doc updates
5cp Dec 15, 2023
2f85081
more llama doc updates
5cp Dec 15, 2023
9b76684
more updates
5cp Dec 15, 2023
1e5e8da
more updates
5cp Dec 15, 2023
7b5ac67
add subheadings to docs
5cp Dec 15, 2023
7e3d377
update tensorboard blurb
5cp Dec 15, 2023
cf691d5
minor tweak
5cp Dec 15, 2023
a5f9d5b
missing img folder
sanjeevrg89 Dec 19, 2023
ae30478
PR review requested changes
sanjeevrg89 Jan 2, 2024
3c2b71f
Automatically select appropriate trn1/inf2-supporting AZs based on us…
5cp Jan 4, 2024
700d5e6
added variables for trn1 and inf2 instance sizes
sanjeevrg89 Jan 4, 2024
4ede8eb
redo instance size variables for inf2 and trn1n
sanjeevrg89 Jan 4, 2024
ecbe68a
instance size variables fix
sanjeevrg89 Jan 4, 2024
51ef0be
fix trn1 default max size setting
sanjeevrg89 Jan 4, 2024
b71e27f
llama2 training doc update
sanjeevrg89 Jan 4, 2024
3d0d674
code changes to map AZs
sanjeevrg89 Jan 16, 2024
6eb0099
AZ fetch code changes
sanjeevrg89 Jan 16, 2024
49cf49a
reverted back to original AZ implementation
sanjeevrg89 Jan 17, 2024
0620075
addressed latest PR reviewed changes
sanjeevrg89 Jan 19, 2024
100aa25
Fix trn1 nodegroups so they use the preferred subnet/AZ
5cp Jan 31, 2024
8a43b23
Merge pull request #3 from 5cp/trn1_az_fix
sanjeevrg89 Jan 31, 2024
c179262
az changes for trn1
sanjeevrg89 Jan 31, 2024
19dc56c
pre-req script fix
sanjeevrg89 Jan 31, 2024
22d10cf
pre-req issue fix
sanjeevrg89 Jan 31, 2024
c86172d
AZ mapping changes
sanjeevrg89 Jan 31, 2024
74d662f
fixed spelling mistakes
sanjeevrg89 Jan 31, 2024
23 changes: 11 additions & 12 deletions ai-ml/trainium-inferentia/eks.tf
@@ -133,11 +133,11 @@ module "eks" {
trn1-32xl-ng1 = {
name = "trn1-32xl-ng1"
description = "Tran1 32xlarge node group for hosting ML workloads"
# The code filters the private subnets based on their CIDR blocks and selects the subnet ID if the CIDR block starts with "100." Otherwise, it assigns a null value.
# The element(compact([...]), 0) expression ensures that only the first non-null value is included in the resulting list of subnet IDs.
subnet_ids = [element(compact([for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) :
substr(cidr_block, 0, 4) == "100." ? subnet_id : null]), 0)]

# All trn1 instances should be launched into the same subnet in the preferred trn1 AZ
# The preferred AZ is the first AZ listed in the AZ id <-> region mapping in main.tf.
# We use index 2 to select the subnet in AZ1 with the 100.x CIDR:
# module.vpc.private_subnets = [AZ1_10.x, AZ2_10.x, AZ1_100.x, AZ2_100.x]
subnet_ids = [module.vpc.private_subnets[2]]


Why is launching them in the same subnet a requirement? This requirement would make it actually more brittle as a generic blueprint, as instance availability is highly dynamic and varies across AZs across different regions. Ideally all available subnets would be supplied so that EC2 Auto Scaling and/or Karpenter can retry in different subnets on failure due to unavailabilty or lack of support.

@vara-bonthu (Collaborator) commented on Jan 31, 2024:

These workloads are intended to operate within a single subnet, and our previous approach also filters to the first private subnet. I recognize that a key challenge here is the availability of Trn1 instances in specific regions and AZs.

@sanjeevrg89's solution addresses this by allowing selection of only those regions and AZs where Trn1 instances are available, as outlined in the az_mapping field section of the main.tf file. This mapping field can be expanded to include additional regions and AZs where Trn1 availability exists.

We can do better by writing documentation to explain this to users.
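
For illustration, here is a minimal sketch of how the az_mapping local in main.tf could be extended with an additional region. The extra entry and its AZ ids below are placeholders only, not part of this PR; any real entry would need to be verified against actual Trn1/Inf2 availability.

locals {
  # Hypothetical extension of az_mapping -- the last entry is a placeholder.
  az_mapping = {
    "us-west-2"    = ["usw2-az4", "usw2-az1"],
    "us-east-1"    = ["use1-az6", "use1-az5"],
    "us-east-2"    = ["use2-az3", "use2-az1"],
    "eu-example-1" = ["euex1-az1", "euex1-az2"] # placeholder entry, verify availability first
  }

  # The first AZ id in the list is treated as the preferred Trn1 AZ.
  azs = local.az_mapping[var.region]
}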

# aws ssm get-parameters --names /aws/service/eks/optimized-ami/1.27/amazon-linux-2-gpu/recommended/image_id --region us-west-2
# ami_id = "ami-0e0deb7ae582f6fe9" # Use this to pass custom AMI ID and ignore ami_type
ami_type = "AL2_x86_64_GPU" # Contains Neuron driver
@@ -278,15 +278,14 @@ module "eks" {
trn1n-32xl-ng = {
name = "trn1n-32xl-ng"
description = "trn1n 32xlarge node group for hosting ML workloads"
# The code filters the private subnets based on their CIDR blocks and selects the subnet ID if the CIDR block starts with "100." Otherwise, it assigns a null value.
# The element(compact([...]), 0) expression ensures that only the first non-null value is included in the resulting list of subnet IDs.
subnet_ids = [element(compact([for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) :
substr(cidr_block, 0, 4) == "100." ? subnet_id : null]), 0)
]

# All trn1 instances should be launched into the same subnet in the preferred trn1 AZ
# The preferred AZ is the first AZ listed in the AZ id <-> region mapping in main.tf.
# We use index 2 to select the subnet in AZ1 with the 100.x CIDR:
# module.vpc.private_subnets = [AZ1_10.x, AZ2_10.x, AZ1_100.x, AZ2_100.x]
subnet_ids = [module.vpc.private_subnets[2]]
# aws ssm get-parameters --names /aws/service/eks/optimized-ami/1.27/amazon-linux-2-gpu/recommended/image_id --region us-west-2
# ami_id = "ami-0e0deb7ae582f6fe9" # Use this to pass custom AMI ID and ignore ami_type
ami_type = "AL2_x86_64_GPU"
ami_type = "AL2_x86_64_GPU" # Contains Neuron driver
instance_types = ["trn1n.32xlarge"]

pre_bootstrap_user_data = <<-EOT
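As a possible follow-up to the brittleness concern raised above, the preferred subnet could be selected by CIDR prefix and AZ id instead of a hard-coded list index. The sketch below is not part of this PR and assumes Terraform >= 1.3 (for startswith) and that local.azs[0] holds the preferred Trn1 AZ id.

# Sketch only: look up each private subnet so the preferred Trn1 subnet can be
# selected by CIDR prefix and AZ id rather than by a fixed index.
data "aws_subnet" "private" {
  for_each = toset(module.vpc.private_subnets)
  id       = each.value
}

locals {
  # e.g. local.azs[0] == "usw2-az4" in us-west-2
  trn1_preferred_subnet_ids = [
    for s in data.aws_subnet.private : s.id
    if startswith(s.cidr_block, "100.") && s.availability_zone_id == local.azs[0]
  ]
}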
56 changes: 51 additions & 5 deletions ai-ml/trainium-inferentia/examples/llama2/install-pre-requsites-for-ec2.sh
100644 → 100755
@@ -4,12 +4,12 @@
install_docker() {
echo "Checking and installing Docker..."
sudo yum install docker -y
sudo systemctl start docker


Why'd you change this? The existing code should be correct.


@sanjeevrg89 Could you please verify the line that you removed and update accordingly

sudo service docker start
sudo usermod -aG docker $(whoami)
newgrp docker
# newgrp docker removed to prevent script interruption
}

# Install a package if it is not already installed
# Function to install a package using yum
install_package() {
PACKAGE=$1
echo "Checking for $PACKAGE..."
@@ -21,6 +21,46 @@ install_package() {
fi
}

# Function to install kubectl
install_kubectl() {
echo "Installing kubectl..."
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
}

# Function to install Terraform
install_terraform() {
echo "Installing Terraform..."
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
sudo yum install -y terraform
}

# Function to install AWS CLI v2
install_aws_cli() {
echo "Installing AWS CLI v2..."
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
echo "AWS CLI v2 installed successfully."
}

# Function to install Helm
install_helm() {
echo "Installing Helm..."
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
echo "Helm installed successfully."
}

# Function to install Boto3
install_boto3() {
echo "Installing Boto3..."
pip3 install boto3
echo "Boto3 installed successfully."
}

echo "Starting installation of prerequisites..."

# Install Docker
@@ -33,7 +73,13 @@ install_package unzip
install_package python3-pip
install_package jq

# Additional installations (kubectl, AWS CLI v2, Terraform, Helm, Boto3)...
# (Include the existing logic for these installations here, with similar echo statements for tracking)
# Install kubectl, Terraform, AWS CLI v2, Helm, and Boto3
install_kubectl
install_terraform
install_aws_cli
install_helm
install_boto3

echo "Installation of prerequisites complete."


69 changes: 65 additions & 4 deletions ai-ml/trainium-inferentia/main.tf
@@ -41,11 +41,72 @@ data "aws_ecrpublic_authorization_token" "token" {
locals {
name = var.name
region = var.region
# Training and Inference instances are available in the following AZs us-east-1 and us-west-2
# You can find the list of supported AZs here: https://aws.amazon.com/ec2/instance-types/trn1/
azs = ["${local.region}c", "${local.region}d"]
# Trn1 and Inf2 instances are available in specific AZs in us-east-1,
# us-east-2, and us-west-2. For Trn1, the first AZ id (below) should be used.
az_mapping = {
"us-west-2" = ["usw2-az4", "usw2-az1"],
"us-east-1" = ["use1-az6", "use1-az5"],
"us-east-2" = ["use2-az3", "use2-az1"]
}
azs = local.az_mapping[var.region]
tags = {
Blueprint = local.name
GithubRepo = "github.com/awslabs/data-on-eks"
}
}
provider "aws" {
region = local.region
}

# ECR always authenticates with `us-east-1` region
# Docs -> https://docs.aws.amazon.com/AmazonECR/latest/public/public-registries.html
provider "aws" {
alias = "ecr"
region = "us-east-1"
}

provider "kubernetes" {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = data.aws_eks_cluster_auth.this.token
}

provider "helm" {
kubernetes {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = data.aws_eks_cluster_auth.this.token
}
}
provider "kubectl" {
apply_retry_count = 30
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = data.aws_eks_cluster_auth.this.token
load_config_file = false
}

data "aws_eks_cluster_auth" "this" {
name = module.eks.cluster_name
}

data "aws_ecrpublic_authorization_token" "token" {
provider = aws.ecr
}

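
The static az_mapping table could also be cross-checked against live availability. The sketch below uses the aws_ec2_instance_type_offerings data source to list the AZ ids that currently offer trn1.32xlarge in the selected region; it is shown for illustration only and is not what this PR implements (the PR deliberately reverted to the static mapping).

# Sketch only: discover AZ ids that currently offer trn1.32xlarge in var.region,
# as a cross-check for the static az_mapping table above.
data "aws_ec2_instance_type_offerings" "trn1" {
  filter {
    name   = "instance-type"
    values = ["trn1.32xlarge"]
  }
  location_type = "availability-zone-id"
}

output "trn1_supported_az_ids" {
  description = "AZ ids in var.region that report trn1.32xlarge offerings"
  value       = data.aws_ec2_instance_type_offerings.trn1.locations
}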