-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AZ mapping Code changes #421
Merged
Merged
Changes from all commits
Commits
Show all changes
51 commits
Select commit
Hold shift + click to select a range
1457d05
MPI operator code for distributed training
sanjeevrg89 0a46f3a
Making MPI operator optional for users
sanjeevrg89 fbd3ba9
added type string to mpi operator variable version
sanjeevrg89 ed0d72a
Merge branch 'awslabs:main' into main
sanjeevrg89 017a0ad
Merge branch 'awslabs:main' into main
sanjeevrg89 c415097
llama2 examples
sanjeevrg89 08634ed
llama2 pretraining updates
5cp fe66508
fix typo
5cp ff51584
Merge pull request #1 from 5cp/llama_updates
sanjeevrg89 74cbfd2
install pre-req script
sanjeevrg89 de1d173
more tools to prereq shell script
sanjeevrg89 92a8873
addtional tooling
sanjeevrg89 3cde526
addtional tooling python
sanjeevrg89 79b2b6d
AZ fix
sanjeevrg89 b1f8343
added jq
sanjeevrg89 a8cdf82
added tool checks
sanjeevrg89 30b259b
get az script update
sanjeevrg89 f5dbc09
az code fix
sanjeevrg89 407fa49
az code fix
sanjeevrg89 0c35b43
fix az script
sanjeevrg89 289bbfd
fix az script json output
sanjeevrg89 53cb92e
bug fix - always store ecr repo uri
5cp feaf9db
Merge pull request #2 from 5cp/llama_updates
sanjeevrg89 080abb5
eks and main code changes
sanjeevrg89 401e177
llama2 trainium doc
sanjeevrg89 e57cb42
initial doc updates
5cp 2f85081
more llama doc updates
5cp 9b76684
more updates
5cp 1e5e8da
more updates
5cp 7b5ac67
add subheadings to docs
5cp 7e3d377
update tensorboard blurb
5cp cf691d5
minor tweak
5cp a5f9d5b
missing img folder
sanjeevrg89 ae30478
PR review requested changes
sanjeevrg89 3c2b71f
Automatically select appropriate trn1/inf2-supporting AZs based on us…
5cp 700d5e6
added variables for trn1 and inf2 instance sizes
sanjeevrg89 4ede8eb
redo instance size variables for inf2 and trn1n
sanjeevrg89 ecbe68a
instance size variables fix
sanjeevrg89 51ef0be
fix trn1 default max size setting
sanjeevrg89 b71e27f
llama2 training doc update
sanjeevrg89 3d0d674
code changes to map AZs
sanjeevrg89 6eb0099
AZ fetch code changes
sanjeevrg89 49cf49a
reverted back to original AZ implementation
sanjeevrg89 0620075
addressed latest PR reviewed changes
sanjeevrg89 100aa25
Fix trn1 nodegroups so they use the preferred subnet/AZ
5cp 8a43b23
Merge pull request #3 from 5cp/trn1_az_fix
sanjeevrg89 c179262
az changes for trn1
sanjeevrg89 19dc56c
pre-req script fix
sanjeevrg89 22d10cf
pre-req issue fix
sanjeevrg89 c86172d
AZ mapping changes
sanjeevrg89 74d662f
fixed spelling mistakes
sanjeevrg89 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,12 +4,12 @@ | |
install_docker() { | ||
echo "Checking and installing Docker..." | ||
sudo yum install docker -y | ||
sudo systemctl start docker | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why'd you change this? The existing code should be correct. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sanjeevrg89 Could you please verify the line that you removed and update accordingly |
||
sudo service docker start | ||
sudo usermod -aG docker $(whoami) | ||
newgrp docker | ||
# newgrp docker removed to prevent script interruption | ||
} | ||
|
||
# Install a package if it is not already installed | ||
# Function to install a package using yum | ||
install_package() { | ||
PACKAGE=$1 | ||
echo "Checking for $PACKAGE..." | ||
|
@@ -21,6 +21,46 @@ install_package() { | |
fi | ||
} | ||
|
||
# Function to install kubectl | ||
install_kubectl() { | ||
echo "Installing kubectl..." | ||
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" | ||
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl | ||
} | ||
|
||
# Function to install Terraform | ||
install_terraform() { | ||
echo "Installing Terraform..." | ||
sudo yum install -y yum-utils | ||
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo | ||
sudo yum install -y terraform | ||
} | ||
|
||
# Function to install AWS CLI v2 | ||
install_aws_cli() { | ||
echo "Installing AWS CLI v2..." | ||
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" | ||
unzip awscliv2.zip | ||
sudo ./aws/install | ||
echo "AWS CLI v2 installed successfully." | ||
} | ||
|
||
# Function to install Helm | ||
install_helm() { | ||
echo "Installing Helm..." | ||
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | ||
chmod 700 get_helm.sh | ||
./get_helm.sh | ||
echo "Helm installed successfully." | ||
} | ||
|
||
# Function to install Boto3 | ||
install_boto3() { | ||
echo "Installing Boto3..." | ||
pip3 install boto3 | ||
echo "Boto3 installed successfully." | ||
} | ||
|
||
echo "Starting installation of prerequisites..." | ||
|
||
# Install Docker | ||
|
@@ -33,7 +73,13 @@ install_package unzip | |
install_package python3-pip | ||
install_package jq | ||
|
||
# Additional installations (kubectl, AWS CLI v2, Terraform, Helm, Boto3)... | ||
# (Include the existing logic for these installations here, with similar echo statements for tracking) | ||
# Install kubectl, Terraform, AWS CLI v2, Helm, and Boto3 | ||
install_kubectl | ||
install_terraform | ||
install_aws_cli | ||
install_helm | ||
install_boto3 | ||
|
||
echo "Installation of prerequisites complete." | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is launching them in the same subnet a requirement? This requirement would make it actually more brittle as a generic blueprint, as instance availability is highly dynamic and varies across AZs across different regions. Ideally all available subnets would be supplied so that EC2 Auto Scaling and/or Karpenter can retry in different subnets on failure due to unavailabilty or lack of support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These workloads are intended to operate within a single subnet, and our previous approach also filters to the first private subnet. I recognize that a key challenge here is the availability of Trn1 instances in specific regions and AZs.
@sanjeevrg89's solution addresses this by allowing selection of only those regions and AZs where
Trn1
instances are available, as outlined in theaz_mapping
field section of themain.tf
file. This mapping field can be expanded to include additional regions and AZs where Trn1 availability exists.We can do better in write a documentation to explain this to users