-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AZ mapping Code changes #421
Conversation
simplified docker build for neuronx-nemo-megatron container added scripts for cli pod launch, precompilation, training added script for tensorboard deployment
Llama updates
bug fix - always store ecr repo uri
…er's chosen region
Fix trn1 nodegroups so they use the preferred subnet/AZ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments on spelling errors otherwise its good to go
ai-ml/trainium-inferentia/eks.tf
Outdated
@@ -46,7 +46,7 @@ module "eks" { | |||
|
|||
# security group rule from all ipv4 to nodes for port 22 | |||
node_security_group_additional_rules = { | |||
# Critical Security group rule for EFA enabled nodes | |||
# Critical Secruity group rule for EFA enabled nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check for spelling error and revert this change.
ai-ml/trainium-inferentia/eks.tf
Outdated
@@ -56,7 +56,7 @@ module "eks" { | |||
self = true | |||
} | |||
|
|||
# Critical Security group rule for EFA enabled nodes | |||
# Critical Secruity group rule for EFA enabled nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check for spelling error and revert this change.
ai-ml/trainium-inferentia/eks.tf
Outdated
@@ -248,7 +248,7 @@ module "eks" { | |||
} | |||
] | |||
|
|||
# Commented to investigate further as the node group creation is failing with placement group | |||
# Commented to investigate further as the node group creation is failing with palcement group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check for spelling error and revert this change.
ai-ml/trainium-inferentia/eks.tf
Outdated
@@ -458,7 +457,7 @@ module "eks" { | |||
}, | |||
] | |||
|
|||
# Commented to investigate further as the node group creation is failing with placement group | |||
# Commented to investigate further as the node group creation is failing with palcement group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check for spelling error and revert this change.
# The preferred AZ is the first AZ listed in the AZ id <-> region mapping in main.tf. | ||
# We use index 2 to select the subnet in AZ1 with the 100.x CIDR: | ||
# module.vpc.private_subnets = [AZ1_10.x, AZ2_10.x, AZ1_100.x, AZ2_100.x] | ||
subnet_ids = [module.vpc.private_subnets[2]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is launching them in the same subnet a requirement? This requirement would make it actually more brittle as a generic blueprint, as instance availability is highly dynamic and varies across AZs across different regions. Ideally all available subnets would be supplied so that EC2 Auto Scaling and/or Karpenter can retry in different subnets on failure due to unavailabilty or lack of support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These workloads are intended to operate within a single subnet, and our previous approach also filters to the first private subnet. I recognize that a key challenge here is the availability of Trn1 instances in specific regions and AZs.
@sanjeevrg89's solution addresses this by allowing selection of only those regions and AZs where Trn1
instances are available, as outlined in the az_mapping
field section of the main.tf
file. This mapping field can be expanded to include additional regions and AZs where Trn1 availability exists.
We can do better in write a documentation to explain this to users
pick cd7ab810 Pre-commit updates pick 6e61ad4b Added requirements into add-ons and removed from default vcalues
What does this PR do?
3 files have changed: main.tf, eks.tf and install-pre-requsites-for-ec2.sh
🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.
AZ mapping change was made as users were reporting issues related to provisioning trn1 instances.
Motivation
More
website/docs
orwebsite/blog
section for this featurepre-commit run -a
with this PR. Link for installing pre-commit locallyFor Moderators
Additional Notes