
Pre-Training Llama 3.1 on AWS Trainium using Ray and PyTorch Lightning #725

Open
wants to merge 4 commits into main
Conversation

sindhupalakodety (Contributor)

What does this PR do?

An example combining Ray, PyTorch Lightning (PTL), and AWS Neuron to pre-train the Llama 3.1 model on Trn1 instances. This example was requested by multiple customers.

The integration of Ray, PyTorch Lightning (PTL), and AWS Neuron combines PTL's intuitive model-development API, Ray Train's robust distributed computing for seamless scaling across multiple nodes, and AWS Neuron's hardware optimization for Trainium. Together they significantly simplify the setup and management of distributed training environments for large-scale AI projects, particularly computationally intensive ones such as large language models.
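To make the wiring concrete, here is a minimal, illustrative sketch of how the pieces fit together. It is not the PR's training script: the toy module stands in for the Llama 3.1 LightningModule, the worker count is a placeholder, and on Trainium you would additionally pass the experimental ray.train.torch.xla.TorchXLAConfig and request Neuron cores per worker.

```python
# Illustrative sketch only; not the PR's actual training script.
import torch
import pytorch_lightning as pl
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import RayDDPStrategy, RayLightningEnvironment, prepare_trainer

class ToyModule(pl.LightningModule):
    """Stand-in for the real Llama 3.1 LightningModule."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)

def train_loop_per_worker(config):
    # Runs once per Ray worker; Ray Train sets up the process group.
    data = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    loader = torch.utils.data.DataLoader(data, batch_size=16)
    trainer = pl.Trainer(
        max_epochs=config["epochs"],
        accelerator="cpu",                    # Trainium/XLA would replace this
        strategy=RayDDPStrategy(),            # Ray-aware DDP strategy
        plugins=[RayLightningEnvironment()],  # cluster environment provided by Ray Train
        enable_checkpointing=False,
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(ToyModule(), loader)

if __name__ == "__main__":
    TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 1},
        scaling_config=ScalingConfig(num_workers=2),  # placeholder worker count
    ).fit()
```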

Motivation

Issue: #724

More

  • [x] Yes, I have tested the PR using my local account setup (test evidence is provided under Additional Notes)
  • [x] Mandatory for new blueprints: Yes, I have added an example to support my blueprint PR
  • [x] Mandatory for new blueprints: Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • [x] E2E test successfully completed before merge?

Additional Notes

We tested this for a customer use case and demoed the solution to them; the customer was impressed with the results.

@vara-bonthu (Collaborator)

@omrishiv Would you be able to review this, or find someone who can? Thanks

@omrishiv (Contributor) commented Feb 5, 2025

Yes, I can take a look. Hopefully by EoW

@omrishiv (Contributor) left a comment

Thank you for putting this together. I left a few comments to start with. I'm wondering, though, whether this is similar to optimum-neuron. Have you tried that? If it's similar, is it possible to reuse some of that framework without as many static files?

mountPath: /shared
# Node Selector for Karpenter
# Karpenter will provision this head pod on a node with the specified labels.
nodeSelector:

Are these necessary? The pod won't land on a non-CPU node because of the taints, and the keys/values may differ in other deployments.

persistentVolumeClaim:
claimName: fsx-claim # Reference the PVC for shared storage
rayStartParams:
dashboard-host: 0.0.0.0 # Make dashboard accessible

Please set num-cpus: 0 so we don't schedule actors on the head node.
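For example, in the head group's rayStartParams (illustrative snippet):

```yaml
rayStartParams:
  dashboard-host: "0.0.0.0"
  num-cpus: "0"  # advertise zero CPUs so Ray never places actors or tasks on the head node
```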

name: log-volume # Mount for Ray logs
# Node Selector for Managed Node Group (with Cluster Autoscaler)
# These workers will run on Trn1 instances provisioned by the cluster autoscaler.
# This is necessary as Karpenter doesn't currently support EFA (required for Neuron distributed training).

Is this true? See aws/karpenter-provider-aws#5068; I think you can request the resource.
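For reference, EFA is exposed as an extended resource that a worker pod can request directly, assuming the EFA device plugin is installed and the Karpenter version in use supports EFA-capable instances. An illustrative sketch (the counts match a trn1.32xlarge but should be verified):

```yaml
resources:
  limits:
    aws.amazon.com/neuron: 16  # Trainium devices on trn1.32xlarge
    vpc.amazonaws.com/efa: 8   # EFA interfaces exposed by the device plugin
```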

- key: "aws.amazon.com/neuron"
operator: "Exists"
effect: "NoSchedule"
- key: "hub.jupyter.org/dedicated"

If we are trying to use the Jupyter taint to keep other pods off of Jupyter nodes, we shouldn't add a toleration for it here; see the sketch below.
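That is, keep only the toleration the Trainium workers actually need (illustrative):

```yaml
tolerations:
  - key: "aws.amazon.com/neuron"
    operator: "Exists"
    effect: "NoSchedule"
  # no hub.jupyter.org/dedicated toleration; that taint should keep non-Jupyter pods off those nodes
```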

name: llama3.1-parallel-compile-job
spec:
submissionMode: K8sJobMode
entrypoint: "NEURON_NUM_DEVICES=32 bash run_llama3.1_8b.sh -r 2 -n 16 -l 4e-4 -s 8192 -p 1"

It might be worth extracting these values into environment variables; that way you could do some hyperparameter tuning, or update them in one place by mounting them from a config file. Just a thought; a sketch follows.
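Something along these lines, assuming the flags map to hyperparameters as sketched (the variable names are hypothetical, and the values could equally come from a ConfigMap):

```yaml
entrypoint: "bash run_llama3.1_8b.sh -r $REPLICAS -n $NUM_NODES -l $LR -s $SEQ_LEN -p $PP_SIZE"
runtimeEnvYAML: |
  env_vars:
    NEURON_NUM_DEVICES: "32"
    REPLICAS: "2"
    NUM_NODES: "16"
    LR: "4e-4"
    SEQ_LEN: "8192"
    PP_SIZE: "1"
```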

# Login to ECR
echo -e "\nLogging in to ECR"
aws ecr get-login-password --region "$region" | docker login --username AWS --password-stdin $ECR_REPO_URI
aws ecr get-login-password --region "$region" | docker login --username AWS --password-stdin 763104351884.dkr.ecr.${region}.amazonaws.com/pytorch-training-neuronx

You set the region dynamically here, but it is hardcoded in the image: key of the YAML deployment. Please double-check.
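One option is to template the region into the manifest at deploy time; a sketch, with a hypothetical <REGION> placeholder and file names:

```bash
# Substitute the region into the manifest instead of hardcoding it
sed -e "s|<REGION>|${region}|g" ray-job.yaml.tpl > ray-job.yaml
```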

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
# from ray.train.torch.xla import TorchXLAConfig

If you don't need this, please remove it.

@@ -0,0 +1,9 @@
pytorch-lightning

Please freeze all dependencies; otherwise this may break.
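For example (the versions below are purely illustrative; pin whatever was actually tested):

```
pytorch-lightning==2.1.3
ray[train]==2.9.0
```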

# warmup steps
WARMUP_STEPS=100
# learning rate
#LR=3.0e-4

Commented-out code can be removed; same with MODEL_PATH.

@@ -0,0 +1,362 @@
---
sidebar_position: 1

Why are we repositioning all of the documents?
