This tutorial shows how to launch a Ray + PTL + Neuron training job on multiple Trn1 nodes within an Amazon Elastic Kubernetes Service (EKS) cluster. In this example, the Llama3.1 8B model is fine-tuned on the open-source databricks/databricks-dolly-15k dataset from Hugging Face. Ray is used to launch the job on 2 trn1.32xlarge (or trn1n.32xlarge) instances, with 32 NeuronCores per instance.
PyTorch Lightning, developed by Lightning AI, is a library that provides a high-level interface for PyTorch and helps you organize your code and reduce boilerplate. By abstracting away engineering code, it makes deep learning experiments easier to reproduce and improves developer productivity.
Ray enhances ML workflows by seamlessly scaling fine-tuning and inference across distributed clusters, transforming single-node code into high-performance, multi-node operations with minimal effort.
AWS Neuron is an SDK with a compiler, runtime, and profiling tools that unlocks high-performance and cost-effective deep learning (DL) acceleration. It supports high-performance training on AWS Trainium instances. For model deployment, it supports high-performance and low-latency inference on AWS Inferentia.
The integration of Ray, PyTorch Lightning (PTL), and AWS Neuron combines PTL's intuitive model development API, Ray Train's robust distributed computing capabilities for seamless scaling across multiple nodes, and AWS Neuron's hardware optimization for Trainium. Together, they significantly simplify the setup and management of distributed training environments for large-scale AI projects, particularly those involving computationally intensive tasks such as training large language models.
The tutorial covers all steps required to prepare the EKS environment and launch the training job:
- Setting up the sandbox environment
- Setting up the EKS cluster and tools
- Creating the ECR repo and uploading the Docker image
- Creating the Ray cluster
- Preparing data
- Monitoring jobs
- Fine-tuning the model
- Deleting the environment
- Troubleshooting
Supported Regions: Begin by choosing an AWS region that supports both EKS and Trainium (for example, us-west-2, us-east-1, or us-east-2).
In your chosen region (for example, us-east-2), use the AWS Console or AWS CLI to launch an instance with the following configuration:
- Instance Type: m5.large
- AMI: Amazon Linux 2023 AMI (HVM)
- Key pair name: (choose a key pair that you have access to)
- Auto-assign public IP: Enabled
- Storage: 100 GiB root volume
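If you prefer to script this step, a minimal AWS CLI sketch of the same launch is shown below; the AMI ID and key pair name are placeholders you must replace with values from your own account and region.

# Sketch of the console configuration above; replace the AMI ID and key pair
# name with your own values before running.
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type m5.large \
    --key-name KEYPAIR \
    --associate-public-ip-address \
    --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=100}' \
    --region us-east-2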
Refer to the AWS IAM documentation to create a new IAM user with the following parameters:
- User name: eks_tutorial
- Select AWS credential type: enable Access key - Programmatic access
- Permissions: choose Attach existing policies directly, then select AdministratorAccess
Be sure to record the ACCESS_KEY_ID and SECRET_ACCESS_KEY that were created for the new IAM user.
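If you would rather create the user from the CLI, a minimal sketch follows; it assumes the credentials you are currently using are permitted to manage IAM.

aws iam create-user --user-name eks_tutorial
aws iam attach-user-policy --user-name eks_tutorial --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam create-access-key --user-name eks_tutorial   # record AccessKeyId and SecretAccessKey from the output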
- Connect to your instance via the AWS Console using EC2 Instance Connect, or SSH to your instance's public IP using the key pair you specified above, for example:

ssh -i KEYPAIR.pem ec2-user@INSTANCE_PUBLIC_IP_ADDRESS

- Run aws configure, entering the ACCESS_KEY_ID and SECRET_ACCESS_KEY you recorded above. For Default region name, be sure to specify the same region used to launch your jump host, for example, us-east-2.

bash> aws configure
AWS Access Key ID [None]: ACCESS_KEY_ID
AWS Secret Access Key [None]: SECRET_ACCESS_KEY
Default region name [None]: us-east-2
Default output format [None]:
Before you begin, make sure all the prerequisites are in place so the deployment process goes smoothly. Ensure that you have installed the following tools on your jump host.
Automation for prerequisites:
To install all the prerequisites above on the jump host, you can run this script, which is compatible with Amazon Linux 2023.
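After the script completes, you can sanity-check the installed tooling. This sketch assumes the usual tool set for the data-on-eks Terraform blueprints (AWS CLI, kubectl, Terraform, Docker); the exact list may vary with the script version.

aws --version
kubectl version --client
terraform --version
docker --version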
cd ~
git clone https://github.com/awslabs/data-on-eks.git
Navigate to the trainium-inferentia directory:
cd data-on-eks/ai-ml/trainium-inferentia
Run the following export commands to set the required environment variables.
# Enable FSx for Lustre, which will mount fine-tuning data to all pods across multiple nodes
export TF_VAR_enable_fsx_for_lustre=true
# Set the region according to your requirements. Check Trn1 instance availability in the specified region.
export TF_VAR_region=us-east-2
# Enable Volcano custom scheduler with KubeRay Operator
export TF_VAR_enable_volcano=true
# Note: This configuration will create two new Trn1 32xl instances. Ensure you validate the associated costs before proceeding. You can change the number of instances here.
export TF_VAR_trn1_32xl_min_size=2
export TF_VAR_trn1_32xl_desired_size=2
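Before running the installer, you can quickly confirm the variables are set as intended:

env | grep '^TF_VAR_'   # should list the variables exported above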
Run the installation script to provision an EKS cluster with all the add-ons needed for the solution.
./install.sh
Verify the Amazon EKS Cluster:
aws eks --region us-east-2 describe-cluster --name trainium-inferentia
# Creates k8s config file to authenticate with EKS
aws eks --region us-east-2 update-kubeconfig --name trainium-inferentia
kubectl get nodes # Output shows the EKS Managed Node group nodes
Verify that the Neuron device plugin DaemonSet is running with the following kubectl command:
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   2         2         2       2            2                           17d
List the allocatable NeuronCores on each node with the following kubectl command:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
NAME                                          NeuronCore
ip-192-168-65-41.us-west-2.compute.internal   32
ip-192-168-87-81.us-west-2.compute.internal   32
On your jump host:
sudo yum install -y git
git clone https://github.com/aws-neuron/aws-neuron-eks-samples.git
cd aws-neuron-eks-samples/llama3.1_8B_finetune_ray_ptl_neuron
The script 0-kuberay-trn1-llama3-finetune-build-image.sh checks whether the ECR repo kuberay_trn1_llama3.1_pytorch2 exists in the AWS account and creates it if it does not. The script then builds the Docker image and uploads it to this repo.
bash> chmod +x 0-kuberay-trn1-llama3-finetune-build-image.sh
bash> ./0-kuberay-trn1-llama3-finetune-build-image.sh
Enter the appropriate AWS region: # for example: us-east-2
If you have the required credentials, the Docker image is built and uploaded to the Amazon ECR repository in the specified AWS region.
Verify that the repository kuberay_trn1_llama3.1_pytorch2 was created successfully by heading to the Amazon ECR service in the AWS Console.
The manifest 1-llama3-finetune-trn1-create-raycluster.yaml creates a Ray cluster with a head pod and worker pods. Update the <AWS_ACCOUNT_ID> and <REGION> fields in the 1-llama3-finetune-trn1-create-raycluster.yaml file using the commands below, so they reference the ECR image URI created above:
bash> export AWS_ACCOUNT_ID=<enter_your_aws_account_id>   # for example: 111222333444
bash> export REGION=<enter_your_aws_region>               # for example: us-east-2
bash> sed -i "s/<AWS_ACCOUNT_ID>/$AWS_ACCOUNT_ID/g" 1-llama3-finetune-trn1-create-raycluster.yaml
bash> sed -i "s/<REGION>/$REGION/g" 1-llama3-finetune-trn1-create-raycluster.yaml
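After the substitutions, you can confirm that no placeholders remain; this check assumes the container image URI appears under an image: key in the manifest.

bash> grep -n "image:" 1-llama3-finetune-trn1-create-raycluster.yaml   # should show your account ID and region
bash> grep -c "<AWS_ACCOUNT_ID>" 1-llama3-finetune-trn1-create-raycluster.yaml   # should print 0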
Use the commands below to create the Ray cluster:
kubectl apply -f 1-llama3-finetune-trn1-create-raycluster.yaml
kubectl get pods # Ensure all head and worker pods are in Running state
The Ray cluster contains 1 head pod and 2 worker pods. Worker pods are deployed on the 2 Trainium instances (trn1.32xlarge).
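To confirm that the two worker pods were scheduled on the two Trainium nodes, include the node name in the pod listing:

kubectl get pods -o wide   # the NODE column shows which instance hosts each pod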
Use the command below to submit a Ray job for downloading the databricks/databricks-dolly-15k dataset and the Llama3.1 8B model:
kubectl apply -f 2-llama3-finetune-trn1-rayjob-create-data.yaml
You can check the output of kubectl get pods to find out whether the job has completed:
kubectl get pods
NAME                                              READY   STATUS      RESTARTS   AGE
2-llama3-finetune-trn1-rayjob-create-data-8qjfk   0/1     Completed   0          7m
cmd-shell                                         1/1     Running     0          10d
kuberay-trn1-head-zplg7                           1/1     Running     0          14m
kuberay-trn1-worker-workergroup-lwc2f             1/1     Running     0          14m
kuberay-trn1-worker-workergroup-zsm2z             1/1     Running     0          14m
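To inspect what the data-preparation job did, you can read its logs. The pod name suffix (8qjfk in the sample output above) will differ in your cluster, so substitute the name from your own kubectl get pods output.

kubectl logs 2-llama3-finetune-trn1-rayjob-create-data-8qjfk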
To view the Ray dashboard in a browser on your local machine:
kubectl port-forward service/kuberay-trn1-head-svc 8265:8265 &
Then head to http://localhost:8265/ in your local browser.
You can monitor the progress of the job in the Ray dashboard.
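If you have the Ray CLI installed on your local machine (for example, via pip install "ray[default]"), you can also query jobs through the same port-forward; this is optional and assumes the local Ray version is compatible with the cluster.

ray job list --address http://localhost:8265
ray job logs <JOB_ID> --address http://localhost:8265 --follow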
Use the command below to submit a Ray job for fine-tuning the model:
kubectl apply -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
Known Issues: If the Ray job fails with punkt or division by zero errors, see the Troubleshooting section below.
Model artifacts will be created under /shared/neuron_compile_cache/. Check the Ray logs for the "Training Completed" message.
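One way to check for that message without opening the dashboard is to grep the logs of the fine-tuning job's pod. Which pod carries the message can vary, so substitute the pod name from your kubectl get pods output.

kubectl logs <FINETUNE_JOB_POD> | grep "Training Completed"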
When you are finished with the tutorial, run the following commands on the jump host to remove the EKS cluster and associated resources:
# Delete Ray Jobs
kubectl delete -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
kubectl delete -f 2-llama3-finetune-trn1-rayjob-create-data.yaml
# Delete Ray Cluster
kubectl delete -f 1-llama3-finetune-trn1-create-raycluster.yaml
# Delete ECR Repo
Head to the AWS console and delete the ECR repo: kuberay_trn1_llama3.1_pytorch2
# Clean Up the EKS Cluster and Associated Resources:
cd data-on-eks/ai-ml/trainium-inferentia
./cleanup.sh
Terminate your EC2 jump host instance.
Delete the eks_tutorial IAM user via the AWS Console.
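If you prefer the CLI, the user can be removed there as well; access keys and attached policies must be deleted before the user itself. The access key ID below is a placeholder.

aws iam delete-access-key --user-name eks_tutorial --access-key-id <ACCESS_KEY_ID>
aws iam detach-user-policy --user-name eks_tutorial --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam delete-user --user-name eks_tutorial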
Known Issues:
If the job fails with the errors below:
(RayTrainWorker pid=3462, ip=100.64.83.225) [nltk_data] '/root/nltk_data/tokenizers/punkt_tab.zip'
(RayTrainWorker pid=3464, ip=100.64.83.225) [nltk_data] Error with downloaded zip file
(RayTrainWorker pid=3483, ip=100.64.83.225) Bad CRC-32 for file 'punkt_tab/czech/ortho_context.tab'
File "/tmp/ray/session_2024-11-13_06-40-30_347972_17/runtime_resources/working_dir_files/_ray_pkg_5ad2ee50e13a7e91/ray_neuron_xla_config_20.py", line 20, in _set_xla_env_vars
    "GROUP_WORLD_SIZE": str(context.get_world_size() / local_world_size),
ZeroDivisionError: division by zero
Workaround:
If your Ray fine-tuning job fails with errors associated with punkt or division by zero, delete the Ray job using the commands below, wait for 5 minutes, and re-run it. If the job fails again, wait 5 more minutes and re-run it a second time.
kubectl delete -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
kubectl apply -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
If you still face issues, reach out to us via the documentation. To report bugs, raise an issue via GitHub Issues.
Probable Cause: Punkt is a tokenizer used in Natural Language Processing (NLP) that is part of the NLTK (Natural Language Toolkit) library in Python. The errors above appear to occur when the code tries to use the Punkt data before it has finished downloading. We are actively investigating this issue. Until then, follow the workaround above.
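As an unverified mitigation, you could pre-fetch the Punkt data in each worker pod before re-submitting the job, so the workers find it in place instead of racing the download. The pod name below is from the sample output earlier (substitute your own), and this assumes python3 and nltk are available in the container image.

kubectl exec -it kuberay-trn1-worker-workergroup-lwc2f -- python3 -c "import nltk; nltk.download('punkt_tab')"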
Pradeep Kadubandi - AWS ML Engineer
Chakra Nagarajan - AWS Principal Specialist SA - Accelerated Computing
Sindhura Palakodety - AWS Senior ISV Generative AI Solutions Architect