This tutorial shows how to launch a Ray + PTL + Neuron training job on multiple Trn1 nodes within an Amazon Elastic Kubernetes Service (EKS) cluster. In this example, the Llama3.1 8B model is fine-tuned on the open-source databricks/databricks-dolly-15k dataset from Hugging Face. Ray is used to launch the job on 2 trn1.32xlarge (or trn1n.32xlarge) instances, with 32 NeuronCores per instance.
PyTorch Lightning, developed by Lightning AI, is a library that provides a high-level interface for PyTorch and helps you organize your code and reduce boilerplate. By abstracting away engineering code, it makes deep learning experiments easier to reproduce and improves developer productivity.
Ray enhances ML workflows by seamlessly scaling fine-tuning and inference across distributed clusters, transforming single-node code into high-performance, multi-node operations with minimal effort.
AWS Neuron is an SDK with a compiler, runtime, and profiling tools that unlocks high-performance and cost-effective deep learning (DL) acceleration. It supports high-performance training on AWS Trainium instances. For model deployment, it supports high-performance and low-latency inference on AWS Inferentia.
The integration of Ray, PyTorch Lightning (PTL), and AWS Neuron combines PTL's intuitive model development API, Ray Train's robust distributed computing capabilities for seamless scaling across multiple nodes, and AWS Neuron's hardware optimization for Trainium. Together, they significantly simplify the setup and management of distributed training environments for large-scale AI projects, particularly those involving computationally intensive tasks such as training large language models.
The tutorial covers all steps required to prepare the EKS environment and launch the training job:
- Setting up the sandbox environment
- Setting up the EKS cluster and tools
- Creating the ECR repo and uploading the Docker image
- Creating the Ray cluster
- Preparing data
- Monitoring jobs
- Fine-tuning the model
- Deleting the environment
- Troubleshooting
Supported Regions: Begin by choosing an AWS region that supports both EKS and Trainium (for example, us-west-2, us-east-1, or us-east-2).
In your chosen region (for example, us-east-2), use the AWS Console or AWS CLI to launch an instance with the following configuration:
- Instance Type: m5.large
- AMI: Amazon Linux 2023 AMI (HVM)
- Key pair name: (choose a key pair that you have access to)
- Auto-assign public IP: Enabled
- Storage: 100 GiB root volume
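If you prefer to script this step, a minimal AWS CLI sketch of the same launch is shown below; the AMI ID and key pair name are placeholders you must replace with values from your own account and region.

# Sketch of the console configuration above; replace the AMI ID and key pair
# name with your own values before running.
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type m5.large \
    --key-name KEYPAIR \
    --associate-public-ip-address \
    --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=100}' \
    --region us-east-2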
Refer to the AWS IAM documentation to create a new IAM user with the following parameters:
- User name: eks_tutorial
- Select AWS credential type: enable Access key - Programmatic access
- Permissions: choose Attach existing policies directly, then select AdministratorAccess
Be sure to record the ACCESS_KEY_ID and SECRET_ACCESS_KEY that were created for the new IAM user.
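If you would rather create the user from the CLI, a minimal sketch follows; it assumes the credentials you are currently using are permitted to manage IAM.

aws iam create-user --user-name eks_tutorial
aws iam attach-user-policy --user-name eks_tutorial --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam create-access-key --user-name eks_tutorial   # record AccessKeyId and SecretAccessKey from the output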
- Connect to your instance via the AWS Console using EC2 Instance Connect, or SSH to your instance's public IP using the key pair you specified above, for example:

ssh -i KEYPAIR.pem ec2-user@INSTANCE_PUBLIC_IP_ADDRESS

- Run aws configure, entering the ACCESS_KEY_ID and SECRET_ACCESS_KEY you recorded above. For Default region name, be sure to specify the same region used to launch your jump host, for example, us-east-2.

bash> aws configure
AWS Access Key ID [None]: ACCESS_KEY_ID
AWS Secret Access Key [None]: SECRET_ACCESS_KEY
Default region name [None]: us-east-2
Default output format [None]:
Before you begin, make sure all the prerequisites are in place so the deployment process goes smoothly. Ensure that you have installed the following tools on your jump host.
Automation for prerequisites:
To install all the prerequisites above on the jump host, you can run this script, which is compatible with Amazon Linux 2023.
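After the script completes, you can sanity-check the installed tooling. This sketch assumes the usual tool set for the data-on-eks Terraform blueprints (AWS CLI, kubectl, Terraform, Docker); the exact list may vary with the script version.

aws --version
kubectl version --client
terraform --version
docker --version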
cd ~
git clone https://github.com/awslabs/data-on-eks.git
Navigate to the trainium-inferentia directory:
cd data-on-eks/ai-ml/trainium-inferentia
Run the following export commands to set the required environment variables.
# Enable FSx for Lustre, which will mount fine-tuning data to all pods across multiple nodes
export TF_VAR_enable_fsx_for_lustre=true
# Set the region according to your requirements. Check Trn1 instance availability in the specified region.
export TF_VAR_region=us-east-2
# Enable Volcano custom scheduler with KubeRay Operator
export TF_VAR_enable_volcano=true
# Note: This configuration will create two new Trn1 32xl instances. Ensure you validate the associated costs before proceeding. You can change the number of instances here.
export TF_VAR_trn1_32xl_min_size=2
export TF_VAR_trn1_32xl_desired_size=2
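Before running the installer, you can quickly confirm the variables are set as intended:

env | grep '^TF_VAR_'   # should list the variables exported above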
Run the installation script to provision an EKS cluster with all the add-ons needed for the solution.
./install.sh
Verify the Amazon EKS Cluster:
aws eks --region us-east-2 describe-cluster --name trainium-inferentia
# Creates k8s config file to authenticate with EKS
aws eks --region us-east-2 update-kubeconfig --name trainium-inferentia
kubectl get nodes # Output shows the EKS Managed Node group nodes
Verify that the Neuron device plugin DaemonSet is running with the following kubectl command:
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   2         2         2       2            2                           17d
List the allocatable NeuronCores on each node with the following kubectl command:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
NAME                                          NeuronCore
ip-192-168-65-41.us-west-2.compute.internal   32
ip-192-168-87-81.us-west-2.compute.internal   32
On your jump host:
sudo yum install -y git
git clone https://github.com/aws-neuron/aws-neuron-eks-samples.git
cd aws-neuron-eks-samples/llama3.1_8B_finetune_ray_ptl_neuron
The script 0-kuberay-trn1-llama3-finetune-build-image.sh checks whether the ECR repo kuberay_trn1_llama3.1_pytorch2 exists in the AWS account and creates it if it does not. The script then builds the Docker image and uploads it to this repo.
bash> chmod +x 0-kuberay-trn1-llama3-finetune-build-image.sh
bash> ./0-kuberay-trn1-llama3-finetune-build-image.sh
Enter the appropriate AWS region: # for example: us-east-2
If you have the required credentials, the Docker image is built and uploaded to the Amazon ECR repository in the specified AWS region.
Verify that the repository kuberay_trn1_llama3.1_pytorch2 was created successfully by heading to the Amazon ECR service in the AWS Console.
The manifest 1-llama3-finetune-trn1-create-raycluster.yaml creates a Ray cluster with a head pod and worker pods. Update the <AWS_ACCOUNT_ID> and <REGION> fields in the 1-llama3-finetune-trn1-create-raycluster.yaml file using the commands below, so they reference the ECR image URI created above:
bash> export AWS_ACCOUNT_ID=<enter_your_aws_account_id>   # for example: 111222333444
bash> export REGION=<enter_your_aws_region>               # for example: us-east-2
bash> sed -i "s/<AWS_ACCOUNT_ID>/$AWS_ACCOUNT_ID/g" 1-llama3-finetune-trn1-create-raycluster.yaml
bash> sed -i "s/<REGION>/$REGION/g" 1-llama3-finetune-trn1-create-raycluster.yaml
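After the substitutions, you can confirm that no placeholders remain; this check assumes the container image URI appears under an image: key in the manifest.

bash> grep -n "image:" 1-llama3-finetune-trn1-create-raycluster.yaml   # should show your account ID and region
bash> grep -c "<AWS_ACCOUNT_ID>" 1-llama3-finetune-trn1-create-raycluster.yaml   # should print 0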
Use the commands below to create the Ray cluster:
kubectl apply -f 1-llama3-finetune-trn1-create-raycluster.yaml
kubectl get pods # Ensure all head and worker pods are in Running state
The Ray cluster contains 1 head pod and 2 worker pods. Worker pods are deployed on the 2 Trainium instances (trn1.32xlarge).
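To confirm that the two worker pods were scheduled on the two Trainium nodes, include the node name in the pod listing:

kubectl get pods -o wide   # the NODE column shows which instance hosts each pod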
Use the command below to submit a Ray job for downloading the databricks/databricks-dolly-15k dataset and the Llama3.1 8B model:
kubectl apply -f 2-llama3-finetune-trn1-rayjob-create-data.yaml
You can check the output of kubectl get pods to find out whether the job has completed:
kubectl get pods
NAME                                              READY   STATUS      RESTARTS   AGE
2-llama3-finetune-trn1-rayjob-create-data-8qjfk   0/1     Completed   0          7m
cmd-shell                                         1/1     Running     0          10d
kuberay-trn1-head-zplg7                           1/1     Running     0          14m
kuberay-trn1-worker-workergroup-lwc2f             1/1     Running     0          14m
kuberay-trn1-worker-workergroup-zsm2z             1/1     Running     0          14m
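To inspect what the data-preparation job did, you can read its logs. The pod name suffix (8qjfk in the sample output above) will differ in your cluster, so substitute the name from your own kubectl get pods output.

kubectl logs 2-llama3-finetune-trn1-rayjob-create-data-8qjfk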
To view the Ray dashboard in a browser on your local machine:
kubectl port-forward service/kuberay-trn1-head-svc 8265:8265 &
Then head to http://localhost:8265/ in your local browser.
You can monitor the progress of the job in the Ray dashboard.
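If you have the Ray CLI installed on your local machine (for example, via pip install "ray[default]"), you can also query jobs through the same port-forward; this is optional and assumes the local Ray version is compatible with the cluster.

ray job list --address http://localhost:8265
ray job logs <JOB_ID> --address http://localhost:8265 --follow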
Use the command below to submit a Ray job for fine-tuning the model:
kubectl apply -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
Known Issues: If the Ray job fails with punkt or division by zero errors, see the Troubleshooting section below.
Model artifacts will be created under /shared/neuron_compile_cache/. Check the Ray logs for the "Training Completed" message.
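One way to check for that message without opening the dashboard is to grep the logs of the fine-tuning job's pod. Which pod carries the message can vary, so substitute the pod name from your kubectl get pods output.

kubectl logs <FINETUNE_JOB_POD> | grep "Training Completed"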
When you are finished with the tutorial, run the following commands on the jump host to remove the EKS cluster and associated resources:
# Delete Ray Jobs
kubectl delete -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
kubectl delete -f 2-llama3-finetune-trn1-rayjob-create-data.yaml
# Delete Ray Cluster
kubectl delete -f 1-llama3-finetune-trn1-create-raycluster.yaml
# Delete ECR Repo
Head to the AWS console and delete the ECR repo: kuberay_trn1_llama3.1_pytorch2
# Clean Up the EKS Cluster and Associated Resources:
cd data-on-eks/ai-ml/trainium-inferentia
./cleanup.sh
Terminate your EC2 jump host instance.
Delete the eks_tutorial IAM user via the AWS Console.
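If you prefer the CLI, the user can be removed there as well; access keys and attached policies must be deleted before the user itself. The access key ID below is a placeholder.

aws iam delete-access-key --user-name eks_tutorial --access-key-id <ACCESS_KEY_ID>
aws iam detach-user-policy --user-name eks_tutorial --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam delete-user --user-name eks_tutorial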
Known Issues:
If the job fails with the errors below:
(RayTrainWorker pid=3462, ip=100.64.83.225) [nltk_data] '/root/nltk_data/tokenizers/punkt_tab.zip'
(RayTrainWorker pid=3464, ip=100.64.83.225) [nltk_data] Error with downloaded zip file
(RayTrainWorker pid=3483, ip=100.64.83.225) Bad CRC-32 for file 'punkt_tab/czech/ortho_context.tab'
File "/tmp/ray/session_2024-11-13_06-40-30_347972_17/runtime_resources/working_dir_files/_ray_pkg_5ad2ee50e13a7e91/ray_neuron_xla_config_20.py", line 20, in _set_xla_env_vars
    "GROUP_WORLD_SIZE": str(context.get_world_size() / local_world_size),
ZeroDivisionError: division by zero
Workaround:
If your Ray fine-tuning job fails with errors associated with punkt or division by zero, delete the Ray job using the commands below, wait for 5 minutes, and re-run it. If the job fails again, wait 5 more minutes and re-run it a second time.
kubectl delete -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
kubectl apply -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
If you still face issues, reach out to us via the documentation. To report bugs, raise an issue via GitHub Issues.
Probable Cause: Punkt is a tokenizer used in Natural Language Processing (NLP) that is part of the NLTK (Natural Language Toolkit) library in Python. The errors above appear to occur when the code tries to use the Punkt data before it has finished downloading. We are actively investigating this issue. Until then, follow the workaround above.
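As an unverified mitigation, you could pre-fetch the Punkt data in each worker pod before re-submitting the job, so the workers find it in place instead of racing the download. The pod name below is from the sample output earlier (substitute your own), and this assumes python3 and nltk are available in the container image.

kubectl exec -it kuberay-trn1-worker-workergroup-lwc2f -- python3 -c "import nltk; nltk.download('punkt_tab')"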
Pradeep Kadubandi - AWS ML Engineer
Chakra Nagarajan - AWS Principal Specialist SA - Accelerated Computing
Sindhura Palakodety - AWS Senior ISV Generative AI Solutions Architect