Tutorial: Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS


This tutorial shows how to launch a distributed PyTorch Neuron training job on multiple Trn1 nodes within an Amazon Elastic Kubernetes Service (EKS) cluster. In this example, the BERT-large model will undergo DataParallel-based phase1 pretraining using the WikiCorpus dataset. TorchX will be used to launch the job on 2 trn1.32xlarge (or trn1n.32xlarge) instances, with 32 workers per instance.

The tutorial covers all steps required to prepare the EKS environment and launch the training job:

  1. Sandbox Setup
  2. Cluster and Tools
  3. Training Job Preparation & Launch
  4. Monitor Training
  5. Clean-up

[Figure: Multi-Node PyTorch Neuron Flow]

[Figure: Architecture Diagram]

1. Sandbox Setup

This tutorial assumes that you will use an x86-based Linux jump host to launch and manage the EKS cluster and PyTorch Neuron training jobs.

Note: It is highly recommended that you use the same IAM user/role to create and manage your EKS cluster. If you create your EKS cluster using one user/role and then attempt to manage it using a different user/role within the same account, you will need to modify your EKS configuration to grant system:masters access to the second user/role.

If you prefer to use your local computer instead of a jump host, please ensure that your computer is x86-based. Attempting to launch training jobs via TorchX from a non-x86 host (ex: an ARM-based M1 Mac) will lead to errors because the resulting Docker containers will be built for the wrong architecture.

1.1 Launch a Linux jump host

Begin by choosing an AWS region that supports both EKS and Trainium (ex: us-east-1, us-west-2). In this tutorial we will assume the use of us-west-2.

In your chosen region, use the AWS Console or AWS CLI to launch an instance with the following configuration:

  • Instance Type: t3.large
  • AMI: Amazon Linux 2 AMI (HVM)
  • Key pair name: (choose a key pair that you have access to)
  • Auto-assign public IP: Enabled
  • Storage: 100 GiB root volume

1.2 Configure AWS credentials on the jump host

Create a new IAM user in the AWS Console

Refer to the AWS IAM documentation in order to create a new IAM user with the following parameters:

  • User name: eks_tutorial
  • Select AWS credential type: enable Access key - Programmatic access
  • Permissions: choose Attach existing policies directly and then select AdministratorAccess

Be sure to record the ACCESS_KEY_ID and SECRET_ACCESS_KEY that were created for the new IAM user.

Log into your jump host instance using one of the following techniques

  • Connect to your instance via the AWS Console using EC2 Instance Connect
  • SSH to your instance's public IP using the key pair you specified above.
    • Ex: ssh -i KEYPAIR.pem ec2-user@INSTANCE_PUBLIC_IP_ADDRESS

Configure the AWS CLI with your IAM user's credentials

Run aws configure, entering the ACCESS_KEY_ID and SECRET_ACCESS_KEY you recorded above. For Default region name be sure to specify the same region used to launch your jump host, ex: us-west-2.

bash> aws configure
AWS Access Key ID [None]:  ACCESS_KEY_ID
AWS Secret Access Key [None]: SECRET_ACCESS_KEY
Default region name [None]: us-west-2
Default output format [None]: json

Clone this repo to your jump host

sudo yum install -y git
git clone
cd aws-neuron-eks-samples/dp_bert_hf_pretrain

2. Cluster and Tools

2.1 Install and configure eksctl, kubectl, and Docker on the jump host

Install eksctl using the following commands

curl --silent --location "$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

Run eksctl version to confirm that eksctl has been installed correctly:

bash> eksctl version

Install kubectl using the following commands

curl -o kubectl
chmod u+x kubectl
sudo mv kubectl /usr/local/bin

Run kubectl version --short 2>&1 | grep Client to confirm that kubectl has been installed correctly:

bash> kubectl version --short 2>&1 | grep Client
Client Version: v1.25.7-eks-a59e1f0

Note: The above commands will install kubectl version 1.25.7. If you require a different version of kubectl, please refer to the EKS documentation.

Install Docker using the following commands

sudo yum install -y docker jq
sudo service docker start
sudo usermod -aG docker ec2-user

Note: You will need to disconnect/reconnect to your jump host (or run newgrp docker) before you will be able to run any Docker commands on the jump host.
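
Before continuing, it can be useful to confirm that the tools installed so far are actually on your PATH. The following is a minimal sketch; missing_tools is a hypothetical helper written for this illustration and is not part of the tutorial repository.

```shell
# Print the name of every listed tool that is NOT found on PATH.
# Returns 0 (and prints nothing) when all tools are present, 1 otherwise.
missing_tools() {
    local tool
    local missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run on the jump host):
#   missing_tools aws eksctl kubectl docker jq || echo "install the tools listed above"
```

The helper only checks that each command can be found; it does not verify versions or that the Docker daemon is running.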

Install and configure docker-credential-ecr-login

TorchX depends on the docker-credential-ecr-login helper to authenticate with your ECR repository in order to push/pull container images. Run the following commands to install and configure the credential helper:

mkdir -p ~/.docker
cat <<EOF > ~/.docker/config.json
{
    "credsStore": "ecr-login"
}
EOF

sudo yum install -y amazon-ecr-credential-helper

2.2 Create ECR repo

When working with TorchX, a container repository is required to host the container images that are used to launch and run training jobs. Run the following command to create a new Elastic Container Registry (ECR) repository called eks_torchx_tutorial:

aws ecr create-repository --repository-name eks_torchx_tutorial

Confirm that the repository was successfully created by running:

aws ecr describe-repositories --repository-name eks_torchx_tutorial --query repositories[0].repositoryUri

If successful, the command will output the URI of your new ECR repository:

bash> aws ecr describe-repositories --repository-name eks_torchx_tutorial --query repositories[0].repositoryUri

2.3 Create EKS cluster

Determine which availability zones will be used by EKS

When provisioning an EKS cluster you need to specify 2 availability zones for the cluster. For this tutorial, it is important to choose 2 availability zones that support AWS Trainium. Run the following commands to automatically choose the appropriate availability zones for the us-west-2 region:


If the command is successful you will see a message similar to the following:

bash> ./scripts/

Your EKS availability zones are us-west-2d and us-west-2c
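
For readers who want to see how such a selection could work, here is a rough sketch of the idea. It is not the repository's script: list_instance_type_azs and pick_two_azs are hypothetical helper names, and picking the first two zones alphabetically is an assumption made for illustration. The aws ec2 describe-instance-type-offerings call is a standard AWS CLI command.

```shell
# List the availability zones in REGION that offer INSTANCE_TYPE.
# Requires the AWS CLI and valid credentials.
list_instance_type_azs() {
    local region="$1" instance_type="$2"
    aws ec2 describe-instance-type-offerings \
        --region "$region" \
        --location-type availability-zone \
        --filters "Name=instance-type,Values=$instance_type" \
        --query 'InstanceTypeOfferings[].Location' \
        --output text
}

# Read whitespace-separated zone names on stdin and print the first two
# (after sorting) as "zone1,zone2". Fails if fewer than two are available.
pick_two_azs() {
    local zones
    zones=$(tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u | head -n 2 | paste -sd, -)
    case "$zones" in
        *,*) echo "$zones" ;;
        *)   echo "need at least two availability zones" >&2; return 1 ;;
    esac
}

# Example:
#   list_instance_type_azs us-west-2 trn1.32xlarge | pick_two_azs
```

The zones printed by this sketch may differ from those chosen by the tutorial's script, which may apply additional criteria.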

Create an EKS cluster manifest by running the following commands

bash> ./scripts/

Successfully wrote eks_cluster.yaml

Examine the EKS cluster manifest

cat eks_cluster.yaml

Your EKS cluster manifest should look similar to the following:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-trn1-cluster
  region: us-west-2
  version: "1.25"

iam:
  withOIDC: true

availabilityZones: ["us-west-2d","us-west-2c"]

Create the EKS cluster from the manifest

eksctl create cluster -f eks_cluster.yaml

It may take about 10-12 minutes to create the EKS cluster. Once complete, you will be able to see your new cluster in the output of the eksctl get cluster command:

bash> eksctl get cluster
my-trn1-cluster  us-west-2  True

Create EKS Trn1 Nodegroup resources

Run the provided script to create the parameters file (cfn_params.json) required by the EKS Nodegroup resources CloudFormation template:


Then create the EKS Nodegroup resources CloudFormation stack using the provided template and newly created parameters file:

aws cloudformation create-stack \
--stack-name eks-trn1-ng-stack \
--template-body file://cfn/eks_trn1_ng_stack.yaml \
--parameters file://cfn_params.json \
--capabilities CAPABILITY_IAM

Run the following command and wait for StackStatus to change from CREATE_IN_PROGRESS to CREATE_COMPLETE. When you see CREATE_COMPLETE, press CTRL-C to return to the bash prompt.

watch -n10 'aws cloudformation describe-stacks --stack-name eks-trn1-ng-stack|grep StackStatus'

Alternatively, you can monitor the status of the eks-trn1-ng-stack stack in the CloudFormation section of the AWS Console, and proceed when the stack shows CREATE_COMPLETE.
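
If you would rather have a script block until the stack is ready instead of watching the output, a small polling loop can do the same job. The sketch below is illustrative only; poll_until is a hypothetical helper, and the interval and retry count in the example are arbitrary choices.

```shell
# Re-run a command every INTERVAL seconds until its output contains PATTERN.
# Gives up and returns 1 after MAX_TRIES attempts.
poll_until() {
    local pattern="$1" interval="$2" max_tries="$3"
    shift 3
    local try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@" 2>/dev/null | grep -q -- "$pattern"; then
            return 0
        fi
        try=$((try + 1))
        [ "$try" -le "$max_tries" ] && sleep "$interval"
    done
    return 1
}

# Example: check every 10 seconds, for up to 15 minutes.
#   poll_until CREATE_COMPLETE 10 90 \
#       aws cloudformation describe-stacks --stack-name eks-trn1-ng-stack \
#           --query 'Stacks[0].StackStatus' --output text
```

Note that this loop keeps polling until it times out if the stack creation fails. The AWS CLI also provides aws cloudformation wait stack-create-complete --stack-name eks-trn1-ng-stack, which blocks until the stack is created and returns an error if creation does not succeed.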

2.4 Create the EKS Trn1 Nodegroup

First generate the EKS Nodegroup manifest files:


Next, use eksctl to create the EKS Nodegroup from the manifest. This step will launch a nodegroup consisting of 2 trn1.32xlarge instances and join them to your EKS cluster. Note: If you would like to use trn1n.32xlarge instances (instead of trn1.32xlarge) to take advantage of the additional networking, you can substitute "trn1n_nodegroup.yaml" in the following command:

eksctl create nodegroup -f trn1_nodegroup.yaml

Now confirm that your Trn1 Nodegroup has Status ACTIVE by running the following command:

eksctl get nodegroup --cluster my-trn1-cluster -o yaml
bash> eksctl get nodegroup --cluster my-trn1-cluster -o yaml
- AutoScalingGroupName: eks-trn1-ng1-abcde888-3b4d-9ec8-f72b-389ac36caaaa
  Cluster: my-trn1-cluster
  CreationTime: "2022-12-20T22:43:11.328Z"
  DesiredCapacity: 2
  ImageID: ami-0f7f7ced0aaba64e9
  InstanceType: trn1.32xlarge
  MaxSize: 2
  MinSize: 2
  Name: trn1-ng1
  NodeInstanceRoleARN: arn:aws:iam::XXXXXXXXXXXXXX:role/eksctl-my-trn1-cluster-nodegroup-NodeInstanceRole-RQB6Y4CXSDEK
  StackName: eksctl-my-trn1-cluster-nodegroup-trn1-ng1
  Status: ACTIVE
  Type: managed
  Version: "1.25"
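
The same check can be scripted. The following sketch assumes the YAML layout shown above, with one "Status:" field per nodegroup; all_nodegroups_active is a hypothetical helper name.

```shell
# Read `eksctl get nodegroup -o yaml` output on stdin. Succeed only if at
# least one nodegroup is listed and every "Status:" field is ACTIVE.
all_nodegroups_active() {
    awk '
        $1 == "Status:" { total++; if ($2 == "ACTIVE") active++ }
        END { exit !(total > 0 && active == total) }
    '
}

# Example:
#   eksctl get nodegroup --cluster my-trn1-cluster -o yaml | all_nodegroups_active \
#       && echo "all nodegroups are ACTIVE"
```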

2.5 Install Neuron and EFA k8s plugins

In order to use Trn1 instances with EKS, a few Neuron and EFA plugins are required. Run the following kubectl commands to install the Neuron, Neuron RBAC, and EFA plugins on your new EKS cluster:

kubectl apply -f
kubectl apply -f

helm repo add eks
helm install aws-efa-k8s-device-plugin --namespace kube-system eks/aws-efa-k8s-device-plugin

Next, run kubectl get pods -n kube-system and verify that the EFA and Neuron daemonsets are running:

bash> kubectl get pods -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
aws-efa-k8s-device-plugin-daemonset-gpntd   1/1     Running   0          59s
aws-efa-k8s-device-plugin-daemonset-v79qx   1/1     Running   0          59s
aws-node-bgs5l                              1/1     Running   0          14h
aws-node-z6rjf                              1/1     Running   0          14h
coredns-57ff979f67-fm72z                    1/1     Running   0          14h
coredns-57ff979f67-m2mcj                    1/1     Running   0          14h
kube-proxy-7m8zk                            1/1     Running   0          14h
kube-proxy-7rxhq                            1/1     Running   0          14h
neuron-device-plugin-daemonset-cpdz8        1/1     Running   0          51s
neuron-device-plugin-daemonset-nvb8s        1/1     Running   0          51s
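
With many pods in the listing it is easy to miss one that is not healthy. As a rough sketch, the listing can be filtered so that only pods that are not Running are printed. not_running_pods is a hypothetical helper and assumes the default kubectl get pods column layout (NAME, READY, STATUS, ...).

```shell
# Read `kubectl get pods` output on stdin and print "NAME STATUS" for every
# pod whose STATUS is not Running. If a prefix is given as the first
# argument, only pods whose name starts with that prefix are considered.
not_running_pods() {
    awk -v prefix="$1" '
        NR > 1 && (prefix == "" || index($1, prefix) == 1) && $3 != "Running" {
            print $1, $3
        }
    '
}

# Example: show only unhealthy EFA plugin pods (no output means all are Running).
#   kubectl get pods -n kube-system | not_running_pods aws-efa
```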

Note: If the aws-efa-k8s-device-plugin-daemonset pods indicate an error status of CreateContainerConfigError, please delete and re-apply the daemonset using the following commands and then re-check the daemonset status as indicated above:

helm uninstall aws-efa-k8s-device-plugin --namespace kube-system
helm install aws-efa-k8s-device-plugin --namespace kube-system eks/aws-efa-k8s-device-plugin

Install and configure TorchX and Volcano

TorchX is a universal launcher for PyTorch jobs, and supports a variety of schedulers including AWS Batch, Docker, Kubernetes, Slurm, Ray, and more.

This tutorial makes use of the Kubernetes scheduler, which depends on the open-source Volcano batch system.

In this section, you will install Volcano and then configure a job queue.

Install Volcano and etcd by running the following commands on the jump host

kubectl apply -f
kubectl apply -f

Create a test queue in Volcano

For TorchX to use Volcano, at least one job queue must be defined in Volcano. Run the following command to create a simple test queue:


If the command is successful you will see a message similar to the following: created

If you receive an error stating "no endpoints available for service volcano-admission-service" or "connect: connection refused", please wait a few seconds and retry the ./scripts/ command.
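
For reference, a Volcano queue is an ordinary Kubernetes resource. The sketch below shows what a minimal queue manifest could look like; it is not necessarily what the tutorial's script applies. write_queue_manifest and the file name queue.yaml are hypothetical, and the weight of 1 is an assumed value. The queue name test matches the queue=test setting used with torchx run later in this tutorial.

```shell
# Write a minimal Volcano Queue manifest for queue NAME to OUTFILE.
write_queue_manifest() {
    local name="$1" outfile="$2"
    cat > "$outfile" <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ${name}
spec:
  weight: 1
EOF
}

# Example:
#   write_queue_manifest test queue.yaml && kubectl apply -f queue.yaml
#   kubectl get queue test
```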

Install TorchX

Use pip to install TorchX on the jump host:

pip3 install "torchx[kubernetes]"

Install and configure FSx for Lustre CSI

In this tutorial, TorchX is used to launch a DataParallel BERT phase1 pretraining job using 64 workers across 2 trn1.32xlarge (or trn1n.32xlarge) instances (with 32 workers per instance).

BERT phase1 pretraining uses a 50+ GB WikiCorpus dataset as the training dataset. For large datasets such as this, it is inefficient to include the dataset inside the training container image or to download the dataset at the beginning of each training job. A more efficient approach is to use a Kubernetes persistent shared storage volume to host the dataset.

The following steps show how to host the WikiCorpus dataset on a shared volume provided by FSx for Lustre.

Install the FSx for Lustre CSI driver on the EKS cluster

First run the following command to create the appropriate FSx CSI service account on the EKS cluster:


Next, run the following commands to install the FSx for Lustre CSI driver on the EKS cluster:

kubectl apply -k ""

If successful, an entry for the FSx for Lustre CSI driver will appear in the output of the kubectl get csidriver command:

bash> kubectl get csidriver
NAME              ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
                  false            false            false                             false               Persistent   37m
                  false            false            false                             false               Persistent   16s

Create and apply the storage class manifest for Lustre storage

kubectl apply -f storageclass.yaml

Create and apply the persistent volume claim (PVC) manifest for Lustre storage

kubectl apply -f claim.yaml
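
To illustrate what a claim of this kind contains, the sketch below writes a persistent volume claim manifest using the values visible in the kubectl get pvc output shown later in this section (claim name fsx-claim, storage class fsx-sc, 1200Gi, ReadWriteMany). It is an approximation and may differ from the repository's claim.yaml. write_fsx_claim and the output file name are hypothetical, and the storage class manifest is not reproduced because it depends on your subnet and security group.

```shell
# Write a PersistentVolumeClaim manifest for the Lustre volume to OUTFILE.
write_fsx_claim() {
    local outfile="$1"
    cat > "$outfile" <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi
EOF
}

# Example (writes to a separate file so the provided claim.yaml is untouched):
#   write_fsx_claim fsx-claim-sketch.yaml && cat fsx-claim-sketch.yaml
```

ReadWriteMany is what allows pods on both Trn1 nodes to mount the same volume at once.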

Confirm that the persistent volume claim is 'bound' to the EKS cluster

After you apply the above persistent volume claim manifest, an FSx for Lustre filesystem will automatically be provisioned for you. This process will take 5-10 minutes. To monitor the provisioning process, run the following command and wait for the Status field to change from "Pending" to "Bound". When the Status shows as "Bound", you can press CTRL-C to return to the bash prompt.

watch -n10 'kubectl get pvc'
Every 10.0s: kubectl get pvc

fsx-claim   Bound    pvc-abcdabcd   1200Gi     RWX            fsx-sc         6m24s

3. Training Job Preparation & Launch

3.1 Build the BERT pretraining and command shell container images and push them to ECR

Run the following commands on the jump host to build the pretraining and command shell container images and push them into your ECR repository:

ECR_REPO=$(aws ecr describe-repositories --repository-name eks_torchx_tutorial \
    --query repositories[0].repositoryUri --output text)
docker build ./docker -f docker/Dockerfile.bert_pretrain -t $ECR_REPO:bert_pretrain
docker build ./docker -f docker/Dockerfile.cmd_shell -t $ECR_REPO:cmd_shell
docker push $ECR_REPO:bert_pretrain
docker push $ECR_REPO:cmd_shell

3.2 Copy BERT pretraining dataset to the Lustre-hosted persistent volume

Create and apply a manifest for a command shell pod that can be used to access the persistent volume

kubectl apply -f cmd_shell_pod.yaml
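
As a hedged illustration of what such a manifest might contain, the sketch below generates a pod that mounts the fsx-claim volume at /data and then idles so that you can open a shell in it. This is an assumption about the general shape, not the repository's manifest: write_cmd_shell_pod, the volume name data, and the idle loop are all made up for this example.

```shell
# Write a pod manifest that runs IMAGE, mounts the fsx-claim PVC at /data,
# and idles until deleted. Output goes to OUTFILE.
write_cmd_shell_pod() {
    local image="$1" outfile="$2"
    cat > "$outfile" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cmd-shell
spec:
  restartPolicy: Never
  containers:
    - name: cmd-shell
      image: ${image}
      command: ["/bin/sh", "-c", "while true; do sleep 3600; done"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: fsx-claim
EOF
}

# Example (ECR_REPO as set in section 3.1):
#   write_cmd_shell_pod "$ECR_REPO:cmd_shell" cmd-shell-sketch.yaml
```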

Wait for command shell pod to go into Running state

Periodically run kubectl get pods until you see the cmd-shell pod show as Running:

kubectl get pods
bash> kubectl get pods
cmd-shell     1/1     Running   0          2m55s

Open an interactive bash prompt on the command shell pod

When the command shell pod is running, run the following command to open a bash prompt on the pod:

kubectl exec -it cmd-shell -- /bin/bash

Copy and extract the WikiCorpus dataset to the persistent volume

Run the following commands from within the bash prompt on the command shell pod:

cd /data
aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar . --no-sign-request
tar xvf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar

When the above commands have completed, you can exit the command shell pod by typing exit or pressing CTRL-D.

Then delete the cmd-shell pod by running:

kubectl delete pod cmd-shell

3.3 Precompile the BERT graphs using neuron_parallel_compile

PyTorch Neuron comes with a tool called neuron_parallel_compile which reduces graph compilation time by extracting model graphs and then compiling the graphs in parallel. The compiled graphs are stored on the shared storage volume where they can be accessed by the worker nodes during model training.

To precompile the BERT graphs, run the following commands:

ECR_REPO=$(aws ecr describe-repositories --repository-name eks_torchx_tutorial \
    --query repositories[0].repositoryUri --output text)

torchx run \
    -s kubernetes --workspace="file:///$PWD/docker" \
    -cfg queue=test,image_repo=$ECR_REPO \
    lib/ \
    --name bertcompile \
    --script_args "--batch_size 16 --grad_accum_usteps 32 \
        --data_dir /data/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128 \
        --output_dir /data/output --steps_this_run 10" \
    --nnodes 2 \
    --nproc_per_node 32 \
    --image $ECR_REPO:bert_pretrain \
    --script \
    --bf16 True \
    --cacheset bert-large \
    --precompile True \
    --instance_type trn1.32xlarge

Note: if you are using trn1n instances, please adjust the --instance_type parameter above to be trn1n.32xlarge.

In the above command you will note various options that are passed to the torchx run command:

  • -s kubernetes: Selects the Kubernetes scheduler
  • --workspace "file:///$PWD/docker": Specifies a local workspace directory that TorchX will overlay on top of the base training container image as part of the training job. This can be used to add a new training script and dependencies to the training container, modify an existing training script, add a small dataset to the container, etc. This tutorial uses the local docker build directory as the workspace, but the workspace can be any local directory. Note: the local workspace directory is overlaid at the root of the training container, so be careful not to overwrite required system directories such as /bin, /lib, or /etc; otherwise your TorchX container might not be able to run.
  • -cfg queue=test,image_repo=$ECR_REPO: Configures the job queue and ECR repo used for TorchX images
  • lib/ Path to the Python function used to programmatically create the TorchX AppDef for this job. See lib/ for additional details.
  • --name bertcompile: Name of this TorchX job
  • --script_args "...": Command-line arguments that will be passed to the training script. When performing precompilation, it is advised to limit the number of training steps to ~10 as we do here using --steps_this_run 10
  • --nnodes 2: Number of trn1 nodes required for this job
  • --nproc_per_node 32: Number of training processes to run per node
  • --image $ECR_REPO:bert_pretrain: The container image to use for the training job
  • --script Name of the training script to run inside the training container
  • --bf16 True: Enable BF16 data type for training
  • --cacheset bert-large: A user-specified string used to prefix the Neuron and Transformers caches on shared storage. The cacheset can be shared across TorchX jobs but should not be used by jobs that will run concurrently. If multiple concurrent jobs share a cacheset, cache corruption could occur.
  • --precompile True: Launch the training script using Neuron's neuron_parallel_compile tool in order to precompile the graphs
  • --instance_type trn1.32xlarge: Specify which type of trn1 instance to use for this job (trn1.32xlarge or trn1n.32xlarge)
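
The relationship between these options can be checked with some quick arithmetic. The sketch below computes the total worker count and, under the assumption that --batch_size is the per-worker micro-batch and --grad_accum_usteps is the number of micro-steps accumulated per optimizer update, the effective global batch size. The function names are hypothetical.

```shell
# Total number of workers = nodes x processes per node.
world_size() {
    echo $(( $1 * $2 ))
}

# Sequences consumed per optimizer step, ASSUMING batch_size is per worker
# and grad_accum is the number of micro-steps per optimizer update.
global_batch_size() {
    local batch_size="$1" grad_accum="$2" workers="$3"
    echo $(( batch_size * grad_accum * workers ))
}

workers=$(world_size 2 32)
echo "workers: $workers"                                    # prints: workers: 64
echo "global batch: $(global_batch_size 16 32 "$workers")"  # prints: global batch: 32768
```

If you change --nnodes or --nproc_per_node, the global batch size changes with them unless the other two values are adjusted to compensate.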

Run kubectl get pods and check to ensure that you see 2 bertcompile- pods that are "Running". If the status shows as "ContainerCreating", please wait a few seconds and re-run the command until status changes to "Running".

bash> kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
bertcompile-hpzwjhg4zlq25c-role1-0-0   1/1     Running   0          5m52s
bertcompile-hpzwjhg4zlq25c-role1-1-0   1/1     Running   0          5m52s

Next, choose one of your bertcompile- pods and run the following command to monitor the output of the precompilation job:


The precompilation job will run for ~15 minutes. Once complete, you will see the following in the output:

2022-12-21 00:07:10.000925: INFO ||PARALLEL_COMPILE||: Total graphs: 6
2022-12-21 00:07:10.000925: INFO ||PARALLEL_COMPILE||: Total successful compilations: 6
2022-12-21 00:07:10.000925: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0
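
If you want to check this summary from a script rather than by eye, the log can be scanned for the failure count. This is a minimal sketch that assumes the summary line format shown above; compile_succeeded is a hypothetical helper.

```shell
# Read neuron_parallel_compile log output on stdin. Succeed only if a
# "Total failed compilations:" line is present and reports 0.
compile_succeeded() {
    awk '
        /Total failed compilations:/ { seen = 1; failed = $NF + 0 }
        END { exit !(seen && failed == 0) }
    '
}

# Example:
#   kubectl logs YOUR_POD_NAME | compile_succeeded && echo "precompilation OK"
```

The check fails both when compilations failed and when the summary has not been printed yet, so a failure early in the job simply means it is still running.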

3.4 Launch BERT pretraining job using 64 workers across 2 trn1.32xlarge (or trn1n.32xlarge) instances

Run the following commands to launch the 64-worker BERT pretraining job on the EKS cluster:

ECR_REPO=$(aws ecr describe-repositories --repository-name eks_torchx_tutorial \
    --query repositories[0].repositoryUri --output text)

torchx run \
    -s kubernetes --workspace="file:///$PWD/docker" \
    -cfg queue=test,image_repo=$ECR_REPO \
    lib/ \
    --name berttrain \
    --script_args "--batch_size 16 --grad_accum_usteps 32 \
        --data_dir /data/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128 \
        --output_dir /data/output" \
    --nnodes 2 \
    --nproc_per_node 32 \
    --image $ECR_REPO:bert_pretrain \
    --script \
    --bf16 True \
    --cacheset bert-large \
    --instance_type trn1.32xlarge

Note: if you are using trn1n instances, please adjust the --instance_type parameter above to be trn1n.32xlarge.

4. Monitor Training

Run the following command to check the status of the recently submitted training job:

kubectl get vcjob
bash> kubectl get vcjob
NAME                         STATUS      MINAVAILABLE   RUNNINGS   AGE
bertcompile-hpzwjhg4zlq25c   Completed   2                         19m
berttrain-shpw6kn367csdc     Running     2              2          19s

When the status of your "berttrain-" job shows as Running, use the following command to identify the pods associated with the job:

kubectl get pods
bash> kubectl get pods
NAME                                   READY   STATUS      RESTARTS   AGE
bertcompile-hpzwjhg4zlq25c-role1-0-0   0/1     Completed   0          20m
bertcompile-hpzwjhg4zlq25c-role1-1-0   0/1     Completed   0          20m
berttrain-shpw6kn367csdc-role1-0-0     1/1     Running     0          86s
berttrain-shpw6kn367csdc-role1-1-0     1/1     Running     0          86s

To view the training script output, you first need to know which of the running pods represents the rank0 worker in the distributed training job. For the BERT pretraining script, only the rank0 worker outputs training metrics. During training job initialization, rank is randomly assigned among the participants, and is not directly related to the pod name. You can determine the rank0 worker by running the following script:

bash> ./scripts/
YOUR_POD_NAME is your rank0 worker pod.
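
One way to approach this by hand, since only the rank0 worker prints training metrics, is to look for a pod whose logs contain step_loss. The sketch below illustrates that idea and is not the repository's script. has_training_metrics, find_rank0_pod, and the LOG_CMD override are hypothetical, and the approach only works once the first metrics have been logged.

```shell
# Succeed if the log text on stdin contains training metrics.
has_training_metrics() {
    grep -q 'step_loss'
}

# Print the first pod (from the names given as arguments) whose logs contain
# training metrics. LOG_CMD is the command used to fetch a pod's logs and
# defaults to "kubectl logs".
find_rank0_pod() {
    local pod
    for pod in "$@"; do
        if ${LOG_CMD:-kubectl logs} "$pod" 2>/dev/null | has_training_metrics; then
            echo "$pod"
            return 0
        fi
    done
    return 1
}

# Example:
#   find_rank0_pod $(kubectl get pods -o name | grep berttrain)
```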

Once you have determined the name of your rank0 worker pod, you can substitute the pod name into the following command to view the training script output. If you do not see training metrics in the logs (as shown below), please wait 1-2 minutes and re-run the command.

kubectl logs YOUR_POD_NAME|tail -3
bash> kubectl logs YOUR_POD_NAME|tail -3
[0]:LOG Wed Dec 21 00:51:01 2022 - (0, 14) step_loss : 10.7011 learning_rate : 2.80e-06 throughput : 6449.36
[0]:LOG Wed Dec 21 00:51:06 2022 - (0, 15) step_loss : 10.6613 learning_rate : 3.00e-06 throughput : 6453.70
[0]:LOG Wed Dec 21 00:51:11 2022 - (0, 16) step_loss : 10.5861 learning_rate : 3.20e-06 throughput : 6458.87
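
To track only the numbers of interest, the metric lines can be reduced to their step_loss and throughput values. This sketch assumes the "name : value" layout shown in the log lines above; extract_metrics is a hypothetical helper.

```shell
# Read training log lines on stdin and print "STEP_LOSS THROUGHPUT" for each
# line that contains both metrics.
extract_metrics() {
    awk '
        /step_loss/ {
            loss = ""; tput = ""
            for (i = 1; i + 2 <= NF; i++) {
                if ($i == "step_loss")  loss = $(i + 2)
                if ($i == "throughput") tput = $(i + 2)
            }
            if (loss != "" && tput != "") print loss, tput
        }
    '
}

# Example:
#   kubectl logs YOUR_POD_NAME | extract_metrics | tail -3
```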

To continuously view the training script output (similar to the tail -f command in Linux), you can use the following command. The command can be terminated using CTRL-C.

kubectl logs -f YOUR_POD_NAME

View training progress in Tensorboard

The BERT training job also stores training metrics on the FSx for Lustre shared storage volume. To view these metrics in Tensorboard, you can launch a Tensorboard deployment within the EKS environment using the following script:


The script will first build a Tensorboard container and push it to your ECR repository. Next, the Tensorboard deployment will be launched within your EKS cluster. When the script completes, it will output a password-protected URL that you can use to access Tensorboard. Please note that it may take 1-2 minutes for the URL to become accessible.

Open the provided URL to access the Tensorboard interface.


Monitor Neuron device utilization using neuron-top

The Neuron SDK provides Neuron tools for monitoring Neuron devices on Inf1 and Trn1 instances. During a training job it is often useful to monitor Neuron device utilization using neuron-top, which provides a text-based view of device and memory utilization.

To view neuron-top statistics for one of your nodes, begin by choosing one of your running BERT training pods:

kubectl get pods|grep Running|grep bert

Substitute the name of one of your running pods into the following command to launch a bash prompt within the running pod:

kubectl exec -it YOUR_POD_NAME -- /bin/bash

At the bash prompt, run neuron-top:

neuron-top


When you are finished exploring neuron-top, press q to quit. At the pod's bash prompt, press CTRL-D to return to your jump host.

5. Clean-up

When you are finished with the tutorial, run the following commands on the jump host to remove the EKS cluster and associated resources:

# Delete Tensorboard deployment
kubectl delete -f tensorboard_manifest.yaml

# Delete any remaining jobs and pods
kubectl delete vcjob --all
kubectl delete pods --all

# Delete FSx resources
kubectl delete -f storageclass.yaml
kubectl delete -f claim.yaml

# Delete nodegroup - run the applicable command depending on which instance type you are using
eksctl delete nodegroup trn1-32xl-ng1 --cluster my-trn1-cluster --wait --approve    # for trn1
eksctl delete nodegroup trn1n-32xl-ng1 --cluster my-trn1-cluster --wait --approve   # for trn1n

# Delete Cluster resources
aws cloudformation delete-stack --stack-name eks-trn1-ng-stack
aws cloudformation wait stack-delete-complete --stack-name eks-trn1-ng-stack
eksctl delete cluster my-trn1-cluster

# Delete Container repository
aws ecr delete-repository --force --repository-name eks_torchx_tutorial

Lastly, terminate your jump host instance and delete the eks_tutorial IAM user via the AWS Console.