Inference services need to be (1) reliable, (2) performant, and (3) cost-effective. Reliability means quick, automatic recovery from hardware and software failures. Performance means low-latency responses to user requests, which requires scaling compute accelerators seamlessly and quickly. Cost-effectiveness means keeping resource utilization optimal and releasing resources that are no longer needed.
This solution describes a stabilityai/stable-diffusion-2-1-base inference pipeline that is reliable, performant, and cost-effective, built on AWS services and EC2 Inferentia instances. CodePipeline orchestrates CodeBuild builds of DLC-based container images that are sourced from a GitHub repository and pushed to ECR. We compile the model with a K8s batch/v1 Job that stores the compiled artifact in S3. Next, we deploy a K8s apps/v1 Deployment that pulls the model from S3 and instantiates it from the images stored in ECR. Finally, we deploy a networking.k8s.io/v1 Ingress that creates an AWS Load Balancer and target groups that auto-discover the pods powering the inference app (Gradio).
For reliability, we use Karpenter. A Karpenter NodePool manages the Inferentia node lifecycle: Karpenter adds Inferentia nodes to handle unschedulable pods, schedules those pods on the new nodes, and removes the nodes when they are no longer needed.
Our NodePool uses the instance-family key to specify the allowed instance types, i.e., inf2.xlarge, inf2.8xlarge, inf2.24xlarge, and inf2.48xlarge (a fuller NodePool sketch follows the snippet):
- key: karpenter.k8s.aws/instance-family
  operator: In
  values: ["inf2"]
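For reference, here is a minimal sketch of a complete NodePool that embeds this requirement. It assumes the Karpenter v1beta1 API and an EC2NodeClass named default; the pool name and disruption settings are illustrative rather than the exact manifest used in this post.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: inf2
spec:
  template:
    spec:
      nodeClassRef:
        name: default              # assumed EC2NodeClass name
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["inf2"]
  disruption:
    consolidationPolicy: WhenEmpty  # remove Inferentia nodes once their pods are gone
    consolidateAfter: 300s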
The pod resource limits determine which Inf2 instance Karpenter requests, based on the total number of Neuron devices required: inf2.xlarge and inf2.8xlarge each provide one Neuron device, inf2.24xlarge provides six, and inf2.48xlarge provides twelve. In the example below, Karpenter launches an inf2.xlarge when the ReplicaSet size is 1 and moves to larger instance types as the total number of requested Neuron devices grows.
resources:
  limits:
    aws.amazon.com/neuron: 1
The neuron-device-plugin returns the list of Neuron cores/devices to the kubelet, which advertises them to the K8s API server (and, in turn, to the kube-scheduler). When the kube-scheduler picks up a pod creation request, it calls the neuron-scheduler-extn filter function with the list of candidate nodes and the pod specification. The scheduler extension scans the nodes, filters out those without contiguous cores/devices, and returns the nodes capable of satisfying the pod specification. The kube-scheduler then invokes the extension's bind function with the pod and the selected node; the extension updates the pod annotation with the allocated contiguous Neuron core/device IDs and sends the bind request to the kubelet on that node. The kubelet calls the Alloc function of the neuron-device-plugin, which reads the allocated core/device IDs from the pod annotation and exposes those devices to the container runtime. To learn more about how the Neuron scheduler filters nodes, see the documentation.
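To route a pod through the extended scheduler, the pod spec references the custom scheduler by name. Below is a minimal fragment assuming the custom scheduler deployed later in this walkthrough (my-scheduler.yml) is registered as my-scheduler; the pod name is hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: sd21-inf2-example          # hypothetical pod name, for illustration
spec:
  schedulerName: my-scheduler      # route scheduling through the neuron-scheduler-extn
  containers:
  - name: app
    image: ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/stablediffusion:amd64-neuron
    resources:
      limits:
        aws.amazon.com/neuron: 1   # Neuron devices allocated as contiguous IDs by the extension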
The solution's performance comprises the time it takes to bring Neuron devices online and prepare them to process inference requests. The Neuron runtime requires the model to be compiled to the Neuron Executable File Format (NEFF) so it can execute on the Inferentia chips. A compiled model can weigh several gigabytes and delay container startup. Therefore, we download and extract the model files from external storage into /model using an initContainer, and mount the same workdir volume into the serving container under /app.
initContainers:
- name: pull-model-tar
  image: ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/stablediffusion:amd64-neuron-assets
  imagePullPolicy: Always
  volumeMounts:
  - name: workdir
    mountPath: /model
  command:
  - /bin/bash
  - "-exc"
  - |
    set -x
    aws s3 cp s3://sdinfer/stable-diffusion-2-1-base.tar.gz /model/model.tar.gz
    tar -xzf /model/model.tar.gz -C /model
containers:
- name: app
  image: ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/stablediffusion:amd64-neuron
  volumeMounts:
  - mountPath: /app
    name: workdir
  ...
volumes:
- name: workdir
  emptyDir: {}
The Gradio app's networking rules are defined by the Ingress deployment, which covers application readiness, health, and traffic-routing rules such as session stickiness. Readiness and health rules need to align with the K8s Deployment specification to minimize end-user impact during unexpected failures or scale-down events. For example:
readinessProbe:
  httpGet:
    path: /readiness
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
This tells the kubelet to start probing the pod's /readiness URI 60 seconds after startup, repeat the HTTP GET every 10 seconds, and mark the pod ready only when the probe succeeds. The initial delay allows the pod to load the model from external stores such as Hugging Face and S3. The Ingress annotations instruct the ALB to probe the pod's /health URI every 10 seconds to determine whether the pod is healthy.
annotations:
  kubernetes.io/ingress.class: alb
  alb.ingress.kubernetes.io/scheme: internet-facing
  alb.ingress.kubernetes.io/target-type: ip
  alb.ingress.kubernetes.io/healthcheck-path: /health
  alb.ingress.kubernetes.io/healthcheck-interval-seconds: '10'
  alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '9'
  alb.ingress.kubernetes.io/healthy-threshold-count: '2'
  alb.ingress.kubernetes.io/unhealthy-threshold-count: '3'
  alb.ingress.kubernetes.io/success-codes: '200-301'
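These annotations live on the Ingress object itself. Here is a minimal sketch of such an Ingress; the backing Service name and port are assumptions for illustration and must match the Service deployed later, while the Ingress name matches the kubectl output shown below.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sd21-512-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /health
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: sd21-inf2-serve  # assumed Service name
            port:
              number: 8000         # assumed app port, matching the readiness probe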
This is a StableDiffusionPipeline based on stabilityai/stable-diffusion-2-1-base. Updated compile and benchmark code is in sd2_512_compile and sd2_512_benchmark.
- Create a cluster with Karpenter node pools that provision inf2 instances
- Configure pods to use the Amazon VPC CNI
aws eks create-addon --cluster-name ${CLUSTER_NAME} --addon-name vpc-cni --addon-version v1.16.2-eksbuild.1 \
--service-account-role-arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/AmazonEKSVPCCNIRole
aws eks describe-addon --cluster-name ${CLUSTER_NAME} --addon-name vpc-cni --query addon.addonVersion --output text
- Deploy the Karpenter NodePool for Inferentia
cat inf2-nodepool.yaml | envsubst | kubectl apply -f -
- Deploy the Neuron device plugin and scheduler extension
kubectl apply -f k8s-neuron-device-plugin-rbac.yml
kubectl apply -f k8s-neuron-device-plugin.yml
kubectl apply -f k8s-neuron-scheduler-eks.yml
kubectl apply -f my-scheduler.yml
- Deploy the OCI image pipeline
- Allow access to the model assets S3 bucket using pod identity (the assumed contents of sd21-sa.yaml are sketched after the commands)
kubectl apply -f sd21-sa.yaml
aws iam create-policy --policy-name allow-access-to-model-assets --policy-document file://allow-access-to-model-assets.json
aws iam create-role --role-name allow-access-to-model-assets --assume-role-policy-document file://trust-relationship.json --description "allow-access-to-model-assets"
aws iam attach-role-policy --role-name allow-access-to-model-assets --policy-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:policy/allow-access-to-model-assets
aws eks create-pod-identity-association --cluster-name ${CLUSTER_NAME} --role-arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/allow-access-to-model-assets --namespace default --service-account sd21-sa
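The service account referenced by the association must exist in the cluster, and the compile and serving pods must set serviceAccountName: sd21-sa in their spec. A minimal sketch of what sd21-sa.yaml is assumed to contain:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sd21-sa
  namespace: default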
- Deploy a job that compiles the model with the Neuron SDK and stages it in the S3 bucket ${BUCKET} (a sketch of the Job manifest follows the command)
kubectl apply -f sd21-512-compile-job.yaml
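For orientation, here is a minimal sketch of what sd21-512-compile-job.yaml might look like; the Job name, container name, and image tag are assumptions for illustration:
apiVersion: batch/v1
kind: Job
metadata:
  name: sd21-512-compile           # hypothetical Job name
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: sd21-sa  # pod identity grants access to the ${BUCKET} S3 bucket
      restartPolicy: Never
      containers:
      - name: compile
        image: ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/stablediffusion:amd64-neuron-compile  # hypothetical tag
        resources:
          limits:
            aws.amazon.com/neuron: 1  # assumed: run the compile on an Inferentia-backed node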
- The model file is now in S3 at ${BUCKET}/${MODEL_FILE}.tar.gz; deploy the inference Deployment
kubectl apply -f sd21-512-server-deploy.yaml
Identify the security group attached to the node that runs the launched pod; its description reads "Communication between all nodes in the cluster". Add the tag kubernetes.io/cluster/${CLUSTER_NAME}=shared to that security group.
- Deploy the AWS Load Balancer Controller to enable public ingress access to the inference pods, then create the Service and Ingress (a sketch of the assumed Service manifest follows the commands)
kubectl apply -f sd21-512-serve-svc.yaml
kubectl apply -f sd21-512-serve-ingress.yaml
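A minimal sketch of what sd21-512-serve-svc.yaml is assumed to contain; the Service name, selector label, and port are illustrative and must match the Deployment labels and the probe port used above:
apiVersion: v1
kind: Service
metadata:
  name: sd21-inf2-serve            # assumed Service name, referenced by the Ingress backend
spec:
  type: ClusterIP
  selector:
    app: sd21-inf2-serve           # assumed pod label from the Deployment
  ports:
  - port: 8000
    targetPort: 8000               # matches the readiness probe port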
- The final step is to discover the Gradio app endpoint and the pods that will process the user's real-time inference requests.
kubectl get ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
sd21-512-ingress alb * sd21inf2serve-1693537287.us-west-2.elb.amazonaws.com 80 1h
kubectl get po
NAME READY STATUS RESTARTS AGE
sd21-inf2-serve-589566d645-2ktjf 1/1 Running 0 1h
Note that the pod's READY column shows 1/1, which indicates that /readiness requests return 200 responses. The pod logs confirm this:
kubectl logs sd21-inf2-serve-589566d645-2ktjf
INFO: 192.168.183.154:41638 - "GET /readiness HTTP/1.1" 200 OK
INFO: 192.168.92.117:38404 - "GET /health HTTP/1.1" 200 OK
INFO: 192.168.44.216:7670 - "GET /health HTTP/1.1" 200 OK
INFO: 192.168.10.49:57274 - "GET /health HTTP/1.1" 200 OK
INFO: 192.168.183.154:46682 - "GET /readiness HTTP/1.1" 200 OK
INFO: 192.168.92.117:56820 - "GET /health HTTP/1.1" 200 OK
INFO: 192.168.44.216:32514 - "GET /health HTTP/1.1" 200 OK
INFO: 192.168.10.49:26812 - "GET /health HTTP/1.1" 200 OK
Note that the pod is powered by an inf2.xlarge instance, which satisfies the aws.amazon.com/neuron: 1 resource request.
kubectl get no -L node.kubernetes.io/instance-type
NAME STATUS ROLES AGE VERSION INSTANCE-TYPE
ip-192-168-183-154.us-west-2.compute.internal Ready <none> 1h v1.28.5-eks-5e0fdde inf2.xlarge
ip-192-168-53-23.us-west-2.compute.internal Ready <none> 2d v1.28.5-eks-5e0fdde m5.large
ip-192-168-7-215.us-west-2.compute.internal Ready <none> 2d v1.28.5-eks-5e0fdde m5.large
Invoking the Gradio app endpoint:
curl sd21inf2serve-1693537287.us-west-2.elb.amazonaws.com
{"message":"This is stabilityai/stable-diffusion-2-1-base on AWS EC2 xlainstance; try /load/{n_runs}, /serve, /health, or /ready"}
This lists the app's supported API calls. We use /serve to invoke the Gradio app for images.
- Scale the Gradio app from 1 pod to 2 and notice that Karpenter brings another inf2.xlarge online to power the additional pod.
kubectl scale deploy sd21-inf2-serve --replicas=2
....
kubectl get no -L node.kubernetes.io/instance-type
NAME STATUS ROLES AGE VERSION INSTANCE-TYPE
ip-192-168-183-154.us-west-2.compute.internal Ready <none> 1h v1.28.5-eks-5e0fdde inf2.xlarge
ip-192-168-53-23.us-west-2.compute.internal Ready <none> 2d v1.28.5-eks-5e0fdde m5.large
ip-192-168-7-215.us-west-2.compute.internal Ready <none> 2d v1.28.5-eks-5e0fdde m5.large
ip-192-168-74-41.us-west-2.compute.internal Ready <none> 10m v1.28.5-eks-5e0fdde inf2.xlarge
...
kubectl get po
NAME READY STATUS RESTARTS AGE
sd21-inf2-serve-589566d645-2ktjf 1/1 Running 0 1h
sd21-inf2-serve-589566d645-g6jvw 1/1 Running 0 10m