ml-ops-poc

Table of Contents

Summary

Repository showcasing ML Ops practices with kubeflow and mlflow

References

Features

  • Deployment of Azure Kubernetes Service (AKS) clusters
  • kubeflow operator or mlflow helm chart installations in deployed AKS clusters
  • CD workflow for on-demand AKS deployments and kubeflow operator or mlflow helm chart installations
  • CD workflow for on-demand deployments of an Azure Storage Account container (for storing Terraform state files)
  • CD workflow for on-demand Azure Container Registry deployments for storing internal Docker images
  • devcontainer.json with the necessary tooling for local development
  • Python (PyTorch or TensorFlow) application for ML training and inference purposes and Jupyter notebooks
    • Simple feedforward neural network with MNIST dataset to map input images to their corresponding digit classes
    • CNN architecture training and inference using the COCO dataset for image classification AI applications (NOTE: compute and storage intensive; read the Download the COCO dataset images comments on preferred hardware specs)
    • Transformer architecture training using pre-trained models for chatbot AI applications
  • Dockerizing Python (PyTorch or TensorFlow) applications for ML training and inference
  • CI pipeline deploying an ACR
  • CI pipeline containerizing and pushing Python TensorFlow or PyTorch applications for training to a deployed ACR
  • Helm charts with K8s manifests for containerized Python TensorFlow/PyTorch ML jobs using the Training Operator CRDs and GitOps through ArgoCD
  • Installation of the Training Operator CRDs and application of sample TFJob and PyTorchJob k8s manifests
  • Internal inference service and client along with Dockerization and Helm chart integration of the service application
  • Enable GPU-accelerated ML training and inference k8s pods and add corresponding helm charts. Check out Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS): "For AKS node pools, we recommend a minimum size of Standard_NC6s_v3"
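The feedforward MNIST example listed above can be sketched as a minimal forward pass. This is illustrative only; the layer sizes, activation choice, and random weight initialization are assumptions, not the repository's actual model:

```python
import math
import random

random.seed(0)

def linear(x, w, b):
    # y = W x + b for a single sample
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    # Subtract the max for numerical stability before exponentiating
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

# Assumed shapes: 784-dim flattened MNIST input, one 128-unit hidden layer, 10 digit classes
n_in, n_hidden, n_out = 784, 128, 10
w1 = [[random.gauss(0, 0.01) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[random.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

x = [random.random() for _ in range(n_in)]  # stand-in for a normalized MNIST image
probs = softmax(linear(relu(linear(x, w1, b1)), w2, b2))
print(len(probs), round(sum(probs), 6))
```

In the repository the same idea is expressed with PyTorch or TensorFlow layers, which additionally provide the backward pass for training.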

Getting started

GitHub workflows are utilized in this repository. Once the workflows described in the Preconditions and Deploy an AKS cluster and install the kubeflow or mlflow components sections have been successfully executed, all listed resource groups should be visible in the Azure Portal UI:

Deployed resource groups Deployed cloud-infra resource group

Preconditions

  1. Deploy an Azure Storage Account including a container for Terraform backends through the deploy-tf-backend workflow

Deploy an AKS cluster, install the kubeflow or mlflow components or set up kubernetes resources for applications

  1. Deploy an AKS cluster through the deploy-k8s-cluster workflow
  2. Optional: Install external helm charts (e.g. ml-ops tools) into the deployed kubernetes cluster through the install-helm-charts workflow
  3. Optional: Deploy kubernetes resources for applications (secrets or reverse-proxy ingress) through the create-internal-k8s-resources workflow

NOTE:

  • Set all required GitHub secrets for the above workflows
  • To locally access the deployed AKS cluster, launch the devcontainer and retrieve the necessary kube config as displayed in the GitHub workflow step titled Download the ~/.kube/config

kubeflow

To access the kubeflow dashboard following the installation of kustomize and kubeflow components, execute the following command:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

and open http://localhost:8080 in a browser of choice. Log in with the default user's credentials: the default email address is user@example.com and the default password is 12341234.

kubeflow-dashboard

Jupyter notebooks

When creating the Jupyter notebook instance consider the following data volume:

Jupyter instance data volume

The volumes that were created appear as follows:

Jupyter instance created volumes

The Jupyter instance that was created appears as follows:

Created Jupyter instance

NOTE: You can check the status of the Jupyter instance pods:

Check jupyter instance pods

Once connected to a Jupyter instance, make sure to clone this Git repository (HTTPS URL: https://github.com/MGTheTrain/ml-ops-poc.git):

Clone git repository

You should then have the repository cloned in your workspace:

Cloned git repository in jupyter instance

Execute a Jupyter notebook to either train the model or perform inference (it's preferable to begin with mnist-trainnig.ipynb; the others are either resource intensive or not yet implemented):

Run jupyter notebook example

Applying TFJob or PyTorchJob k8s manifests

After successful installation of the Kubeflow Training Operator, apply some sample k8s ML training jobs, e.g. for PyTorch and for TensorFlow.

# PyTorch (https://github.com/kubeflow/training-operator/blob/release-1.9/examples/pytorch/simple.yaml)
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/release-1.9/examples/pytorch/simple.yaml

training operator simple pytorch job training operator simple pytorch job pt 2

# TensorFlow (https://github.com/kubeflow/training-operator/blob/release-1.9/examples/tensorflow/simple.yaml)
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/release-1.9/examples/tensorflow/simple.yaml

training operator simple tf job
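A PyTorchJob manifest follows this general shape (a sketch modeled on the upstream simple.yaml example; the image, command, and namespace are illustrative assumptions, so refer to the linked manifests for the exact values):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1  # illustrative image
              command: ["python3", "/opt/pytorch-mnist/mnist.py", "--epochs=1"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1  # illustrative image
              command: ["python3", "/opt/pytorch-mnist/mnist.py", "--epochs=1"]
```

The Training Operator watches for PyTorchJob and TFJob resources and spawns the Master/Worker pods accordingly; job progress can be inspected with kubectl get pytorchjobs and kubectl logs on the created pods.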

You can also register and sync ArgoCD applications referencing Helm charts to enable GitOps. For more details check out the gitops-poc repository. Essential commands for the Keras MNIST training example are:

# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443

# The default username is admin. The default password can be obtained through: kubectl get secret argocd-initial-admin-secret -n external-services -o jsonpath="{.data.password}" | base64 -d

# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password

# e.g. for keras-mnist chart
argocd app create keras-mnist \
  --repo https://github.com/MGTheTrain/ml-ops-poc.git \
  --path gitops/argocd/keras-mnist-training \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace internal-apps \
  --revision main \
  --server localhost:8080

# In terminal process B - Sync Application
argocd app sync keras-mnist
# In terminal process B - Monitor Application Status
argocd app get keras-mnist

The ArgoCD applications that have been registered and synchronized should resemble the following:

ArgoCD applications

MNIST keras training argocd app

The training job logs resemble the following:

Training Operator Keras MNIST Training tf training job logs Training Operator Keras MNIST Training tf training job logs pt 2

As its final step, the training job uploads the trained model to an Azure Storage Account container:

Training Operator Keras MNIST Training tf training job uploaded model in Azure Storage Account

The training job status resembles the following:

Training Operator Keras MNIST Training tf training job status

KServe InferenceService

Refer to the following link for guidance.

Set up an authorized Azure Service Principal:

az ad sp create-for-rbac --name model-store-sp --role "Storage Blob Data Owner" --scopes /subscriptions/<your subscription id>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>
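The JSON printed by az ad sp create-for-rbac maps onto the azcreds secret's stringData keys roughly as follows (the field names match the az CLI output; the values here are placeholders and the mapping sketch is illustrative):

```python
import json

# Example shape of `az ad sp create-for-rbac` output (values are placeholders)
sp_output = json.loads("""
{
  "appId": "00000000-0000-0000-0000-000000000000",
  "displayName": "model-store-sp",
  "password": "<generated secret>",
  "tenant": "00000000-0000-0000-0000-000000000000"
}
""")

# Mapping onto the stringData keys of the azcreds secret
azcreds = {
    "AZ_CLIENT_ID": sp_output["appId"],
    "AZ_CLIENT_SECRET": sp_output["password"],
    "AZ_TENANT_ID": sp_output["tenant"],
    # AZ_SUBSCRIPTION_ID is not part of this output; obtain it e.g. via `az account show`
}
print(sorted(azcreds))
```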

Edit the stringData values of the secret below and run:

kubectl apply -n internal-apps -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: azcreds
type: Opaque
stringData:
  AZ_CLIENT_ID: <your AZ_CLIENT_ID>
  AZ_CLIENT_SECRET: <your AZ_CLIENT_SECRET>
  AZ_SUBSCRIPTION_ID: <your AZ_SUBSCRIPTION_ID>
  AZ_TENANT_ID: <your AZ_TENANT_ID>
EOF

Register and synchronize the ArgoCD application:

# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443

# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password

# e.g. for keras-mnist-inference chart
argocd app create keras-mnist-inference \
  --repo https://github.com/MGTheTrain/ml-ops-poc.git \
  --path gitops/argocd/keras-mnist-inference \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace internal-apps \
  --revision main \
  --server localhost:8080

# In terminal process B - Sync Application
argocd app sync keras-mnist-inference

Due to AKS node resource constraints, experiments related to InferenceServices through KServe have been aborted:

aborted due to vm scaling constraints

aborted due to vm scaling constraints part 2

The InferenceService pulls the tensorflow/serving Docker image, which could lead to allocation issues due to its size of 1 to 1.5 GB.

aborted due to vm scaling constraints part 3

Internal inference service

Create Blob secret:

kubectl create secret generic blob-secret --from-literal=blob_name=<mnist_model-20250206190322.h5> -n internal-apps

Register and synchronize the ArgoCD application:

# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443

# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password

# e.g. for keras-mnist-internal-inference chart
argocd app create keras-mnist-internal-inference \
  --repo https://github.com/MGTheTrain/ml-ops-poc.git \
  --path gitops/argocd/keras-mnist-internal-inference \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace internal-apps \
  --revision main \
  --server localhost:8080

# In terminal process B - Sync Application
argocd app sync keras-mnist-internal-inference

Resulting inference service logs should resemble:

internal inference service logs

once the client has submitted a /predict request to the inference service:

internal inference client logs
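The client's /predict request can be sketched as follows. The /predict endpoint comes from the text above; the payload schema, in-cluster hostname, and port are assumptions for illustration, not the repository's actual client code:

```python
import json
import random
import urllib.request

random.seed(0)

# Build a stand-in for a flattened, normalized 28x28 MNIST image
image = [random.random() for _ in range(28 * 28)]
payload = json.dumps({"instances": [image]}).encode("utf-8")

# Hypothetical in-cluster service URL; adjust host and port to the deployed service
req = urllib.request.Request(
    url="http://keras-mnist-internal-inference.internal-apps.svc.cluster.local:8080/predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)

# Uncomment when running inside the cluster; outside it the hostname will not resolve
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
print(len(image), req.get_method())
```

Because the service is internal (ClusterIP), such a request only works from a pod inside the cluster or through a port-forward.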

mlflow

To access the MLflow dashboard following the installation of the MLflow Helm chart, execute the following command:

kubectl port-forward -n ml-ops-poc <mlflow pod name> 5000:5000

and open http://localhost:5000 in a browser of choice.

mlflow-dashboard

Destroy the AKS cluster, uninstall helm charts or remove kubernetes resources for applications

  1. Optional: Uninstall the ml tools of an existing kubernetes cluster through the uninstall-helm-charts workflow
  2. Optional: Destroy kubernetes resources for applications (secrets or reverse-proxy ingress) through the delete-internal-k8s-resources workflow
  3. Destroy an AKS cluster through the destroy-k8s-cluster workflow