Repository showcasing ML Ops practices with Kubeflow and MLflow
- Deploy kubeflow into an AKS cluster using default settings
- kubeflow - Minimum system requirements
- Deploy InferenceService with saved model on Azure
- Kubeflow components and external add-ons. Note that KServe is an external add-on and needs to be installed
- Install KServe. NOTE: KServe v0.7.0 leads to errors. v0.9.0 works
- Distributed Machine Learning Patterns GitHub repository
- Deployment of Azure Kubernetes Service (AKS) clusters
- Kubeflow operator or MLflow Helm chart installations in deployed AKS clusters
- CD workflow for on-demand AKS deployments and Kubeflow operator or MLflow Helm chart installations
- CD workflow for on-demand deployments of an Azure Storage Account container (for storing Terraform state files)
- CD workflow for on-demand Azure Container Registry deployments to store internal Docker images
- Added devcontainer.json with necessary tooling for local development
- Python (PyTorch or TensorFlow) applications for ML training and inference purposes and Jupyter notebooks
- Simple feedforward neural network trained on the MNIST dataset to map input images to their corresponding digit classes
- CNN architecture training and inference considering the COCO dataset for image classification AI applications (NOTE: compute and storage intensive. Read the Download the COCO dataset images comments on preferred hardware specs)
- Transformer architecture training considering pre-trained models for chatbot AI applications
- Dockerizing Python (PyTorch or TensorFlow) applications for ML training and inference
- CI pipeline deploying an ACR
- CI pipeline containerizing and pushing Python TensorFlow or PyTorch applications for training to a deployed ACR
- Helm charts with K8s manifests for containerized Python TensorFlow/PyTorch ML jobs using the Training Operator for CRDs and GitOps through ArgoCD
- Installation of the Training Operator for CRDs and applying sample TFJob and PyTorchJob k8s manifests
- Internal inference service and client along with Dockerization and Helm chart integration of the service application
- Enable GPU-accelerated ML training and inference k8s pods. Add corresponding Helm charts. Check out Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS): "For AKS node pools, we recommend a minimum size of Standard_NC6s_v3" (see the sketch below)
GitHub workflows are utilized in this repository. Once the workflows described in the Preconditions and Deploy an AKS cluster and install the kubeflow or mlflow components sections have been successfully executed, all listed resource groups should be visible in the Azure Portal UI:
- Deploy an Azure Storage Account service including a container for Terraform backends through the deploy-tf-backend workflow
Deploy an AKS cluster, install the kubeflow or mlflow components or set up Kubernetes resources for applications
- Deploy an AKS cluster through the deploy-k8s-cluster workflow
- Optional: Install external Helm charts (e.g. ML Ops tools) into the deployed Kubernetes cluster through the install-helm-charts workflow
- Optional: Deploy Kubernetes resources for applications (secrets or reverse-proxy ingress) through the create-internal-k8s-resources workflow
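Once the above workflows have run, the resulting resource groups can also be listed from the devcontainer via the Azure CLI instead of the Azure Portal UI:
# List all resource groups of the active subscription in a compact table
az group list --output table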
NOTE:
- Set all the required GitHub secrets for the workflows above
- To locally access the deployed AKS cluster, launch the devcontainer and retrieve the necessary kube config as displayed in the GitHub workflow step titled Download the ~/.kube/config (alternatively, see the sketch below)
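If retrieving the kube config from the workflow logs is not feasible, it can alternatively be fetched directly via the Azure CLI; a sketch, where the resource group and cluster names are placeholders:
# Merge the AKS cluster credentials into the local ~/.kube/config
az aks get-credentials --resource-group <your resource group name> --name <your AKS cluster name>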
To access the kubeflow dashboard following the installation of kustomize and the kubeflow components, execute the following command:
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
and visit localhost:8080 in a browser of choice.
When creating the Jupyter notebook instance, consider the following data volume:
The created volumes appear as follows:
The created Jupyter instance appears as follows:
NOTE: You can check the status of the Jupyter instance pods:
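A minimal sketch, assuming the notebook server runs in the default per-user Kubeflow namespace (the namespace name is an assumption):
# List the notebook server pods in the user namespace (namespace is an assumption)
kubectl get pods -n kubeflow-user-example-com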
Once CONNECTED to a Jupyter instance, make sure to clone this Git repository (HTTPS URL: https://github.com/MGTheTrain/ml-ops-poc.git):
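For example, from a terminal within the Jupyter instance:
# Clone the repository into the Jupyter workspace
git clone https://github.com/MGTheTrain/ml-ops-poc.git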
You then should have the repository cloned in your workspace:
Execute a Jupyter notebook to either train the model or perform inference (P.S. It's preferable to begin with mnist-training.ipynb; the others are either resource intensive or not yet implemented):
After successful installation of the Kubeflow Training Operator, apply some sample k8s ML training jobs, e.g. for PyTorch and for Tensorflow.
# PyTorch (https://github.com/kubeflow/training-operator/blob/release-1.9/examples/pytorch/simple.yaml)
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/release-1.9/examples/pytorch/simple.yaml
# TensorFlow (https://github.com/kubeflow/training-operator/blob/release-1.9/examples/tensorflow/simple.yaml)
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/release-1.9/examples/tensorflow/simple.yaml
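The status and logs of the submitted jobs can then be inspected with kubectl; a sketch, where the pod name and namespace are assumptions derived from the sample manifests:
# List the custom resources created by the Training Operator across all namespaces
kubectl get pytorchjobs,tfjobs -A
# Inspect the logs of the PyTorch sample's master replica (pod name and namespace are assumptions)
kubectl logs -n kubeflow pytorch-simple-master-0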
You can also register and sync ArgoCD applications referencing Helm charts to enable GitOps. For more details check out the gitops-poc repository. Essential commands for the Keras MNIST training example are:
# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443
# The default username is admin. The default password can be obtained through: kubectl get secret argocd-initial-admin-secret -n external-services -o jsonpath="{.data.password}" | base64 -d
# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password
# e.g. for keras-mnist chart
argocd app create keras-mnist \
--repo https://github.com/MGTheTrain/ml-ops-poc.git \
--path gitops/argocd/keras-mnist-training \
--dest-server https://kubernetes.default.svc \
--dest-namespace internal-apps \
--revision main \
--server localhost:8080
# In terminal process B - Sync Application
argocd app sync keras-mnist
# In terminal process B - Monitor Application Status
argocd app get keras-mnist
The ArgoCD applications that have been registered and synchronized should resemble the following:
Training job logs resemble the following:
As its final step, the training job uploads the trained model to an Azure Storage Account container:
The training job status resembles the following:
Refer to the following link for guidance.
Set up an authorized Azure Service Principal:
az ad sp create-for-rbac --name model-store-sp --role "Storage Blob Data Owner" --scopes /subscriptions/<your subscription id>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>
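The command prints a JSON document whose fields map to the secret values below: appId maps to AZ_CLIENT_ID, password to AZ_CLIENT_SECRET and tenant to AZ_TENANT_ID. A sketch for extracting them, assuming jq is installed:
# Capture the service principal JSON and extract the credential fields (assumes jq is available)
SP_JSON=$(az ad sp create-for-rbac --name model-store-sp --role "Storage Blob Data Owner" \
  --scopes /subscriptions/<your subscription id>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>)
AZ_CLIENT_ID=$(echo "$SP_JSON" | jq -r '.appId')
AZ_CLIENT_SECRET=$(echo "$SP_JSON" | jq -r '.password')
AZ_TENANT_ID=$(echo "$SP_JSON" | jq -r '.tenant')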
Edit the stringData values of the secret below and run:
kubectl apply -n internal-apps -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: azcreds
type: Opaque
stringData:
  AZ_CLIENT_ID: <your AZ_CLIENT_ID>
  AZ_CLIENT_SECRET: <your AZ_CLIENT_SECRET>
  AZ_SUBSCRIPTION_ID: <your AZ_SUBSCRIPTION_ID>
  AZ_TENANT_ID: <your AZ_TENANT_ID>
EOF
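It can then be verified that the secret exists in the target namespace:
# Confirm the azcreds secret was created
kubectl get secret azcreds -n internal-apps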
Register and synchronize the ArgoCD application:
# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443
# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password
# e.g. for keras-mnist-inference chart
argocd app create keras-mnist-inference \
--repo https://github.com/MGTheTrain/ml-ops-poc.git \
--path gitops/argocd/keras-mnist-inference \
--dest-server https://kubernetes.default.svc \
--dest-namespace internal-apps \
--revision main \
--server localhost:8080
# In terminal process B - Sync Application
argocd app sync keras-mnist-inference
Due to AKS node resource constraints, experiments related to InferenceServices through KServe have been aborted: the InferenceService pulls the tensorflow/serving Docker image, which could lead to allocation issues due to its size of 1 to 1.5 GB.
Create Blob secret:
kubectl create secret generic blob-secret --from-literal=blob_name=<mnist_model-20250206190322.h5> -n internal-apps
Register and synchronize the ArgoCD application:
# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443
# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password
# e.g. for keras-mnist-internal-inference chart
argocd app create keras-mnist-internal-inference \
--repo https://github.com/MGTheTrain/ml-ops-poc.git \
--path gitops/argocd/keras-mnist-internal-inference \
--dest-server https://kubernetes.default.svc \
--dest-namespace internal-apps \
--revision main \
--server localhost:8080
# In terminal process B - Sync Application
argocd app sync keras-mnist-internal-inference
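For manual testing, a /predict request can also be submitted directly; a sketch, where the service name, local port and payload shape are assumptions:
# Port-forward the internal inference service (service name and port are assumptions)
kubectl port-forward -n internal-apps svc/keras-mnist-internal-inference 8000:8000
# Submit a /predict request with a hypothetical JSON payload
curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"data": [<flattened 28x28 MNIST image pixel values>]}'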
The resulting inference service logs, once the client has submitted a /predict request to the inference service, should resemble the following:
To access the MLflow dashboard following the installation of the MLflow Helm chart, execute the following command:
kubectl port-forward -n ml-ops-poc <mlflow pod name> 5000:5000
and visit localhost:5000 in a browser of choice.
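The MLflow pod name can be determined beforehand, e.g.:
# List the pods in the ml-ops-poc namespace to determine <mlflow pod name>
kubectl get pods -n ml-ops-poc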
- Optional: Uninstall only the ML tools of an existing Kubernetes cluster through the uninstall-helm-charts workflow
- Optional: Destroy Kubernetes resources for applications (secrets or reverse-proxy ingress) through the delete-internal-k8s-resources workflow
- Destroy an AKS cluster through the destroy-k8s-cluster workflow