Repository showcasing ML Ops practices with Kubeflow and MLflow
- Deploy kubeflow into an AKS cluster using default settings
- kubeflow - Minimum system requirements
- Deploy InferenceService with saved model on Azure
- Kubeflow components and external add-ons. Note that KServe is an external add-on and needs to be installed
- Install KServe. NOTE: KServe v0.7.0 leads to errors. v0.9.0 works
- Distributed Machine Learning Patterns GitHub repository
- Deployment of Azure Kubernetes Service (AKS) clusters
- Kubeflow operator or MLflow Helm chart installations in deployed AKS clusters
- CD workflow for on-demand AKS deployments and Kubeflow operator or MLflow Helm chart installations
- CD workflow for on-demand deployments of an Azure Storage Account container (for storing Terraform state files)
- CD workflow for on-demand Azure Container Registry deployments to store internal Docker images
- Added devcontainer.json with necessary tooling for local development
- Python (PyTorch or TensorFlow) applications for ML training and inference purposes and Jupyter notebooks
- Simple feedforward neural network trained on the MNIST dataset to map input images to their corresponding digit classes
- CNN architecture training and inference considering the COCO dataset for image classification AI applications (NOTE: compute and storage intensive. Read the Download the COCO dataset images comments on preferred hardware specs)
- Transformer architecture training considering pre-trained models for chatbot AI applications
- Dockerizing Python (PyTorch or TensorFlow) applications for ML training and inference
- CI pipeline deploying an ACR
- CI pipeline containerizing and pushing Python TensorFlow or PyTorch applications for training to a deployed ACR
- Helm charts with K8s manifests for containerized Python TensorFlow/PyTorch ML jobs using the Training Operator for CRDs and GitOps through ArgoCD
- Installation of the Training Operator for CRDs and applying sample TFJob and PyTorchJob k8s manifests
- Internal inference service and client along with Dockerization and Helm chart integration of the service application
- Enable GPU-accelerated ML training and inference k8s pods. Add corresponding Helm charts. Check out Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS): "For AKS node pools, we recommend a minimum size of Standard_NC6s_v3" (see the sketch below)
GitHub workflows are utilized in this repository. Once the workflows described in the Preconditions and Deploy an AKS cluster and install the kubeflow or mlflow components sections have been successfully executed, all listed resource groups should be visible in the Azure Portal UI:
- Deploy an Azure Storage Account service including a container for Terraform backends through the deploy-tf-backend workflow
Deploy an AKS cluster, install the kubeflow or mlflow components or set up Kubernetes resources for applications
- Deploy an AKS cluster through the deploy-k8s-cluster workflow
- Optional: Install external Helm charts (e.g. ML Ops tools) into the deployed Kubernetes cluster through the install-helm-charts workflow
- Optional: Deploy Kubernetes resources for applications (secrets or reverse-proxy ingress) through the create-internal-k8s-resources workflow
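Once the above workflows have run, the resulting resource groups can also be listed from the devcontainer via the Azure CLI instead of the Azure Portal UI:
# List all resource groups of the active subscription in a compact table
az group list --output table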
NOTE:
- Set all the required GitHub secrets for the workflows above
- To locally access the deployed AKS cluster, launch the devcontainer and retrieve the necessary kube config as displayed in the GitHub workflow step titled Download the ~/.kube/config (alternatively, see the sketch below)
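If retrieving the kube config from the workflow logs is not feasible, it can alternatively be fetched directly via the Azure CLI; a sketch, where the resource group and cluster names are placeholders:
# Merge the AKS cluster credentials into the local ~/.kube/config
az aks get-credentials --resource-group <your resource group name> --name <your AKS cluster name>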
To access the kubeflow dashboard following the installation of kustomize and the kubeflow components, execute the following command:
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
and visit localhost:8080 in a browser of choice.
When creating the Jupyter notebook instance, consider the following data volume:
The created volumes appear as follows:
The created Jupyter instance appears as follows:
NOTE: You can check the status of the Jupyter instance pods:
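A minimal sketch, assuming the notebook server runs in the default per-user Kubeflow namespace (the namespace name is an assumption):
# List the notebook server pods in the user namespace (namespace is an assumption)
kubectl get pods -n kubeflow-user-example-com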
Once CONNECTED to a Jupyter instance, make sure to clone this Git repository (HTTPS URL: https://github.com/MGTheTrain/ml-ops-poc.git):
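For example, from a terminal within the Jupyter instance:
# Clone the repository into the Jupyter workspace
git clone https://github.com/MGTheTrain/ml-ops-poc.git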
You then should have the repository cloned in your workspace:
Execute a Jupyter notebook to either train the model or perform inference (P.S. It's preferable to begin with mnist-training.ipynb; the others are either resource intensive or not yet implemented):
After successful installation of the Kubeflow Training Operator, apply some sample k8s ML training jobs, e.g. for PyTorch and for Tensorflow.
# PyTorch (https://github.com/kubeflow/training-operator/blob/release-1.9/examples/pytorch/simple.yaml)
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/release-1.9/examples/pytorch/simple.yaml
# TensorFlow (https://github.com/kubeflow/training-operator/blob/release-1.9/examples/tensorflow/simple.yaml)
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/release-1.9/examples/tensorflow/simple.yaml
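The status and logs of the submitted jobs can then be inspected with kubectl; a sketch, where the pod name and namespace are assumptions derived from the sample manifests:
# List the custom resources created by the Training Operator across all namespaces
kubectl get pytorchjobs,tfjobs -A
# Inspect the logs of the PyTorch sample's master replica (pod name and namespace are assumptions)
kubectl logs -n kubeflow pytorch-simple-master-0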
You can also register and sync ArgoCD applications referencing Helm charts to enable GitOps. For more details check out the gitops-poc repository. Essential commands for the Keras MNIST training example are:
# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443
# The default username is admin. The default password can be obtained through: kubectl get secret argocd-initial-admin-secret -n external-services -o jsonpath="{.data.password}" | base64 -d
# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password
# e.g. for keras-mnist chart
argocd app create keras-mnist \
--repo https://github.com/MGTheTrain/ml-ops-poc.git \
--path gitops/argocd/keras-mnist-training \
--dest-server https://kubernetes.default.svc \
--dest-namespace internal-apps \
--revision main \
--server localhost:8080
# In terminal process B - Sync Application
argocd app sync keras-mnist
# In terminal process B - Monitor Application Status
argocd app get keras-mnist
The ArgoCD applications that have been registered and synchronized should resemble the following:
Training job logs resemble the following:
As its final step, the training job uploads the trained model to an Azure Storage Account container:
The training job status resembles the following:
Refer to the following link for guidance.
Set up an authorized Azure Service Principal:
az ad sp create-for-rbac --name model-store-sp --role "Storage Blob Data Owner" --scopes /subscriptions/<your subscription id>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>
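The command prints a JSON document whose fields map to the secret values below: appId maps to AZ_CLIENT_ID, password to AZ_CLIENT_SECRET and tenant to AZ_TENANT_ID. A sketch for extracting them, assuming jq is installed:
# Capture the service principal JSON and extract the credential fields (assumes jq is available)
SP_JSON=$(az ad sp create-for-rbac --name model-store-sp --role "Storage Blob Data Owner" \
  --scopes /subscriptions/<your subscription id>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>)
AZ_CLIENT_ID=$(echo "$SP_JSON" | jq -r '.appId')
AZ_CLIENT_SECRET=$(echo "$SP_JSON" | jq -r '.password')
AZ_TENANT_ID=$(echo "$SP_JSON" | jq -r '.tenant')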
Edit the stringData values of the secret below and run:
kubectl apply -n internal-apps -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: azcreds
type: Opaque
stringData:
  AZ_CLIENT_ID: <your AZ_CLIENT_ID>
  AZ_CLIENT_SECRET: <your AZ_CLIENT_SECRET>
  AZ_SUBSCRIPTION_ID: <your AZ_SUBSCRIPTION_ID>
  AZ_TENANT_ID: <your AZ_TENANT_ID>
EOF
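It can then be verified that the secret exists in the target namespace:
# Confirm the azcreds secret was created
kubectl get secret azcreds -n internal-apps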
Register and synchronize the ArgoCD application:
# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443
# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password
# e.g. for keras-mnist-inference chart
argocd app create keras-mnist-inference \
--repo https://github.com/MGTheTrain/ml-ops-poc.git \
--path gitops/argocd/keras-mnist-inference \
--dest-server https://kubernetes.default.svc \
--dest-namespace internal-apps \
--revision main \
--server localhost:8080
# In terminal process B - Sync Application
argocd app sync keras-mnist-inference
Due to AKS node resource constraints, experiments related to InferenceServices through KServe have been aborted: the InferenceService pulls the tensorflow/serving Docker image, which could lead to allocation issues due to its size of 1 to 1.5 GB.
Create Blob secret:
kubectl create secret generic blob-secret --from-literal=blob_name=<mnist_model-20250206190322.h5> -n internal-apps
Register and synchronize the ArgoCD application:
# Port forward in terminal process A
kubectl port-forward -n external-services svc/argocd-server 8080:443
# In terminal process B - Login
argocd login localhost:8080
# Prompted to provide username and password
# e.g. for keras-mnist-internal-inference chart
argocd app create keras-mnist-internal-inference \
--repo https://github.com/MGTheTrain/ml-ops-poc.git \
--path gitops/argocd/keras-mnist-internal-inference \
--dest-server https://kubernetes.default.svc \
--dest-namespace internal-apps \
--revision main \
--server localhost:8080
# In terminal process B - Sync Application
argocd app sync keras-mnist-internal-inference
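For manual testing, a /predict request can also be submitted directly; a sketch, where the service name, local port and payload shape are assumptions:
# Port-forward the internal inference service (service name and port are assumptions)
kubectl port-forward -n internal-apps svc/keras-mnist-internal-inference 8000:8000
# Submit a /predict request with a hypothetical JSON payload
curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"data": [<flattened 28x28 MNIST image pixel values>]}'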
The resulting inference service logs, once the client has submitted a /predict request to the inference service, should resemble the following:
To access the MLflow dashboard following the installation of the MLflow Helm chart, execute the following command:
kubectl port-forward -n ml-ops-poc <mlflow pod name> 5000:5000
and visit localhost:5000 in a browser of choice.
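The MLflow pod name can be determined beforehand, e.g.:
# List the pods in the ml-ops-poc namespace to determine <mlflow pod name>
kubectl get pods -n ml-ops-poc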
- Optional: Uninstall only the ML tools of an existing Kubernetes cluster through the uninstall-helm-charts workflow
- Optional: Destroy Kubernetes resources for applications (secrets or reverse-proxy ingress) through the delete-internal-k8s-resources workflow
- Destroy an AKS cluster through the destroy-k8s-cluster workflow