- Upload the model to the Hugging Face Hub
- Productionize and modularize notebook code
- Makefile
- Dockerize
- Integrate Poetry
- Separate the FastAPI backend and Gradio frontend services into different containers
- Handle them with Docker Compose
- GitHub CI/CD
- Use a self-hosted CI/CD runner
- Use a self-hosted Docker image repository
- Self-host the deployment environment
- Clean up old images/containers on the prod server
- Enable image caching in the self-hosted runner
- Create an automated release when a new tag is created
- FOR FUTURE IMPROVEMENTS: Tag Docker images with release/commit tag
- FOR FUTURE IMPROVEMENTS: Alternative tools would be Jenkins, Argo, CircleCI
- Code styling, linting, security and tests
- Pytest for unit/mock/integration tests
- Code coverage
- Bandit and pip-audit for catching security flaws
- Mypy for type checking
- Ruff for linting
- Integrate tests into the CI/CD pipeline (optionally use Nox to orchestrate)
- FOR FUTURE IMPROVEMENTS: More strict code styling and introduce pre-commit hooks
- FOR FUTURE IMPROVEMENTS: Stub files that mypy can use for the custom modules (model_utils.py)
- Logs and messages
- Logging in all source code
- Loki + Promtail
- Prometheus
- Grafana
- Create metrics for Prometheus
- Create a few Grafana dashboards
- FOR FUTURE IMPROVEMENTS: Kafka
- FOR FUTURE IMPROVEMENTS: ELK (Elasticsearch, Logstash, Kibana)
- Deployment and orchestration
- Kubernetes theory and terminology: cluster, master, workers, pods, deployments, ConfigMaps, secrets, services, ingress, HorizontalPodAutoscaler, rolling updates, probes (liveness/readiness)
- Kubernetes practical
- Set up a full-scale, production-like cluster with multiple machines
- Move the self-hosted Docker image repository onto the cluster
- Set up the GPUs on the K8s cluster
- Deploy the App using a raw K8s manifest
- Return a unique ID with each response to show which container served it
- Helmify the Kubernetes YAML files (create Helm charts)
- K9s workflow
- KServe
- Kubeflow
- Code refactoring and finishing touches
- Training/Validation/Testing scripts and modularity
- Documentation (MkDocs / Sphinx) + Docstrings
- Documentation for the GitHub README - how to start the app, etc.
- Use the GPU instead of the CPU in model_utils
- [IN PROGRESS] Create a diagram for the entire architecture (CI/CD, model retraining, etc.)
- FOR FUTURE IMPROVEMENTS: Deploy in AWS - EC2 or ECS
- FOR FUTURE IMPROVEMENTS: Terraform/Ansible for infrastructure
- FOR FUTURE IMPROVEMENTS: User feedback system for model accuracy, model tracking, data drift tracking, automated retraining (Evidently)
- FOR FUTURE IMPROVEMENTS: Airflow/Dagster/Prefect/Argo
- FOR FUTURE IMPROVEMENTS: Weights & Biases (W&B) - paid alternative to MLflow
- FOR FUTURE IMPROVEMENTS: ONNX/TorchScript (for the PyTorch environment) - used to serialize and export models for cross-framework compatibility and efficient inference
- FOR FUTURE IMPROVEMENTS: TorchServe / TensorFlow Serving / Triton Inference Server - for serving models at scale
- FOR FUTURE IMPROVEMENTS: DVC for storing and versioning data and model weights
- FOR FUTURE IMPROVEMENTS: Distributed Computing - Dask, Spark
- FOR FUTURE IMPROVEMENTS: GPU-accelerated data science: cuDF, cuML
- FOR FUTURE IMPROVEMENTS: Feature stores
- PyTorch Lightning - streamlines developing, training, and scaling deep learning PyTorch models
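
One item in the Kubernetes part of the list concerns knowing which container a response came from. A minimal sketch of one way to do that is shown here, using only the standard library. The names `tag_response`, `instance_info`, and `served_by` are hypothetical and not taken from the project. The sketch assumes the default behaviour where a Kubernetes pod's hostname is the pod name (and a plain Docker container's hostname is its short container ID). Wiring this into the actual backend endpoint is left out.

```python
import socket
import uuid
from typing import Any, Dict, Optional

# Generated once per process, so two replicas never share it
# even if their hostnames were configured to be identical.
INSTANCE_ID = uuid.uuid4().hex


def instance_info(hostname: Optional[str] = None) -> Dict[str, str]:
    """Return identifiers for the process that is serving the request.

    `hostname` can be passed explicitly (useful in tests); otherwise the
    OS hostname is used, which in a Kubernetes pod defaults to the pod name.
    """
    return {
        "host": hostname if hostname is not None else socket.gethostname(),
        "instance_id": INSTANCE_ID,
    }


def tag_response(payload: Dict[str, Any], hostname: Optional[str] = None) -> Dict[str, Any]:
    """Copy the payload and attach the serving instance's identifiers.

    The `served_by` key is an illustrative choice, not an established convention.
    """
    tagged = dict(payload)  # shallow copy so the caller's dict is not mutated
    tagged["served_by"] = instance_info(hostname)
    return tagged


if __name__ == "__main__":
    # Hypothetical prediction payload, for illustration only.
    print(tag_response({"label": "cat", "score": 0.97}))
```

Calling the service repeatedly behind a load-balanced Service and comparing the `served_by` values is then a quick way to confirm that requests are spread across replicas.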
Useful MLOps learning path: https://github.com/graviraja/MLOps-Basics