
Commit b376fc8

misohu, afgambin, and deusebio authored
feat: canonical k8s integration (#219)
* Fix MLflow deployment for Canonical k8s (#195): fix MLflow deployment, fix init container name
* Fix shared folder on Canonical k8s (#196)
* feat: run integration tests on canonical k8s (#203)
* fix: gpu tests to run on Canonical k8s (#204)
* feat: update docs for canonical k8s (#215), with Angel's review
* fix: use security context for the mounted volumes (#214)
* fix: store dss logs in snap common folder (#218)

Co-authored-by: afgambin <angel.fernandez@canonical.com>
Co-authored-by: deusebio <edeusebio85@gmail.com>
1 parent 63a889e commit b376fc8

33 files changed: +242 / -499 lines

.github/workflows/aws-integration-gpu.yaml

Lines changed: 39 additions & 12 deletions

@@ -40,21 +40,48 @@ jobs:
       - name: Checkout repository
         uses: actions/checkout@v4
 
-      # Issue: https://github.com/canonical/data-science-stack/issues/116
-      - name: Setup operator environment
-        run: sudo -E bash .github/workflows/setup_environment.sh
-        env:
-          MICROK8S_CHANNEL: 1.28/stable
+      - name: Install python
+        run: |
+          sudo add-apt-repository ppa:deadsnakes/ppa -y
+          sudo apt update -y
+          VERSION=3.10
+          sudo apt install python3.10 python3.10-distutils python3.10-venv -y
+          wget https://bootstrap.pypa.io/get-pip.py
+          python3.10 get-pip.py
+          python3.10 -m pip install tox
+          rm get-pip.py
+
+      - name: Install and setup Canonical k8s
+        run: |
+          sudo snap install kubectl --classic
+          sudo snap install k8s --classic --channel=1.32-classic/stable
+          sudo k8s bootstrap
+          sudo k8s enable local-storage
+          mkdir -p ~/.kube
+          sudo k8s config > ~/.kube/config
+          sudo chown $(id -u):$(id -g) ~/.kube/config
+
+      - name: Enable NVIDIA operator
+        run: |
+          export KUBECONFIG=~/.kube/config
+          curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
+            && chmod 700 get_helm.sh \
+            && ./get_helm.sh
+
+          helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
+            && helm repo update
+
+          helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
 
-      - name: Setup microk8s
-        run: sudo -E bash .github/workflows/configure_microk8s.sh
-        env:
-          MICROK8S_ADDONS: "storage dns rbac gpu"
+          # Wait until the GPU operator validations pass
+          while ! kubectl logs -n gpu-operator -l app=nvidia-operator-validator | grep "all validations are successful"; do
+            echo "Waiting for GPU operator validations to pass..."
+            sleep 5
+          done
 
-      - name: Run tests as root
+      - name: Run tests
         run: |
-          sudo snap alias microk8s.kubectl kubectl
-          tox -e integration-gpu -- -vv -s
+          sudo tox -e integration-gpu -- -vv -s
 
   stop-runner:
     name: Stop self-hosted EC2 runner
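The "Enable NVIDIA operator" step added in this workflow ends with a poll loop that waits for the validator pods. A minimal standalone sketch of that poll-until-ready pattern, with a stub `emit` function standing in for the real `kubectl logs` call so it runs anywhere (the function names and retry budget are illustrative, not part of the commit):

```shell
#!/bin/sh
# Poll a command until its output contains a marker string, giving up after a
# fixed number of attempts. In the workflow the polled command is
# `kubectl logs -n gpu-operator -l app=nvidia-operator-validator`; here a
# stub `emit` stands in so the sketch is self-contained.
emit() { echo "all validations are successful"; }

wait_for_marker() {
  marker=$1
  max_tries=$2
  tries=0
  until emit | grep -q "$marker"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "timed out waiting for: $marker" >&2
      return 1
    fi
    sleep 1
  done
  echo "ready"
}

wait_for_marker "all validations are successful" 5
```

Bounding the retries (the workflow loop, as written, polls forever) is a common hardening step so a broken operator fails the job instead of hanging the runner.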

.github/workflows/build_and_test_snap.yaml

Lines changed: 1 addition & 1 deletion

@@ -22,7 +22,7 @@ jobs:
       - name: Checkout repository
         uses: actions/checkout@v4
 
-      - uses: snapcore/action-build@v1.2.0
+      - uses: snapcore/action-build@v1.3.0
         id: snapcraft
         with:
           # Use a deterministic file name, so we can easily reference it later

.github/workflows/configure_microk8s.sh

Lines changed: 0 additions & 18 deletions
This file was deleted.

.github/workflows/setup_environment.sh

Lines changed: 0 additions & 22 deletions
This file was deleted.

.github/workflows/tests.yaml

Lines changed: 15 additions & 8 deletions

@@ -36,8 +36,6 @@ jobs:
       # is resolved and applied to all ROCKs repositories
       - name: Install python version from input
         run: |
-          # TODO: remove when fixed: https://github.com/microsoft/linux-package-repositories/issues/130#issuecomment-2074645171
-          sudo rm /etc/apt/sources.list.d/microsoft-prod.list
           sudo add-apt-repository ppa:deadsnakes/ppa -y
           sudo apt update -y
           VERSION=3.10
@@ -47,11 +45,20 @@ jobs:
           python3.10 -m pip install tox
           rm get-pip.py
 
-      - uses: balchua/microk8s-actions@v0.3.2
-        with:
-          channel: '1.28/stable'
-          addons: '["hostpath-storage"]'
+      - name: Install dependencies
+        run: |
+          # Removing docker as it is blocking canonical k8s bootstrap
+          # Based on this guide https://documentation.ubuntu.com/canonical-kubernetes/latest/snap/howto/install/dev-env/#containerd-conflicts
+          sudo apt-get remove -y docker-ce docker-ce-cli containerd.io
+          sudo rm -rf /run/containerd
+
+          sudo snap install kubectl --classic
+          sudo snap install k8s --classic --channel=1.32-classic/stable
+          sudo k8s bootstrap
+          sudo k8s enable local-storage
+          mkdir -p ~/.kube
+          sudo k8s config > ~/.kube/config
 
-      - name: Install library
-        run: sg microk8s -c "tox -vve integration"
+      - name: Run tests
+        run: tox -vve integration
 
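The "Install dependencies" step in this workflow ends by exporting the cluster config into `~/.kube/config`. A runnable sketch of that export pattern, using a temporary directory in place of `$HOME` and a `printf` stand-in for `sudo k8s config` (the stub file content and the `KUBE_DIR` variable are illustrative):

```shell
#!/bin/sh
# Sketch of the kubeconfig export step from the workflow. A printf stands in
# for `sudo k8s config` so this runs without a cluster; a temp dir stands in
# for $HOME.
KUBE_DIR="$(mktemp -d)/.kube"
mkdir -p "$KUBE_DIR"

# Stand-in for: sudo k8s config > ~/.kube/config
printf 'apiVersion: v1\nkind: Config\n' > "$KUBE_DIR/config"

# The kubeconfig carries cluster credentials, so keep it owner-readable only
# (the GPU workflow additionally chowns it to the invoking user).
chmod 600 "$KUBE_DIR/config"

head -n 1 "$KUBE_DIR/config"
```

Writing the file as root via `sudo ... >` then fixing ownership, as the GPU workflow does with `chown $(id -u):$(id -g)`, matters because the redirection runs as the invoking user while the command runs as root.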

docs/explanation/dss-arch.rst

Lines changed: 19 additions & 25 deletions

@@ -1,10 +1,10 @@
 DSS architecture
 ================
 
-This guide provides an overview of the Data science stack (DSS) architecture, its main components and their interactions.
+This guide provides an overview of the Data Science Stack (DSS) architecture, its main components, and their interactions.
 
 DSS is a ready-to-run environment for Machine Learning (ML) and Data Science (DS).
-It's built on open-source tooling, including `MicroK8s`_, JupyterLab and `MLflow <https://ubuntu.com/blog/what-is-mlflow>`_.
+It's built on open-source tooling, including `Canonical K8s`_, JupyterLab, and `MLflow <https://ubuntu.com/blog/what-is-mlflow>`_.
 
 DSS is distributed as a `snap`_ and usable on any Ubuntu workstation.
 This provides robust security management and user-friendly version control, enabling seamless updates and auto-rollback in case of failure.
@@ -52,22 +52,22 @@ ML tools
 
 DSS includes:
 
-* Jupyter Notebooks: Open source environment that provides a flexible interface to organise DS projects and ML workloads.
-* MLflow: Open source platform for managing the ML life cycle, including experiment tracking and model registry.
-* ML frameworks: DSS comes by default with PyTorch and Tensorflow. Users can manually add other frameworks, depending on their needs and use cases.
+* Jupyter Notebooks: Open-source environment that provides a flexible interface to organise DS projects and ML workloads.
+* MLflow: Open-source platform for managing the ML life cycle, including experiment tracking and model registry.
+* ML frameworks: DSS comes by default with PyTorch and TensorFlow. Users can manually add other frameworks, depending on their needs and use cases.
 
 Jupyter Notebooks
 ^^^^^^^^^^^^^^^^^
 
-A `Jupyter Notebook <Jupyter Notebooks_>`_ is essentially a `Kubernetes deployment <Pod_>`_, also known as `Pod`, running a Docker image with Jupyter Lab and a dedicated ML framework, such as Pytorch or Tensorflow.
+A `Jupyter Notebook <Jupyter Notebooks_>`_ is essentially a `Kubernetes deployment <Pod_>`_, also known as `Pod`, running a Docker image with Jupyter Lab and a dedicated ML framework, such as PyTorch or TensorFlow.
 For each Jupyter Notebook, DSS mounts a `Hostpath <Microk8s hostpath docs_>`_ directory-backed persistent volume to the data directory.
 All Jupyter Notebooks share the same persistent volume, allowing them to exchange data seamlessly.
 The full path to that persistent volume is `/home/jovyan/shared`.
 
 MLflow
 ^^^^^^
 
-`MLflow <https://ubuntu.com/blog/what-is-mlflow>`_ operates in `local mode <https://mlflow.org/docs/latest/tracking.html#other-configuration-with-mlflow-tracking-server>`_,
+`MLflow <https://ubuntu.com/blog/what-is-mlflow>`_ operates in `local mode <https://mlflow.org/docs/latest/tracking/#other-tracking-setup>`_,
 meaning that metadata and artefacts are, by default, stored in a local directory.
 
 This local directory is backed by a persistent volume, mounted to a Hostpath directory of the MLflow Pod.
@@ -79,27 +79,22 @@ Orchestration
 ~~~~~~~~~~~~~
 
 DSS requires a container orchestration solution.
-DSS relies on `MicroK8s`_, a lightweight Kubernetes distribution.
+DSS relies on `Canonical K8s`_, a lightweight Kubernetes distribution.
 
-Therefore, MicroK8s needs to be deployed before installing DSS on the host machine.
-It must be configured with the storage add-on.
-This is required to use Hostpath storage in the cluster.
-See :ref:`set_microk8s` to learn how to install MicroK8s.
+Therefore, Canonical K8s needs to be deployed before installing DSS on the host machine.
+It must be configured with local storage support to handle persistent volumes used by DSS.
 
 .. _gpu_support:
 
 GPU support
 ^^^^^^^^^^^
 
 DSS can run with or without the use of GPUs.
-If needed, MicroK8s can be configured with the desired `GPU add-on <https://microk8s.io/docs/addon-gpu>`_.
-
-DSS is designed to support the deployment of containerised GPU workloads on NVIDIA GPUs.
-MicroK8s simplifies the GPU access and usage through the `NVIDIA GPU Operator <NVIDIA Operator_>`_.
+If needed, follow `NVIDIA GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html>`_ for deployment details.
 
 DSS does not automatically install the tools and libraries required for running GPU workloads.
-To do so, it relies on MicroK8s for the required operating-system drivers.
-It also relies on the chosen image, for example, CUDA when working with NVIDIA GPUs.
+It relies on Canonical K8s for the required operating-system drivers.
+It also depends on the chosen image, for example, CUDA when working with NVIDIA GPUs.
 
 .. caution::
    GPUs from other silicon vendors rather than NVIDIA can be configured. However, its functionality is not guaranteed.
@@ -108,16 +103,16 @@ Storage
 ^^^^^^^
 
 DSS expects a default `storage class <https://kubernetes.io/docs/concepts/storage/storage-classes/>`_ in the Kubernetes deployment, which is used to persist Jupyter Notebooks and MLflow artefacts.
-In MicroK8s, the Hostpath storage add-on is chosen, used to provision Kubernetes' *PersistentVolumeClaims* (`PVCs <https://kubernetes.io/docs/concepts/storage/persistent-volumes/>`_).
+In Canonical K8s, a local storage class should be configured to provision Kubernetes' *PersistentVolumeClaims* (`PVCs <https://kubernetes.io/docs/concepts/storage/persistent-volumes/>`_).
 
 A shared PVC is used across all Jupyter Notebooks to share and persist data.
 MLflow also uses its dedicated PVC to store the logged artefacts.
 This is the DSS default storage configuration and cannot be altered.
 
-This choice ensures that all storage is backed up on the host machine in the event of MicroK8s restarts.
+This choice ensures that all storage is backed up on the host machine in the event of cluster restarts.
 
 .. note::
-   By default, you can access the DSS storage anytime under your local directory `/var/snap/microk8s/common/default-storage`.
+   By default, you can access the DSS storage anytime under your local directory `/var/snap/k8s/common/default-storage`.
 
 The following diagram summarises the DSS storage:
 
@@ -132,7 +127,7 @@ The following diagram summarises the DSS storage:
 Operating system
 ~~~~~~~~~~~~~~~~
 
-DSS is native on Ubuntu, being developed, tested and validated on it.
+DSS is native on Ubuntu, being developed, tested, and validated on it.
 Moreover, the solution can be used on any Linux distribution.
 
 Namespace configuration
@@ -147,8 +142,7 @@ This includes the GPU Operator for managing access and usage.
 Accessibility
 -------------
 
-Jupyter Notebooks and MLflow can be accessed from a web browser through the Pod IP that is given access through MicroK8s.
+Jupyter Notebooks and MLflow can be accessed from a web browser through the Pod IP that is given access through Canonical K8s.
 See :ref:`access_notebook` and :ref:`access_mlflow` for more details.
 
-
-
+.. _Canonical K8s: https://snapcraft.io/k8s
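The storage note updated above says DSS data lives on the host under `/var/snap/k8s/common/default-storage`. A hedged sketch of checking that path on a host, written so it runs anywhere and just reports absence when the k8s snap is not installed (the `DSS_STORE` variable name is illustrative):

```shell
#!/bin/sh
# Check the host-side DSS storage path mentioned in the architecture note.
# On a machine without the Canonical K8s snap this only reports that the
# directory is absent; it makes no changes either way.
DSS_STORE=/var/snap/k8s/common/default-storage
if [ -d "$DSS_STORE" ]; then
  echo "DSS storage found at $DSS_STORE"
  ls "$DSS_STORE"
else
  echo "no DSS storage at $DSS_STORE (is the k8s snap installed?)"
fi
```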

docs/how-to/dss.rst

Lines changed: 14 additions & 15 deletions

@@ -8,10 +8,10 @@ This guide describes how to manage Data Science Stack (DSS).
 DSS is a Command Line Interface (CLI)-based environment and distributed as a `snap`_.
 
 Install
---------
+-------
 
 .. note::
-   To install DSS, ensure you have previously installed `Snap`_ and `MicroK8s`_.
+   To install DSS, ensure you have previously installed `Snap`_ and `Canonical K8s`_.
 
 You can install DSS using ``snap`` as follows:
 
@@ -26,20 +26,20 @@ Then, you can run the DSS CLI with:
    dss
 
 Initialise
------------
+----------
 
 You can initialise DSS through ``dss initialize``.
 This command:
 
-* Stores credentials for the MicroK8s cluster.
+* Stores credentials for the Canonical K8s cluster.
 * Allocates storage for your DSS Jupyter Notebooks.
 * Deploys an `MLflow <MLflow Docs_>`_ model registry.
 
 .. code-block:: shell
 
-   dss initialize --kubeconfig "$(sudo microk8s config)"
+   dss initialize --kubeconfig "$(sudo k8s config)"
 
-The ``--kubeconfig`` option is used to provide your MicroK8s cluster's kubeconfig.
+The ``--kubeconfig`` option is used to provide your Canonical K8s cluster's kubeconfig.
 
 .. note::
    Note the use of quotes for the ``--kubeconfig`` option. Without them, the content may be interpreted by your shell.
@@ -61,9 +61,9 @@ You should expect an output like this:
    dss create my-notebook --image=kubeflownotebookswg/jupyter-scipy:v1.8.0
 
 Remove
--------
+------
 
-You can remove DSS from your MicroK8s cluster through ``dss purge``.
+You can remove DSS from your Canonical K8s cluster through ``dss purge``.
 This command purges all the DSS components, including:
 
 * All Jupyter Notebooks.
@@ -72,8 +72,8 @@ This command purges all the DSS components, including:
 
 .. note::
 
-   This action removes the components of the DSS environment, but it does not remove the DSS CLI or your MicroK8s cluster.
-   To remove those, `delete their snaps <https://snapcraft.io/docs/quickstart-tour>`_.
+   This action removes the components of the DSS environment, but it does not remove the DSS CLI or your Canonical K8s cluster.
+   To remove those, `delete their snaps <https://snapcraft.io/docs/get-started>`_.
 
 .. code-block:: bash
 
@@ -91,7 +91,7 @@ You should expect an output like this:
    Success: All DSS components and notebooks purged successfully from the Kubernetes cluster.
 
 Get status
------------ 
+----------
 
 You can check the DSS status through ``dss status``.
 This command provides a quick way to check the status of your DSS environment, including the MLflow status and whether a GPU is detected in your environment.
@@ -109,7 +109,7 @@ If you already have a DSS environment running and no GPU available, the expected
    GPU acceleration: Disabled
 
 List commands
---------------
+-------------
 
 You can get the list of available commands for DSS through the ``dss`` command with the ``--help`` option:
 
@@ -134,12 +134,11 @@ You should expect an output like this:
      list Lists all created notebooks in the DSS environment.
      logs Prints the logs for the specified notebook or DSS component.
      purge Removes all notebooks and DSS components.
-     remove Remove a Jupter Notebook in DSS with the name NAME.
+     remove Remove a Jupyter Notebook in DSS with the name NAME.
      start Starts a stopped notebook in the DSS environment.
      status Checks the status of key components within the DSS...
      stop Stops a running notebook in the DSS environment.
 
-
 **Get details about a specific command**:
 
 To see the usage and options of a DSS command, run ``dss <command>`` with the ``--help`` option.
@@ -174,4 +173,4 @@ See also
 --------
 
 * To learn how to manage your Jupyter Notebooks, check :ref:`manage_notebooks`.
-* If you are interested in managing MLflow within your DSS environment, see :ref:`manage_MLflow`.
+* If you are interested in managing MLflow within your DSS environment, see :ref:`manage_MLflow`.
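The docs hunk above keeps the note insisting on quotes around ``--kubeconfig "$(sudo k8s config)"``. A self-contained sketch of why the quotes matter: a kubeconfig is multi-line YAML, and an unquoted command substitution gets word-split by the shell before ``dss`` ever sees it (`fake_config` and `count_args` are illustrative stand-ins, not part of DSS):

```shell
#!/bin/sh
# Demonstrate shell word-splitting on an unquoted $(...) substitution.
# A real `sudo k8s config` emits many lines of YAML; two lines suffice here.
fake_config='apiVersion: v1
kind: Config'

# Counts how many separate arguments a command receives.
count_args() { echo "$#"; }

count_args $fake_config     # unquoted: split on whitespace into 4 words
count_args "$fake_config"   # quoted: passed through as 1 argument
```

Unquoted, ``dss initialize`` would receive only the first word of the kubeconfig as the option value and the rest as stray arguments, which is exactly the failure the note warns about.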
