This is a research project and more of a prototype: it is not meant to be long-lived software and may become obsolete with the next Kubeflow version. Consequently, it lacks substantial quality control measures; we wrote some tests, but only to speed up development. Consider it a short-term fix. :)
Kubeflow has partial multi-tenancy support; we extend that support to Katib and Kubeflow Pipelines, e.g. when used with Kale.
Additionally, we fix some integration issues with Jupyter notebooks, whose pods would otherwise run into errors when accessing the ml-pipeline API. The following components implement this:
- listen to new namespaces
- pod admission controller - for python notebooks
- trials admission controller
- job admission controller
- pod admission controller - for katib
When a new namespace is created, we create a new ServiceRoleBinding for it, giving pods in the new namespace partial access to Kubeflow Pipelines via their Istio sidecars (this is only done when a corresponding Profile exists).
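A rough sketch of what such a ServiceRoleBinding could look like, following the in-cluster access workaround discussed around kubeflow/pipelines#4440; the binding name, the ml-pipeline-services ServiceRole and the default-editor principal are assumptions, not necessarily what this project generates:

```yaml
# Sketch only: allow workloads in <namespace> (running as default-editor)
# to call the ml-pipeline services in the kubeflow namespace.
apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRoleBinding
metadata:
  name: bind-ml-pipeline-<namespace>   # illustrative name
  namespace: kubeflow
spec:
  roleRef:
    kind: ServiceRole
    name: ml-pipeline-services         # assumed, existing ServiceRole for the pipelines API
  subjects:
  - properties:
      source.principal: cluster.local/ns/<namespace>/sa/default-editor
```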
Additionally, on namespace creation we create an EnvoyFilter that allows full access to (and therefore usability of) Kubeflow Pipelines from other namespaces (based on kubeflow/pipelines#4440 (comment)).
The EnvoyFilter uses a label (e.g. kubeflow-extension = true) to identify which pods are granted access. It only grants access on behalf of the namespace owner; another user who was merely granted access to the namespace will not be able to run any ml-pipeline or Katib code.
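A sketch of what this EnvoyFilter could look like, following the workaround referenced above; the resource name, the header value and the exact EnvoyFilter schema (it changed between Istio versions) are assumptions:

```yaml
# Sketch only: for pods labelled kubeflow-extension=true, add the namespace
# owner's identity header to outbound requests towards ml-pipeline.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: add-kubeflow-userid            # illustrative name
  namespace: <namespace>
spec:
  workloadSelector:
    labels:
      kubeflow-extension: "true"
  configPatches:
  - applyTo: VIRTUAL_HOST
    match:
      context: SIDECAR_OUTBOUND
      routeConfiguration:
        vhost:
          name: ml-pipeline.kubeflow.svc.cluster.local:8888
    patch:
      operation: MERGE
      value:
        request_headers_to_add:
        - append: true
          header:
            key: kubeflow-userid
            value: <namespace-owner-email>   # identity of the profile owner
```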
When a new pod gets created, we inject the label kubeflow-extension = true, allowing this pod to access ml-pipelines via its Istio sidecar (the EnvoyFilter identifies this sidecar and therefore allows access to the ml-pipelines API).
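Inside the mutating webhook this amounts to a small JSON patch on the incoming pod (the trial webhook below does the same kind of thing); the snippet only illustrates the shape, not the exact patch this project emits:

```yaml
# Illustrative JSONPatch (rendered as YAML) returned by the pod webhook;
# assumes metadata.labels already exists on the pod
- op: add
  path: /metadata/labels/kubeflow-extension
  value: "true"
```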
When a new trial gets created, we check whether modification is necessary. Trials started from within Katib run in the kubeflow namespace (at this point in time, 12.10.2020); modifying these trials is unnecessary.
If the trial is not created in the kubeflow namespace, we add a kubeflow-extension=true label so the corresponding jobs and pods can be identified in the following steps.
- tag the pod with the kubeflow-extension label (for identification and access through the EnvoyFilter)
- additionally, we change the service account to default-editor (the job was almost certainly started by the default-editor, and the pipeline-runner SA does not have enough rights to pass the Envoy/Istio RBAC)
- finally, we set the Istio sidecar injection label; doing this on the pod would be too late, as the pod has already passed through the Istio webhook (the combined result is sketched below)
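Taken together, the relevant parts of the job's pod template would end up looking roughly like this; whether sidecar injection is switched on via a label or the sidecar.istio.io/inject annotation depends on the Istio setup, so treat the keys below as assumptions:

```yaml
# Sketch: mutated job pod template (only the relevant fields)
spec:
  template:
    metadata:
      labels:
        kubeflow-extension: "true"        # picked up by the EnvoyFilter
      annotations:
        sidecar.istio.io/inject: "true"   # assumption: injection toggled via annotation
    spec:
      serviceAccountName: default-editor  # instead of pipeline-runner
```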
Multiple containers in the created pod need small behaviour changes. Three containers exist in the pod:
- experiment executor container
- Katib metrics collector container
- Istio (sidecar) container

Two adjustments are necessary (sketched below):
- the experiment executor container must sleep a few seconds initially, otherwise the Istio proxy is not yet working; unfortunately some extensions (Kale) aren't that robust
- the Katib metrics collector container must kill the Istio container after execution
The second adjustment brings additional problems, e.g. regarding exit codes; this is explained in detail in the code.
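To illustrate these two adjustments, the webhook could wrap the container commands roughly as follows; the sleep duration, the placeholder commands and the Envoy admin endpoint used to shut the sidecar down are assumptions and differ between Istio versions:

```yaml
# Sketch: wrapped commands in the trial pod (illustrative names and commands)
containers:
- name: experiment-executor
  command:
  - sh
  - -c
  - sleep 10 && <original experiment command>      # give the Istio proxy time to come up
- name: metrics-collector
  command:
  - sh
  - -c
  # run the collector, then ask the Envoy sidecar to quit so the pod can finish;
  # the admin port/endpoint (15000/quitquitquit) is an assumption
  - <original collector command>; curl -sf -XPOST http://localhost:15000/quitquitquit
```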
This whole endeavour is experimental. We tried to run Kubeflow on vSphere within a TKG cluster, hit some issues, and tried to fix them so we could play a bit with this technology in our lab.
Not everything works perfectly, and some of it may be fixed by Kubeflow itself in the future; but it is enough for us to explore the full capabilities of Kubeflow to an extent we have not seen in any other project.
- experiments suddenly fail without reason
This is a limitation of the Istio version in use (1.3); unfortunately there does not seem to be a fix. We did, however, find a strategy that mitigates the symptom (the error happens substantially less often). We believe it is caused by the following issue: istio/istio#17383
- experiments are not finishing
Experiments just stop running after a while; the reason is unknown. It may or may not be caused by some of the fixes we introduce. Sometimes they complete, sometimes they don't.
- apply k8s resources for NS, RBAC
- build & push docker image
- apply k8s resource for deployment
- create a certificate for admission review registration
- apply k8s resource for admission webhooks
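For orientation, the admission webhook registration applied in the last step has roughly the following shape; the repo's webhooks.yaml is the source of truth, and the webhook name, service, path and rules below are assumptions:

```yaml
# Illustrative sketch of one entry in webhooks.yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: tenancy-fixer
webhooks:
- name: pods.tenancy-fixer.kubeflow-extension.svc   # illustrative name
  clientConfig:
    service:
      name: tenancy-fixer          # assumed service name
      namespace: kubeflow-extension
      path: /mutate/pods           # assumed path
    caBundle: ""                   # filled in by the sed command below
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  failurePolicy: Ignore            # assumption
```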
#setup k8s resources
#ns, role, role binding for psp and role binding for api-server access
kubectl apply -f resources.yaml
#build package and push docker image
mvn package -DskipTests
docker build -f src/main/docker/Dockerfile.jvm -t keeyzar/tenancy-fixer-jvm .
docker push keeyzar/tenancy-fixer-jvm
#create certificate request and sign it
#make sure go, cfssl and cfssljson are installed
chmod +x create-cert.sh
./create-cert.sh
#deploy tenancy-fixer
kubectl apply -f tenancy-fixer.yaml
mutator=$(kubectl get pods --selector=app=tenancy-fixer -ojsonpath='{.items[*].metadata.name}')
kubectl wait --for=condition=Ready --timeout=300s pod/$mutator
#obtain certificate from pod, which the api-server should utilize as public-key
controller=$(kubectl get pods --selector=app=tenancy-fixer -o jsonpath='{.items[*].metadata.name}')
cert=$(kubectl exec $controller -- cat /var/run/secrets/kubernetes.io/serviceaccount/ca.crt | base64 | tr -d '\n')
sed -i.bak -E "s|caBundle:.*|caBundle: $cert|" webhooks.yaml
kubectl apply -f webhooks.yaml
#done :)
For debugging purposes, get the logs of the tenancy-fixer-* pod in the kubeflow-extension namespace:
kubectl logs -n kubeflow-extension --selector=app=tenancy-fixer --tail=-1