
[MRG] Adding Turing cluster to the Federation #1203

Merged 72 commits into master on Jan 15, 2020

Conversation

sgibson91
Member

@sgibson91 commented Oct 15, 2019

Summary

This PR will add the Turing cluster to the Federation. I will try to keep this top comment updated with implemented changes.

Related issues: #1154, jupyterhub/team-compass#205, jupyterhub/team-compass#217

What's changed

  • Added config/turing.yaml
  • Added secrets/config/turing.yaml containing Turing secrets
  • Added Azure auth to deploy.py and included secrets/turing-auth-key-prod.json
  • Added an extra run of deploy.py for turing cluster to .travis.yml
  • Configured the Turing cluster to use cert-manager (instead of kube-lego) and included cluster-issuer.yaml (a quick check is sketched below)
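
As a hedged aside (not something this PR adds): once cluster-issuer.yaml has been applied, cert-manager should report the issuer and whether it is ready. <issuer-name> below is a placeholder for whatever the issuer is called in that file.

$ kubectl get clusterissuer
$ kubectl describe clusterissuer <issuer-name>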

TODOs before ready for review

  • Add credentials to secrets
  • Adjust deploy.py
  • Add extra run of deploy.py to .travis.yml
  • turing.mybinder.org in config/turing.yaml and config/turing/turing_mybinder_org_ingress.yaml may need to be swapped for a test subdomain

TODOs for review

  • Check values in config/turing.yaml
  • Check secrets/turing.yaml

TODOs before merging

  • Copy ovh.yaml to turing.yaml
  • Remove refs to ovh throughout
  • Leave blank the things I don't know yet 😄
  • Configure ingress for turing.mybinder.org
      • May need replacing with a test subdomain
      • Took some guesses at the various hosts
@betatim
Member

betatim commented Oct 15, 2019

We will also have to add the credentials to secrets/, make some adjustments to deploy.py and then add an extra run of deploy.py to .travis.yml.

We should do a quick review of all the support services (matomo, gcs, redirector, etc) to see which of those we need and which we should turn off for members of the federation.

@sgibson91
Member Author

@betatim @minrk @choldgraf (and anyone else!) Can I get an early review of this? I'm probably missing a whole load of stuff, so planning the next stages of the PR would be very useful for me. Thanks!

.travis.yml (resolved review thread, outdated)
Thanks @manics!

Co-Authored-By: Simon Li <[email protected]>
@betatim
Member

betatim commented Oct 22, 2019

I'd start deploying this to see what happens. Two things to do first: switch to temp domains and remove some of the top-level keys that were flagged in the team-compass issue as "we probably don't need those".

Before we do this we need to replace all the hostnames with nip.io/xip.io domains. You can do a first test by running deploy.py locally. Then I'd start deploying it from this repo because "why not".
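
(For context: nip.io/xip.io are wildcard DNS services, so any hostname with an IP address embedded in it resolves back to that IP and no real DNS records are needed for a test deploy. The IP below is only an illustrative placeholder.)

$ dig +short 203.0.113.10.nip.io   # resolves to 203.0.113.10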

@sgibson91
Member Author

Running deploy.py locally now

@sgibson91
Member Author

sgibson91 commented Oct 22, 2019

Running deploy.py locally hangs in the setup_helm function, waiting for tiller to come up.

Console output:

$ python deploy.py prod turing
The behavior of this command has been altered by the following extension: aks-preview
Merged "prod" as current context in /Users/sgibson/.kube/config

$HELM_HOME has been configured at /Users/sgibson/.helm.

Tiller (the Helm server-side component) has been upgraded to the current version.
Waiting for deployment "tiller-deploy" rollout to finish: 0 of 1 updated replicas are available...

Pod status:

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   coredns-7fc597cc45-jjvnn               1/1     Running   0          3d15h
kube-system   coredns-7fc597cc45-jwhfl               1/1     Running   0          3d15h
kube-system   coredns-autoscaler-65d7986c6b-8wg25    1/1     Running   0          7d
kube-system   kube-proxy-dt4n9                       1/1     Running   0          3d15h
kube-system   kube-proxy-gh287                       1/1     Running   0          3d15h
kube-system   kube-proxy-pslsz                       1/1     Running   0          3d15h
kube-system   kube-proxy-v6qlc                       1/1     Running   0          59s
kube-system   kube-proxy-x54b6                       1/1     Running   0          3d15h
kube-system   kubernetes-dashboard-cc4cc9f58-4jq82   1/1     Running   0          7d
kube-system   metrics-server-66dbbb67db-qdbrm        1/1     Running   0          7d
kube-system   tiller-deploy-647648f55f-6pdxb         0/1     Pending   0          40m
kube-system   tunnelfront-64d7844cb8-nj4pv           1/1     Running   0          7d

Events in the tiller-deploy pod (no longer truncated):

$ kubectl describe pod tiller-deploy-6cb49b745c-mm6q4 -n kube-system 
Name:           tiller-deploy-6cb49b745c-mm6q4
Namespace:      kube-system
Priority:       0
Node:           <none>
Labels:         app=helm
                name=tiller
                pod-template-hash=6cb49b745c
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/tiller-deploy-6cb49b745c
Containers:
  tiller:
    Image:       gcr.io/kubernetes-helm/tiller:v2.11.0
    Ports:       44134/TCP, 44135/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      /tiller
      --listen=localhost:44134
    Liveness:   http-get http://:44135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:44135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      TILLER_NAMESPACE:              kube-system
      TILLER_HISTORY_MAX:            0
      KUBERNETES_PORT_443_TCP_ADDR:  prod-binder-prod-7468bc-f0ec1c2a.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:               tcp://prod-binder-prod-7468bc-f0ec1c2a.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://prod-binder-prod-7468bc-f0ec1c2a.hcp.westeurope.azmk8s.io:443
      KUBERNETES_SERVICE_HOST:       prod-binder-prod-7468bc-f0ec1c2a.hcp.westeurope.azmk8s.io
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from tiller-token-mmcg6 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  tiller-token-mmcg6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tiller-token-mmcg6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  mybinder.org/pool-type=core
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Warning  FailedScheduling   24s (x4 over 3m4s)   default-scheduler   0/5 nodes are available: 5 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  2s (x18 over 2m55s)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match node selector

I do have labels on the three nodepools, maybe that's affecting this?

  • aks-core-... -> nodepurpose=core
  • aks-user-... -> nodepurpose=user
  • aks-default-... -> no label

The above may not be true; I might be getting confused with Hub23. Here are the labels the nodes actually have:

$ kubectl get nodes --show-labels                                   
NAME                              STATUS   ROLES   AGE   VERSION   LABELS
aks-core-28368152-vmss000000      Ready    agent   7d    v1.14.6   agentpool=core,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westeurope,failure-domain.beta.kubernetes.io/zone=0,kubernetes.azure.com/cluster=MC_binder-prod_prod_westeurope,kubernetes.azure.com/role=agent,kubernetes.io/arch=amd64,kubernetes.io/hostname=aks-core-28368152-vmss000000,kubernetes.io/os=linux,kubernetes.io/role=agent,node-role.kubernetes.io/agent=,storageprofile=managed,storagetier=Premium_LRS
aks-default-28368152-vmss000000   Ready    agent   7d    v1.14.6   agentpool=default,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westeurope,failure-domain.beta.kubernetes.io/zone=0,kubernetes.azure.com/cluster=MC_binder-prod_prod_westeurope,kubernetes.azure.com/role=agent,kubernetes.io/arch=amd64,kubernetes.io/hostname=aks-default-28368152-vmss000000,kubernetes.io/os=linux,kubernetes.io/role=agent,node-role.kubernetes.io/agent=,storageprofile=managed,storagetier=Premium_LRS
aks-user-28368152-vmss000000      Ready    agent   7d    v1.14.6   agentpool=user,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westeurope,failure-domain.beta.kubernetes.io/zone=0,kubernetes.azure.com/cluster=MC_binder-prod_prod_westeurope,kubernetes.azure.com/role=agent,kubernetes.io/arch=amd64,kubernetes.io/hostname=aks-user-28368152-vmss000000,kubernetes.io/os=linux,kubernetes.io/role=agent,node-role.kubernetes.io/agent=,storageprofile=managed,storagetier=Premium_LRS
aks-user-28368152-vmss000001      Ready    agent   7d    v1.14.6   agentpool=user,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westeurope,failure-domain.beta.kubernetes.io/zone=1,kubernetes.azure.com/cluster=MC_binder-prod_prod_westeurope,kubernetes.azure.com/role=agent,kubernetes.io/arch=amd64,kubernetes.io/hostname=aks-user-28368152-vmss000001,kubernetes.io/os=linux,kubernetes.io/role=agent,node-role.kubernetes.io/agent=,storageprofile=managed,storagetier=Premium_LRS

Should I manually set nodeSelectors?
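
(A hedged sketch of one option, not necessarily the right fix: first confirm which selector the pending pod actually wants, then label the core node to match it. The pod and node names are taken from the output above.)

$ kubectl get pod tiller-deploy-6cb49b745c-mm6q4 -n kube-system -o jsonpath='{.spec.nodeSelector}'
$ # if that reports mybinder.org/pool-type=core, label the core node so the selector matches:
$ kubectl label node aks-core-28368152-vmss000000 mybinder.org/pool-type=core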

@sgibson91
Member Author

Ran kubectl rollout undo deployment tiller-deploy -n kube-system, which got the tiller-deploy pod out of Pending status for the time being.

@sgibson91
Member Author

sgibson91 commented Oct 22, 2019

Further info:

@sgibson91
Member Author

Above issue resolved: the args to deploy.py needed to be turing turing, not prod turing.

Now, we get a different issue:

$ python deploy.py turing turing
The behavior of this command has been altered by the following extension: aks-preview
Merged "prod" as current context in /Users/sgibson/.kube/config

$HELM_HOME has been configured at /Users/sgibson/.helm.

Tiller (the Helm server-side component) has been upgraded to the current version.
Happy Helming!
deployment "tiller-deploy" successfully rolled out
Updating network-bans for turing
Traceback (most recent call last):
  File "secrets/ban.py", line 82, in <module>
    update_stub_dns(opts.context)
  File "secrets/ban.py", line 69, in update_stub_dns
    "kube-dns", "kube-system", {"data": {"stubDomains": stub_json}}
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 14714, in patch_namespaced_config_map
    (data) = self.patch_namespaced_config_map_with_http_info(name, namespace, body, **kwargs)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 14814, in patch_namespaced_config_map_with_http_info
    collection_formats=collection_formats)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 393, in request
    body=body)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/site-packages/kubernetes/client/rest.py", line 286, in PATCH
    body=body)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'cfdf6b3f-c79c-40a8-85bb-eef1c690f851', 'Content-Type': 'application/json', 'Date': 'Tue, 22 Oct 2019 11:31:10 GMT', 'Content-Length': '196'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"configmaps \"kube-dns\" not found","reason":"NotFound","details":{"name":"kube-dns","kind":"configmaps"},"code":404}


Traceback (most recent call last):
  File "deploy.py", line 222, in <module>
    main()
  File "deploy.py", line 218, in main
    deploy(args.release)
  File "deploy.py", line 152, in deploy
    "secrets/ban.py",
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python3', 'secrets/ban.py']' returned non-zero exit status 1.

This commit implements @minrk's suggestion of exposing a configMap
as a variable. This (hopefully) circumvents the issue that the
AKS cluster runs coredns whereas GKE runs kube-dns.
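
(A quick, hedged way to confirm that mismatch on a cluster is to list the DNS-related ConfigMaps in kube-system; AKS ships CoreDNS while GKE ships kube-dns.)

$ kubectl get configmap -n kube-system | grep -i dns
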
@sgibson91
Member Author

This is where I'm up to now. Not quite sure what the problem is here.

$ python deploy.py turing turing               
The behavior of this command has been altered by the following extension: aks-preview
Merged "turing" as current context in /Users/sgibson/.kube/config

$HELM_HOME has been configured at /Users/sgibson/.helm.

Tiller (the Helm server-side component) has been updated to gcr.io/kubernetes-helm/tiller:v2.15.0 .
deployment "tiller-deploy" successfully rolled out
Updating network-bans for turing
Starting helm upgrade for turing
Release "turing" does not exist. Installing it now.
Error: render error in "mybinder/charts/binderhub/templates/deployment.yaml": template: mybinder/charts/binderhub/templates/deployment.yaml:9:15: executing "mybinder/charts/binderhub/templates/deployment.yaml" at <eq .Values.replicas 1.0>: error calling eq: incompatible types for comparison
Traceback (most recent call last):
  File "deploy.py", line 233, in <module>
    main()
  File "deploy.py", line 227, in main
    deploy(args.release, "turing")
  File "deploy.py", line 176, in deploy
    subprocess.check_call(helm)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--namespace', 'turing', 'turing', 'mybinder', '--force', '--wait', '--timeout', '600', '-f', 'config/turing.yaml', '-f', 'secrets/config/common.yaml', '-f', 'secrets/config/turing.yaml']' returned non-zero exit status 1.

@sgibson91
Member Author

Switched the helm version to 2.11.0; the error in #1203 (comment) has now been replaced with the following:

$ python deploy.py turing turing
The behavior of this command has been altered by the following extension: aks-preview
Merged "turing" as current context in /Users/sgibson/.kube/config

$HELM_HOME has been configured at /Users/sgibson/.helm.

Tiller (the Helm server-side component) has been upgraded to the current version.
Happy Helming!
deployment "tiller-deploy" successfully rolled out
Updating network-bans for turing
Starting helm upgrade for turing
Release "turing" does not exist. Installing it now.
Error: render error in "mybinder/templates/static/ingress.yaml": template: mybinder/templates/static/ingress.yaml:21:29: executing "mybinder/templates/static/ingress.yaml" at <.Values.static.ingre...>: can't evaluate field hosts in type interface {}
Traceback (most recent call last):
  File "deploy.py", line 233, in <module>
    main()
  File "deploy.py", line 227, in main
    deploy(args.release, "turing")
  File "deploy.py", line 176, in deploy
    subprocess.check_call(helm)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--namespace', 'turing', 'turing', 'mybinder', '--force', '--wait', '--timeout', '600', '-f', 'config/turing.yaml', '-f', 'secrets/config/common.yaml', '-f', 'secrets/config/turing.yaml']' returned non-zero exit status 1.

Seems to be complaining about the {{ }} templating features in mybinder/templates/static/ingress.yaml.
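
(A hedged debugging idea, not something run here: rendering just that template locally with Helm 2 shows which values it expects. The secrets values files are left out of this sketch, and the template path may need the chart-name prefix depending on the Helm version.)

$ helm template mybinder -f config/turing.yaml -x templates/static/ingress.yaml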

@sgibson91
Member Author

I'm kind of stuck in that Let's Encrypt issues a certificate for hub.mybinder.turing.ac.uk but not binder.mybinder.turing.ac.uk, and I'm not sure how to fix it :/

@choldgraf
Member

Perhaps this is something that @bitnik or @arnim had to figure out for the GESIS hub and they could provide advice?

@sgibson91
Member Author

sgibson91 commented Jan 7, 2020

I think @consideRatio is the most likely to know what's going on here, unfortunately for us he's enjoying his vacation! Maybe @yuvipanda has some insight?

Further info: #1154 (comment)

Member

@consideRatio left a comment


@sgibson91 I looked through the changes. I didn't notice any configuration of cert-manager's helm chart. If that isn't done, cert-manager will need some annotations on the ingress objects before it cares about them, beyond the tls-acme: true annotation, which acts like a "use the defaults etc, let's go!".

Things I check when certificate allocation goes wrong:

  1. Does the ingress object exist? If not, it's not about cert-manager.
  2. Does the secret named in the ingress configuration under tls: exist? If not...
  3. Does the cert-manager controller even attempt to get a certificate? Check its logs and check whether a Kubernetes certificate object has been created. If it has, then the annotations on the ingress resource were enough to trigger an attempt. Use kubectl describe certificate ... on it to see its status and events, which summarize the work done by the cert-manager controller to get a TLS secret.

Typical issues:

  • the controller doesn't attempt to get a certificate, because there is no valid default configuration and no annotations overriding the default configuration on the ingress objects
  • the cert-manager controller does attempt, but fails. Its attempt works like this, AFAIK: a) it tells Let's Encrypt that it wants to prove it is in control of a domain and wants a certificate, b) Let's Encrypt challenges it, using the http01 kind of ACME challenge, to make sure a request sent to the domain at a certain path responds with a specific key, c) the cert-manager controller creates a pod to be the responding webserver replying with the key, d) the controller now needs to make sure traffic is redirected to that pod, so it creates a new temporary k8s ingress resource to direct traffic there (unless it modifies the existing one, which is also a possibility), e) it satisfies the challenge if the pod receives the traffic and can respond back to Let's Encrypt, f) it receives a certificate from Let's Encrypt, creates a secret named according to what is declared in the ingress, and updates the k8s certificate resource (a custom resource maintained by cert-manager, just like the issuers) with status about when to renew it.

Debugging ideas (some example commands are sketched after this list):

  • logs of the cert-manager controller
  • describe on the certificates created by the controller
  • get on secrets, temporary ingress resources, and pods, to see if they are created by the cert-manager controller during the challenge
  • inspection of cert-manager's helm chart configuration, especially its ingress shim section, where defaults can be provided about what kind of challenge and issuer to use etc.
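
(A hedged sketch of the commands behind those ideas; the cert-manager namespace and deployment name are assumptions that may differ in this deployment, and <certificate-name> is a placeholder.)

$ kubectl logs deploy/cert-manager -n cert-manager
$ kubectl get certificate,ingress,secret,pod -n turing
$ kubectl describe certificate <certificate-name> -n turing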

Question:

  • the hub.something domain got a certificate; did it get it from cert-manager? Then there should be a certificate resource, by the way. I doubt it, because that ingress had no configuration through annotations about which issuer to use, and the cert-manager helm chart isn't configured with a default issuer yet.

Terminology:

  • "issuer" is a cert-manager concept referring to something that can provide a certificate and speak the ACME protocol while doing so. As there are more options than Let's Encrypt, cert-manager is agnostic about the issuer and makes you actively specify one, or configure one as the default.

config/turing.yaml (resolved review thread)
.travis.yml (resolved review thread)
@sgibson91
Member Author

sgibson91 commented Jan 7, 2020

The certificate kubelego-tls-binder-turing keeps being created and I have no idea where in the config that name is being set. When I do a search on the repo, it comes back with nothing. I'm expecting to find the certificate turing-binder-tls-crt but that does not show up from kubectl get certificates.
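
(A hedged guess at how to track the name down: it may be assembled inside a chart template rather than written out literally, so grepping the rendered output instead of the repo might show where it comes from.)

$ helm template mybinder -f config/turing.yaml | grep -n 'kubelego-tls'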

EDIT: Fixed this! Needed to set binderhub.ingress.https.enabled = false.

@choldgraf
Member

@sgibson91 - can you think of a place in the docs that was misleading, or where this information could have been placed? :-)

@sgibson91
Member Author

I should have followed these docs 🤦‍♀️ https://binderhub.readthedocs.io/en/latest/https.html#adjust-binderhub-config-to-serve-via-https

The problem was this if statement in the ingress template overwriting my given secret name, because I hadn't changed the https type to nginx.

@sgibson91
Member Author

Just upgraded the config to include:

https:
  enabled: true
  type: nginx

and the certificate name is still overwritten from turing-binder-tls-crt to kubelego-tls-binder-turing, so there must be a bug in that if statement I linked to, as I expected the certificate to keep the name I assigned to it.

However, LET'S ENCRYPT IS NOW WORKING!!!!!! 🎉 🎉 🎉

@sgibson91
Member Author

sgibson91 commented Jan 7, 2020

Remaining TODOs:

@sgibson91 changed the title from [WIP] Adding Turing cluster to the Federation to [MRG] Adding Turing cluster to the Federation on Jan 7, 2020
@sgibson91
Member Author

sgibson91 commented Jan 9, 2020

CNAME has been propagated! I recommend merging on Monday as I'm running a workshop today/tomorrow and won't be available to put out fires. I don't think the grafana password is a blocker to merging.
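
(For reference, propagation can be double-checked with a quick DNS lookup:)

$ dig +short turing.mybinder.org CNAME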

@choldgraf
Member

Hey all - I'll merge this tomorrow morning if @sgibson91 agrees that's a good idea!

@sgibson91
Member Author

@choldgraf yes I definitely want to merge! Can we check I've done the "right thing" with the grafana password first please?

@choldgraf
Member

Hmmm - I am out of the loop on that topic so I don't think I'll be helpful in figuring it out. What is the way we can check this?

@betatim
Member

betatim commented Jan 15, 2020

I'll take a look right now. I have 24 minutes uninterrupted till my train arrives :)

@betatim merged commit b21325b into master Jan 15, 2020
@betatim deleted the sgibson91/add-turing-to-federation branch January 15, 2020 17:11
@betatim
Member

betatim commented Jan 15, 2020

This uses the same password for the Turing grafana as for the GKE one. I think there is no reason why they have to be the same, so in a future PR we could change the Turing one. We could even uninstall grafana from the cluster, as we've worked out how to connect the different prometheus instances to the grafana that runs on GKE. This means we only have to maintain one set of dashboards.

@choldgraf
Member

turing.mybinder.org works! yahoo! :-)

@sgibson91 wanna make a PR for the docs here: https://binderhub.readthedocs.io/en/latest/federation/federation.html ?

@sgibson91
Member Author

@choldgraf will put it on my todo for tomorrow!

@matthewfeickert
Contributor

@sgibson91 I thought I'd ask you here first before opening an Issue, but I noticed today that https://turing.mybinder.org/ can't be reached and the mybinder Grafana shows no recent activity. This isn't a big deal at the moment, but I was just curious if this was intended.

@sgibson91
Member Author

Yes, Turing's cluster upgraded its version of k8s and is no longer compatible with the nginx ingress chart version. Since fixing this will require downtime for mybinder.org, we're waiting til after the conference run to do it. See #1485. Thanks for checking!

@matthewfeickert
Contributor

Since fixing this will require downtime for mybinder.org, we're waiting til after the conference run to do it.

Thanks very much! Sorry I missed the issue while searching, but I appreciate you pointing me to the right place. :)
