[MRG] Adding Turing cluster to the Federation #1203
- Copy ovh.yaml to turing.yaml
- Remove refs to ovh throughout
- Leave blank the things I don't know yet 😄
- Configuring ingress for turing.mybinder.org
- May need replacing with test subdomain
- Took some guesses at the various hosts
We will also have to add the credentials to ... We should do a quick review of all the support services (matomo, gcs, redirector, etc.) to see which of those we need and which we should turn off for members of the federation.
@betatim @minrk @choldgraf (and anyone else!) Can I get an early review of this? I'm probably missing a whole load of stuff, so planning the next stages of the PR would be very useful for me. Thanks!
Thanks @manics! Co-Authored-By: Simon Li <[email protected]>
I'd start deploying this to see what happens. Two things to do before: temp domains and removing some of the top-level keys that were mentioned in the team-compass issue as "we probably don't need those". Before we do this we need to replace all the hostnames with nip.io/xip.io domains. You can do a test of the test by running
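As a rough illustration of the temporary-domain idea: nip.io resolves any name of the form `<anything>.<ip>.nip.io` to `<ip>`, so a test hostname can be derived from the ingress IP alone. The sketch below assumes that convention; the function name `nip_io_host` and the IP address (taken from a documentation-only range) are hypothetical and not part of this PR.

```python
import ipaddress


def nip_io_host(ip: str, subdomain: str = "") -> str:
    """Build a nip.io hostname that resolves to the given IPv4 address.

    Hypothetical helper for illustration; not part of the deployment code.
    """
    # Raises ValueError if `ip` is not a valid IPv4 address.
    ipaddress.IPv4Address(ip)
    base = f"{ip}.nip.io"
    return f"{subdomain}.{base}" if subdomain else base


# 203.0.113.10 is a placeholder from the TEST-NET-3 documentation range.
for name in ("binder", "hub"):
    print(nip_io_host("203.0.113.10", name))
```

The real ingress IP of the cluster would replace the placeholder address.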
Running
Running Console output:
Pod status:
Events in
Above may not be true, I might be getting confused with Hub23. Here are the labels the nodes do have:
Should I manually set
Did
Further info:
Above issue resolved: args to Now, we get a different issue:
This commit implements @minrk's suggestion of exposing a configMap as a variable. This (hopefully) circumvents the issue that the AKS cluster runs coredns whereas GKE runs kube-dns.
This is where I'm up to now. Not quite sure what the problem is here.
Switched helm version to 2.11.0, error in #1203 (comment) has now been replaced with the following.
Seems to be complaining about the
I'm kind of stuck in that Let's Encrypt issues a certificate for hub.mybinder.turing.ac.uk but not binder.mybinder.turing.ac.uk, and I'm not sure how to fix :/
I think @consideRatio is the most likely to know what's going on here; unfortunately for us, he's enjoying his vacation! Maybe @yuvipanda has some insight? Further info: #1154 (comment)
@sgibson91 I looked through the changes and didn't notice any configuration of cert-manager's Helm chart. If that isn't done, the ingress objects will need annotations beyond the tls-acme: true annotation for cert-manager to act on them; that annotation on its own just means "use the defaults, let's go!".
Things that can go wrong in certificate allocation, and how I debug them:
- Does the ingress object exist? If not, the problem isn't cert-manager.
- Does the secret named in the ingress configuration under tls: exist? If not...
- Does the cert-manager controller even attempt to get a certificate? Check its logs and check whether a Kubernetes certificate object has been created. If it has, then the annotations on the ingress resource were enough to trigger an attempt. Use
kubectl describe certificate ...
on it to see the status and events that summarize the work done by the cert-manager controller to get a TLS secret.
Typical issues:
- The controller doesn't attempt to get a certificate, because it lacks a valid default configuration or annotations on the ingress objects that override the default configuration.
- The cert-manager controller does attempt, but fails. As far as I know, its attempt works like this:
  a) It tells Let's Encrypt that it wants to prove it is in control of a domain and wants a certificate.
  b) Let's Encrypt challenges it, using the http01 kind of ACME challenge, to ensure that a request sent to the domain at a certain path is answered with a specific key.
  c) The cert-manager controller creates a pod to act as the web server that replies with the key.
  d) The cert-manager controller now needs to ensure traffic is directed to the pod, so it creates a new temporary k8s ingress resource to direct traffic there (unless it modifies the existing one, which is also a possibility).
  e) It satisfies the challenge if the pod receives the traffic and can respond back to Let's Encrypt.
  f) It receives a certificate from Let's Encrypt, creates a secret whose name is based on what's declared in the ingress, and records when to renew it in the status of the k8s certificate resource, which is a custom resource maintained by cert-manager, just like the issuers are.
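The checks and steps above can be read as a decision sequence: walk through what should exist at each stage and stop at the first thing that is missing. Here is a minimal sketch of that reasoning in Python. The `Observed` fields and the messages are hypothetical names chosen for illustration; the values would come from inspecting the cluster by hand, and nothing here calls cert-manager or Kubernetes.

```python
from dataclasses import dataclass


@dataclass
class Observed:
    """What was seen in the cluster (hypothetical fields, filled in by hand)."""
    ingress_exists: bool
    certificate_exists: bool
    solver_pod_exists: bool
    challenge_reachable: bool
    secret_exists: bool


def first_failing_step(o: Observed) -> str:
    """Return a hint for the earliest stage that appears to have failed."""
    if not o.ingress_exists:
        return "no ingress object: not a cert-manager problem"
    if not o.certificate_exists:
        return "no attempt made: check default issuer config or ingress annotations"
    if not o.solver_pod_exists:
        return "attempt started but no solver pod: check controller logs"
    if not o.challenge_reachable:
        return "solver pod not reachable: check temporary ingress and DNS"
    if not o.secret_exists:
        return "challenge passed but no secret: describe the certificate for events"
    return "certificate secret present"


print(first_failing_step(Observed(True, False, False, False, False)))
```

Ordering the checks from ingress to secret mirrors the order in which the resources are expected to appear.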
Debugging ideas:
- logs of the cert-manager controller
- describe on certificates created by the controller
- get on secrets, temporary ingress resources, and pods, to see if they are created by the cert-manager controller during the challenge
- inspection of cert-manager's Helm chart configuration, especially its ingress shim section, where defaults can be provided for the kind of challenge, the issuer to use, etc.
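For reference, the logs / describe / get ideas above might translate into kubectl invocations along these lines. This is only a sketch that builds and prints the command lines without running them. The namespace and deployment names (`cert-manager`, `turing`) are assumptions and would need to match the actual cluster.

```python
import shlex


def debug_commands(cm_namespace="cert-manager",
                   cm_deployment="cert-manager",
                   app_namespace="turing"):
    """Build kubectl command lines for the debugging ideas.

    All three default names are assumptions for illustration.
    """
    return [
        # Logs of the cert-manager controller
        ["kubectl", "logs", f"deployment/{cm_deployment}", "--namespace", cm_namespace],
        # Status and events of certificates created by the controller
        ["kubectl", "describe", "certificate", "--namespace", app_namespace],
        # Secrets, temporary ingress resources, and solver pods
        ["kubectl", "get", "secrets,ingress,pods", "--namespace", app_namespace],
    ]


for argv in debug_commands():
    print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed lines can be pasted into a shell that has access to the cluster.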
Question:
- The hub.something domain got a certificate; did it get it through cert-manager? If so, there should be a certificate resource. I doubt it, though, because that ingress had no annotation configuring which issuer to use, and the cert-manager Helm chart isn't configured with a default issuer yet.
Terminology:
- "issuer" is a cert-manager concept for something that can provide a certificate and speaks the ACME protocol while doing so. As there are more options than Let's Encrypt, cert-manager is agnostic to the issuer and makes sure you actively specify one or configure one as the default.
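To make the "specify one or rely on a default" distinction concrete, the sketch below builds the two kinds of ingress annotations as plain Python dicts. The issuer name `letsencrypt-prod` is a placeholder. The annotation prefix depends on the cert-manager version: `certmanager.k8s.io` in releases before 0.11 and `cert-manager.io` afterwards. Which one applies to this deployment is an assumption to verify.

```python
import json


def ingress_annotations(cluster_issuer=None, legacy_api_group=False):
    """Return ingress annotations that ask cert-manager for a certificate.

    With no issuer given, only the tls-acme annotation is set, which relies
    on defaults configured in cert-manager's ingress shim.
    """
    if cluster_issuer is None:
        return {"kubernetes.io/tls-acme": "true"}
    # Prefix changed between cert-manager releases (assumption: check yours).
    prefix = "certmanager.k8s.io" if legacy_api_group else "cert-manager.io"
    return {f"{prefix}/cluster-issuer": cluster_issuer}


# "letsencrypt-prod" is a hypothetical issuer name.
print(json.dumps(ingress_annotations(), indent=2))
print(json.dumps(ingress_annotations("letsencrypt-prod"), indent=2))
```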
The certificate EDIT: Fixed this! Needed to set
@sgibson91 - can you think of a place in the docs that was misleading, or where this information could have been placed? :-)
I should have followed these docs 🤦♀ https://binderhub.readthedocs.io/en/latest/https.html#adjust-binderhub-config-to-serve-via-https The problem was this if statement in the ingress template overwriting my given secret name because I hadn't changed the https type to nginx.
Just upgraded the config to include:
https:
  enabled: true
  type: nginx
and the certificate name is still overwritten from ... However, LET'S ENCRYPT IS NOW WORKING!!!!!! 🎉 🎉 🎉
Remaining TODOs:
CNAME has been propagated! I recommend merging on Monday as I'm running a workshop today/tomorrow and won't be available to put out fires.
Hey all - I'll merge this tomorrow morning if @sgibson91 agrees that's a good idea!
@choldgraf yes I definitely want to merge! Can we check I've done the "right thing" with the grafana password first please?
Hmmm - I am out-of-the-loop on that topic so I don't think I'll be helpful in figuring it out. What is the way that we can check this?
I'll take a look right now. I have 24 minutes uninterrupted till my train arrives :)
This uses the same password for the Turing grafana as for the GKE. I think there is no reason why they have to be the same, so in a future PR we could change the Turing one. We could even uninstall grafana from the cluster as we've found how to connect the different prometheus instances to the grafana that runs on GKE. This means we only have to maintain one set of dashboards.
turing.mybinder.org works! yahoo! :-) @sgibson91 wanna make a PR for the docs here: https://binderhub.readthedocs.io/en/latest/federation/federation.html ?
@choldgraf will put it on my todo for tomorrow!
@sgibson91 I thought I'd ask you here first before opening an Issue, but I noticed today that https://turing.mybinder.org/ can't be reached and the mybinder Grafana shows no recent activity. This isn't a big deal at the moment, but I was just curious if this was intended.
Yes, Turing's cluster upgraded its version of k8s and is no longer compatible with the nginx ingress chart version. Since fixing this will require downtime for mybinder.org, we're waiting til after the conference run to do it. See #1485. Thanks for checking!
Thanks very much! Sorry I missed the issue while searching, but I appreciate you pointing me to the right place. :)
Summary
This PR will add the Turing cluster to the Federation. I will try to keep this top comment updated with implemented changes.
Related issues: #1154, jupyterhub/team-compass#205, jupyterhub/team-compass#217
What's changed
- config/turing.yaml
- secrets/config/turing.yaml containing Turing secrets
- deploy.py and included secrets/turing-auth-key-prod.json
- deploy.py for turing cluster to .travis.yml
- cert-manager (instead of kube-lego), included cluster-issuer.yaml
TODOs before ready for review
- deploy.py
- deploy.py to .travis.yml
- turing.mybinder.org in config/turing.yaml and config/turing/turing_mybinder_org_ingress.yaml may need to be swapped for a test subdomain
TODOs for review
- config/turing.yaml
- secrets/turing.yaml
TODOs before merging