Turing joining the Binder Federation: Part 2! #1154

Closed
7 tasks done
sgibson91 opened this issue Sep 10, 2019 · 58 comments

@sgibson91
Member

sgibson91 commented Sep 10, 2019

The proposal we wrote in #1124 was accepted! We now have an Azure subscription with $10k to deploy a cluster on 🎉 This issue documents the next steps we'll be taking.

TODOs

I'm going to try and keep the naming conventions similar between the Azure and GKE clusters where possible.

Open Questions

  • How to install the helm chart? Clone this repo, create a turing.yaml file and use that during the helm install?
  • Who organises the subdomain of mybinder.org? I'd recommend turing.mybinder.org.

I'll keep this updated as more things occur to me 😄

cc: @KirstieJane

@sgibson91 sgibson91 self-assigned this Sep 10, 2019
@betatim
Member

betatim commented Sep 10, 2019

For the subdomain: create a new issue in the team-compass repo (like jupyterhub/team-compass#203, we don't have a template/procedure for this yet :-/). To actually execute the change we need Chris or Min. With the issue we can create a paper trail and officially decide to add the subdomain.

You will also need a domain for the jupyterhub, do you want that as hub.turing.mybinder.org (GKE style) or will the Turing hub have its own domain (OVH style)? Something to discuss in the subdomain issue.

Deployment: mirroring the OVH setup would be the way I'd go. So a new turing.yml plus some secrets and maybe some azure specific additions to deploy.py. Can you create an account on KeyBase (and verify it with at least your GitHub account) then I can send you the keys for the secret content.

@sgibson91
Member Author

sgibson91 commented Sep 10, 2019

You will also need a domain for the jupyterhub, do you want that as hub.turing.mybinder.org (GKE style) or will the Turing hub have its own domain (OVH style)? Something to discuss in the subdomain issue.

Cool, will open the issue. I think hub.turing.mybinder.org will be fine.

Done in jupyterhub/team-compass#205

Can you create an account on KeyBase (and verify it with at least your GitHub account) then I can send you the keys for the secret content.

Sure, I'll try and do that at some point today.

@betatim
Member

betatim commented Sep 10, 2019

If you need a domain to test the setup with before we have the "final" details for the cluster, let me know and I can assign a throwaway subdomain from wtte.ch. It can be convenient to have a domain that can be updated more quickly than mybinder.org, which requires someone in a different timezone. Or you register your own domain to host throwaway stuff :D

@sgibson91
Member Author

Keybase account created and verified with GitHub 👍

@sgibson91
Member Author

sgibson91 commented Sep 30, 2019

Update:

  • I've created a resource group called binder-prod (equivalent to a GKE project). The location of this group (and hence all resources within it) is westeurope.
  • I've deployed an ACR called turingmybinderregistry for image storage
  • I've asked Turing IT for a Service Principal to deploy the k8s cluster and requested that it has AcrPush rights

@sgibson91
Member Author

Service Principal received! Will deploy the cluster soon.

@betatim
Member

betatim commented Oct 2, 2019

What is a resource group? Is it an Azure name for a Kubernetes concept (namespaces)? Or an Azure cloud thing?

Completely selfish suggestion: do you have time for a tour of (very!) basic Azure stuff during the team meeting? I'd reciprocate with a tour of the Google cloud UI, buttons and CLI commands.

@manics
Member

manics commented Oct 2, 2019

It's an Azure cloud thing: a way of grouping resources (compute, storage, network, etc).

@sgibson91
Member Author

sgibson91 commented Oct 2, 2019

Yes, a Resource Group is just a label. Computationally it means nothing, but it allows you to group together resources that are related. (Here, "related" means that I, as a human, know that these things are being used for the same conceptual project.)

Yes, I'm happy to give a tour during the team meeting, we could maybe do a specific zoom call for this too so there's more time for questions?

@KirstieJane

(Just a small note to say 😻 😻 😻)

Should we add notes about comms etc to this issue? Or keep this technical and make a new issue to drum up lots of excitement 😉 ?

@sgibson91
Member Author

Thank you! 💖 I think keep this one technical and a second one for comms 😄

@betatim
Member

betatim commented Oct 2, 2019

A new meeting would be nice but also tricky because we'd have to find a timeslot for it.

Depending on how much is on the agenda for the next meeting I'd be happy to spend 20-30min of the meeting to listen and ask a few questions. When I wrote my earlier comment I was thinking of watching you set up a Kubernetes cluster, install something on it, look at the logs, do something else, done. Something to take away the feeling of "oh wow, so many buttons and it all has different names to Google cloud. Ok maybe I need to block off a few hours to just figure out where I am."


New issue for comms sounds good.

@manics
Member

manics commented Oct 2, 2019

It's a lot of work and requires self-confidence, but if you're up for it you could record a screencast on your own and upload it to e.g. youtube? Could also be linked from the docs.

@sgibson91
Member Author

sgibson91 commented Oct 2, 2019

We could have a 1-to-1 zoom call if you wanted, I may also be able to do a screencast at some point. But tbh, my usual Azure workflow is having the Azure CLI installed locally and running stuff from my terminal. Deploying the k8s cluster will be very similar to the JupyterHub docs, but I'll probably do it with autoscaling. There's also the docs I (try to) keep updated in the hub23-deploy repo.

I spend more time looking at kubectl logs than I do anything on the Portal.

@betatim
Member

betatim commented Oct 2, 2019

Ok, that is already super useful ("using CLI most of the time, hardly ever click"). Let's see how we are doing for time at the meeting and if there is interest but I'd be happy to show people around https://console.cloud.google.com/.

Back to discussing "Turing joins the federation" :D

@sgibson91
Member Author

Last comment on this topic is that I added it to the agenda for the meeting 🎉

Back to the proper topic! Turing is switching its subscription backend (which is more important if you're interested in billing than interacting with resources), so I think I will migrate the subscription before deploying the cluster. It's quite a lengthy process - it took 6 hours to migrate a single VM 😱 - so doing that before we get a load of resources set up will probably be easier.

@sgibson91
Member Author

I finally managed to deploy a cluster! I'm going to do some tests with a basic BinderHub set-up before I properly integrate it. Lots of stuff has been migrating on Azure recently so I want to check it all still works.

@sgibson91
Member Author

sgibson91 commented Oct 15, 2019

@sgibson91
Member Author

Where I'm currently at with the Turing federation cluster.

Running deploy.py turing turing locally produces this helm chart templating error:

Error: render error in "mybinder/templates/matomo/secret.yaml": template: mybinder/templates/matomo/secret.yaml:11:61: executing "mybinder/templates/matomo/secret.yaml" at <b64enc>: wrong number of args for b64enc: want 1 got 0

Which means it's looking for:

matomo:
  db:
    serviceAccountKey:

in secrets/config/turing.yaml.

What is that and how do I get one?
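
For anyone hitting the same templating error, a rough way to check a values file before running helm is to walk the key path the template expects. This is only a sketch: the `values` dict stands in for the parsed contents of secrets/config/turing.yaml, and `find_missing_key` is a hypothetical helper, not part of deploy.py.

```python
def find_missing_key(values, path):
    """Return the first dotted prefix of `path` that is absent from `values`, or None."""
    current = values
    walked = []
    for key in path.split("."):
        walked.append(key)
        if not isinstance(current, dict) or key not in current:
            return ".".join(walked)
        current = current[key]
    return None


# Stand-in for the parsed secrets file: the db section exists but has no key in it.
values = {"matomo": {"db": {}}}
missing = find_missing_key(values, "matomo.db.serviceAccountKey")
print(missing)
```

The helper only reports whether the key path exists; it says nothing about what value the chart accepts there.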

@sgibson91
Member Author

For that matter, how come we have matomo as a top level key but it's not listed in the chart requirements? Where does this dependency come from?

@choldgraf
Member

Hmmm, I believe that Matomo was planned to be used instead of Google Analytics (maybe @yuvipanda set it up?) but I don't believe we are actively deploying it...somebody correct me if I'm wrong!

@yuvipanda
Contributor

It comes from https://github.com/jupyterhub/mybinder.org-deploy/tree/master/mybinder/templates/matomo. Along with all the custom stuff in https://github.com/jupyterhub/mybinder.org-deploy/tree/master/mybinder/templates.

We do have it deployed (https://mybinder.org/matomo/index.php) and collecting data. I was hoping to remove Google Analytics to give our users more privacy (See #725 for more info). I'm not super involved anymore, so I understand if folks wanna remove it and keep a hard dependency on Google Analytics instead.

@sgibson91
Member Author

Thanks everyone! I don't mind if we keep it or scrap it, but I need to know how to set it up for the Turing cluster so I can remove it as a blocker. I'm going to try generating an auth_token here and see if that's enough.

@sgibson91
Member Author

sgibson91 commented Nov 26, 2019

One thing that seems to work is just leaving the serviceAccountKey field for matomo blank.

I'm now very close to having BinderHub installed on the Turing cluster, except deploy.py keeps timing out during the helm upgrade --install command 😫 (related issue: helm/charts#11904). So I may try just running the commands in deploy.py manually and doing helm install instead.

@sgibson91
Member Author

Actually, all the pods are running except for the binder pod itself. kubectl describe output below the fold - basically having problems mounting volumes.

Binder pod
Name:           binder-8478b6b6c5-x8n45
Namespace:      turing
Priority:       0
Node:           aks-default-14930255-vmss000000/10.240.0.4
Start Time:     Tue, 26 Nov 2019 14:06:35 +0000
Labels:         app=binder
                component=binder
                heritage=Tiller
                name=binder
                pod-template-hash=8478b6b6c5
                release=turing
Annotations:    checksum/config-map: 3b98386cb77627ae3a7d9990babb531d2f458ca96bc0bf260982d33d4ed09058
                checksum/secret: c1c9e90aae368e4904d41f8208532e5a76fefa4bd265245f618fc79f8653ba39
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/binder-8478b6b6c5
Containers:
  binder:
    Container ID:
    Image:          jupyterhub/k8s-binderhub:0.1.0-456.7e32ac0
    Image ID:
    Port:           8585/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     250m
      memory:  1Gi
    Liveness:  http-get http://:binder/about delay=10s timeout=10s period=5s #success=1 #failure=3
    Environment:
      BUILD_NAMESPACE:                 turing (v1:metadata.namespace)
      JUPYTERHUB_API_TOKEN:            <set to the key 'binder.hub-token' in secret 'binder-secret'>  Optional: false
      GOOGLE_APPLICATION_CREDENTIALS:  /secrets/service-account.json
    Mounts:
      /etc/binderhub/config/ from config (rw)
      /etc/binderhub/secret/ from secret-config (rw)
      /root/.docker from docker-secret (ro)
      /secrets from secrets (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from binderhub-token-zps84 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      binder-config
    Optional:  false
  secret-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  binder-secret
    Optional:    false
  docker-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  binder-push-secret
    Optional:    false
  secrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  events-archiver-secrets
    Optional:    false
  binderhub-token-zps84:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  binderhub-token-zps84
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age                   From                                      Message
  ----     ------       ----                  ----                                      -------
  Normal   Scheduled    53m                   default-scheduler                         Successfully assigned turing/binder-8478b6b6c5-x8n45 to aks-default-14930255-vmss000000
  Warning  FailedMount  8m56s (x20 over 51m)  kubelet, aks-default-14930255-vmss000000  Unable to mount volumes for pod "binder-8478b6b6c5-x8n45_turing(f49d4d3e-1055-11ea-a113-4eb282213ae6)": timeout expired waiting for volumes to attach or mount for pod "turing"/"binder-8478b6b6c5-x8n45". list of unmounted volumes=[secrets]. list of unattached volumes=[config secret-config docker-secret secrets binderhub-token-zps84]
  Warning  FailedMount  3m (x33 over 53m)     kubelet, aks-default-14930255-vmss000000  MountVolume.SetUp failed for volume "secrets" : secret "events-archiver-secrets" not found
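
The events at the bottom carry the actual cause. As a rough sketch (plain pattern matching on `kubectl describe` text, with `missing_secrets` a hypothetical helper), the names of secrets that could not be found can be pulled out like this:

```python
import re

# Matches the kubelet message: secret "<name>" not found
MISSING_SECRET = re.compile(r'secret "([^"]+)" not found')


def missing_secrets(describe_output):
    """Return the sorted, de-duplicated names of secrets reported as not found."""
    return sorted(set(MISSING_SECRET.findall(describe_output)))


# Sample line copied from the Events output above.
events = '''
Warning  FailedMount  3m (x33 over 53m)  kubelet, aks-default-14930255-vmss000000  MountVolume.SetUp failed for volume "secrets" : secret "events-archiver-secrets" not found
'''
print(missing_secrets(events))
```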

@yuvipanda
Contributor

I think if we have matomo, we can just run it on the main cluster instead of doing that per cluster. Similar to our analytics stuff. How does that feel?

@sgibson91
Member Author

@yuvipanda This sounds perfect! I do think we need to have a refactor of the configs (as per the discussion here) so that GKE-specific stuff doesn't present a blocker to other new federation members. I'd like someone who's a bit more familiar with what's what in all the various yaml files to help me on that though. So I don't break anything! 😂

@betatim
Member

betatim commented Dec 10, 2019

The plan looks good. Agree that we want to keep the domains separate. I'd get the PR merged and cluster running, then slowly step up the quota and see what happens. For this we need a working grafana that shows the launch success rate. Do you have the admin PW for grafana.mybinder.org? Then we could add the turing prometheus as a datasource there and get all the panels for free.

The thing I'd look out for is errors related to the container registry as the traffic increases.

@manics
Member

manics commented Dec 10, 2019

What version of BinderHub is running on https://turing.mybinder.org/? It doesn't look like the latest.

@manics
Member

manics commented Dec 10, 2019

Looks like outbound egress isn't restricted to these ports:

networkPolicy:
  enabled: true
  egress:
    tcpPorts:
      - 80 # http
      - 443 # https
      - 9418 # git
      - 873 # rsync
      - 1094 # xroot
      - 1095 # xroot
    cidr: 0.0.0.0/0

@betatim
Member

betatim commented Dec 11, 2019

#1203 is the PR with config from which turing is deployed (manually).

@sgibson91
Member Author

Looks like outbound egress isn't restricted to these ports:

networkPolicy:
  enabled: true
  egress:
    tcpPorts:
      - 80 # http
      - 443 # https
      - 9418 # git
      - 873 # rsync
      - 1094 # xroot
      - 1095 # xroot
    cidr: 0.0.0.0/0

@manics I think this deserves its own issue as that wasn't part of the config that I edited and is, therefore, perhaps a problem across all clusters?

@sgibson91
Member Author

sgibson91 commented Dec 11, 2019

What version of BinderHub is running on https://turing.mybinder.org/? It doesn't look like the latest.

I'm just attempting git pull master && cd mybinder && helm dep up && cd .. && python deploy.py turing turing, but once again, helm is not playing nicely! (I wish it gave more useful error messages :( )

$ python deploy.py turing turing
The behavior of this command has been altered by the following extension: aks-preview
Merged "turing" as current context in ~/.kube/config

$HELM_HOME has been configured at ~/.helm.

Tiller (the Helm server-side component) has been upgraded to the current version.
Happy Helming!
deployment "tiller-deploy" successfully rolled out
Updating network-bans for turing
Starting helm upgrade for turing
Error: UPGRADE FAILED: a released named turing is in use, cannot re-use a name that is still in use
Traceback (most recent call last):
  File "deploy.py", line 233, in <module>
    main()
  File "deploy.py", line 227, in main
    deploy(args.release, "turing")
  File "deploy.py", line 176, in deploy
    subprocess.check_call(helm)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--namespace', 'turing', 'turing', 'mybinder', '--force', '--wait', '--timeout', '600', '-f', 'config/turing.yaml', '-f', 'secrets/config/common.yaml', '-f', 'secrets/config/turing.yaml']' returned non-zero exit status 1.
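
For running the step by hand, the failing command can be read straight off the traceback. The sketch below rebuilds that argument list; `build_helm_upgrade` is a hypothetical helper written for illustration and is not the code in deploy.py, and the flags are only those visible in the `CalledProcessError` above.

```python
def build_helm_upgrade(release, namespace, chart="mybinder", timeout=600):
    """Rebuild the `helm upgrade --install` argument list seen in the traceback."""
    config_files = [
        f"config/{release}.yaml",
        "secrets/config/common.yaml",
        f"secrets/config/{release}.yaml",
    ]
    cmd = [
        "helm", "upgrade", "--install",
        "--namespace", namespace,
        release, chart,
        "--force", "--wait",
        "--timeout", str(timeout),
    ]
    # Each values file is passed with its own -f flag, in the order shown above.
    for path in config_files:
        cmd.extend(["-f", path])
    return cmd


cmd = build_helm_upgrade("turing", "turing")
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.check_call`, which is what the traceback shows deploy.py doing.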

@manics
Member

manics commented Dec 11, 2019

The NetworkPolicy was added last year #699 so it should be included.

@sgibson91
Member Author

sgibson91 commented Dec 11, 2019

Exactly, if it was added last year and is not effective then that's a separate issue to me incorporating the Turing into the federation? Or are you saying that they're not working on the Turing cluster but are on others? (I've just managed to set the Turing cluster on fire so can't test this right now.)

@sgibson91
Member Author

Do you have the admin PW for grafana.mybinder.org?

@betatim no I've never set up grafana before, how do I go about retrieving it?

@sgibson91
Member Author

sgibson91 commented Dec 11, 2019

Current bugs:

  • testhub.hub23.turing.ac.uk has fake certificates whereas testbinder.hub23.turing.ac.uk has real ones, pretty sure I'm using letsencrypt-prod cluster issuer in both cases so I have no idea what's going on there
    • @consideRatio do you have any advice here? How do I find out if I was banned from Let's Encrypt?
  • grafana pods are complaining about shared volume mounts, they can take a really long time to finally initialise (sometimes I have to manually delete them) - see pod description
  • network policy: the egress ports aren't actually restricted (confirmed on gitter that this affects only the turing cluster)

@sgibson91
Member Author

  • testhub.hub23.turing.ac.uk has fake certificates whereas testbinder.hub23.turing.ac.uk has real ones, pretty sure I'm using letsencrypt-prod cluster issuer in both cases so I have no idea what's going on there
    • @consideRatio do you have any advice here? How do I find out if I was banned from Let's Encrypt?

I solved this by re-deploying with new A records and new secrets.

  • grafana pods are complaining about shared volume mounts, they can take a really long time to finally initialise (sometimes I have to manually delete them) - see pod description

I'm not sure if this is happening because the WiFi at the Turing is terrible this week (we're running a data study group and have a lot of people here using interwebs). It seems pretty variable as to whether the grafana pods switching over causes deploy.py to time out or not. I might try tonight on my own connection.

  • network policy: the egress ports aren't actually restricted (confirmed on gitter that this affects only the turing cluster)

@manics The hub is now at newhub.hub23.turing.ac.uk and the certificates should now be real. Can we check again if this is still an issue? If so, what do we need to do to solve this?

@manics
Member

manics commented Dec 12, 2019

I can still ssh out of https://newbinder.hub23.turing.ac.uk/

kubectl -n NAMESPACE describe netpol should list the currently deployed network policies
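
One way to check from inside a launched pod whether egress on a given port is really blocked is to attempt a TCP connection with a short timeout. A minimal sketch, assuming Python is available in the user image; `can_connect` is a hypothetical helper and the host you probe is up to you:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

If the egress policy were enforced as intended, probing port 22 on an external host should return False while port 443 should return True.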

@sgibson91
Member Author

@manics
Member

manics commented Dec 12, 2019

Looks like the policies are created, next thing is to verify that the cluster implements them.
https://docs.microsoft.com/en-us/azure/aks/use-network-policies#create-an-aks-cluster-and-enable-network-policy suggests it's optional, is there anything in Azure that tells you whether they're active on your cluster?

@sgibson91
Member Author

I will try and get hold of @trallard today

@sgibson91
Member Author

image

My guess would be this is where we can edit the network policies - but annoying that it's not automatically applied.

@manics
Member

manics commented Dec 13, 2019

I think those are security rules which are independent from the K8s rules. It's the equivalent of a "physical" firewall operating at the network level. Then the K8s network policies are in addition to these, and they're implemented at the software level inside each Kubernetes VM. Either can be used to restrict network traffic, but obviously only the K8s network policies will be managed through the helm chart deployment.

@sgibson91
Member Author

I think this IP address is now blocked as expected: 51.124.8.42

I may have to tear the Turing cluster down and redeploy with a vnet.

@sgibson91
Member Author

Redeployed cluster with a virtual network to solve the unrestricted pod issue. Currently re-installing BinderHub.

@sgibson91
Member Author

Meant to post this comment here: #1203 (comment). Currently having cert-manager issues: I get a certificate for the hub but not the binder page.

@sgibson91
Member Author

sgibson91 commented Jan 7, 2020

Using debug commands found here, I learned that the cluster has no challenge resources.

cert-manager issue could be related to this issue: cert-manager/cert-manager#1745

Output of cert-manager logs shows following message:

E0107 09:59:44.175449       1 controller.go:193] cert-manager/controller/challenges "msg"="challenge in work queue no longer exists" "error"="challenge.acme.cert-manager.io \"kubelego-tls-binder-turing-4161325376-3014481631-2909509088\" not found" 

I can't find anything called kubelego-tls-binder-turing in the config (doing a repo wide search) so I don't know where that secret name has come from. I deleted it from the k8s namespace and redeployed.

@sgibson91
Member Author

Fixed the naming issue. Still no luck with Let's Encrypt though. Currently have a solver issue.

@sgibson91
Member Author

Got Let's Encrypt working!!!! 🎉 🎉 🎉

@sgibson91
Member Author

Clarifying what I need to do for grafana:

  • password stored in secrets/config/prod.yaml under grafana.adminPassword
  • add turing prometheus as a data source
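
The second step can also be done through Grafana's HTTP API rather than the UI. Sketch only: the Prometheus URL below is a placeholder, since the thread doesn't give the real address, and `prometheus_datasource` is a hypothetical helper. The payload would be POSTed to `/api/datasources` on the Grafana host with admin credentials.

```python
import json


def prometheus_datasource(name, url):
    """Build the JSON body for registering a Prometheus data source in Grafana."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # Grafana's backend makes the requests, not the browser
    }


# Placeholder URL: substitute the real address of the Turing Prometheus.
payload = prometheus_datasource("turing", "https://prometheus.example.org")
print(json.dumps(payload, indent=2))
```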

@sgibson91
Member Author

MERGED!


6 participants