Manually create a cluster (no terraform) #5355
@jmunroe, I believe you mentioned that there were some docs about this from the Project Pythia side 🤔? Could you please share those here so we can have them as a reference when we start working on this on Monday?
I interpreted James' comment as "Project Pythia followed JetStream's docs and they got through it fine", just to give a different interpretation. May be wrong though!
@sgibson91 makes sense. My understanding was that they experimented with it and documented the whole process 🤷♀
My assumption is that these are the correct starting points:
There have been previous uses of kubernetes on JetStream2, such as kubespray (docs), but I had understood those to be fairly manual, non-scalable ways of deploying a cluster. What I have been asking of the JetStream2 team is a 'managed kubernetes service' that we can build on top of. I think OpenStack Magnum and ClusterAPI are some of the enabling technologies used by the JetStream2 team, but I am not entirely up to speed on the details. My primary contact at JetStream has been Julian Pistorius ([email protected]). Julian is already on our 2i2c slack.
Current state

@sgibson91 and I started deploying a cluster today to the new allocation that @jmunroe created for us. The process currently fails with CREATE_FAILED: …

While investigating this, we realized that we don't have access to run commands such as …
What we've tried

We tried to create new application credentials that would be more permissive and grant them the role we need. The only roles available to us are the ones shown in the screenshot below (image not preserved). From the blog post we are following, it looks like the role we would need is not among them.
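For illustration, the kind of thing we were attempting looks roughly like this; the role name, credential name, and user/project are placeholders, since the exact values were lost from this thread:

```bash
# Hypothetical sketch: check which roles our user can actually grant, then try to
# create a more permissive application credential (names and role are placeholders).
openstack role assignment list --user <our-user> --project <our-project> --names
openstack application credential create 2i2c-k8s-deployer \
  --role member \
  --unrestricted
```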
I've emailed Julian to seek additional guidance (see https://2i2c.freshdesk.com/a/tickets/2690).

Potentially relevant Jetstream2 issues: …
Thank you @jmunroe! Nvm, we still need permissions to the identity endpoint!

Update: @sgibson91 and I opened a ticket on JetStream2 at https://jetstream-cloud.org/contact/index.html asking for guidance about this permission error. Confirmation email: https://2i2c.freshdesk.com/a/tickets/2691
I've had success creating a Kubernetes cluster using Magnum following Andrea Zonca's blog post. When I created the application credentials I did select the 'unrestricted (dangerous)' option. I still can't run all openstack commands (like …). I did need to be patient though: it took 117 minutes for the cluster to be created, while in Zonca's post he timed it at 9 minutes. Perhaps now that images have been copied over it will run faster? I think we should increase the quota on the number of Volumes that openstack allows; I'll submit that support ticket to the Jetstream2 team. @GeorgianaElena please let me know if you'd like to meet tomorrow so we can make sure you are able to do what appears to work for me. I'll grab an early morning slot on your calendar.
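For reference, the Magnum flow from Zonca's post looks roughly like the following; the template name, cluster name, and node counts are assumptions rather than what was actually used above:

```bash
# Pick an existing cluster template, then create a cluster from it
# (template and cluster names are placeholders).
openstack coe cluster template list
openstack coe cluster create test-cluster \
  --cluster-template <kubernetes-template> \
  --master-count 1 \
  --node-count 2
# Poll until the status reaches CREATE_COMPLETE (this took ~117 minutes above)
openstack coe cluster list
```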
I requested an increase from 10 Volumes to 30 Volumes through the JS2 help desk.

Quota for Volumes is now set to 30. I deleted my test cluster and started a new attempt. My hope is that it will be faster than 117 minutes this time, but it is currently at 25 minutes and still pending. I'll report back how long it actually takes.

I'm happy to see that the JS2 support desk was able to respond and take action on my request in <60 minutes!
Unfortunately, attempt 2 was not successful. The cluster creation was stuck in a 'CREATE_IN_PROGRESS' state. It appears that a control plane node gets created, persists for about 60 minutes, is killed, then is recreated. Behind the scenes, my understanding of what is actually happening is still incomplete.

I made a few attempts at adding a private ssh keypair and creating security groups so I could log in to the control plane to poke around and try to find some logs, but I don't think I was setting up the openstack networking correctly. I had a floating IP assigned and the SSH port was open, but no luck.
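The debugging steps described above would look something like this; these are not the exact commands that were run (those were lost from the thread), and the names, network, and login user are guesses:

```bash
# Add an SSH keypair, open port 22, and attach a floating IP to the control plane
# node (server, security group, and network names are placeholders).
openstack keypair create --public-key ~/.ssh/id_ed25519.pub debug-key
openstack security group rule create --protocol tcp --dst-port 22 \
  --remote-ip 0.0.0.0/0 <cluster-security-group>
openstack floating ip create public
openstack server add floating ip <control-plane-server> <floating-ip>
ssh core@<floating-ip>   # login user depends on the node image, so this is a guess
```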
A key blog post for understanding what has actually been set up on Jetstream2 is https://stackhpc.com/magnum-clusterapi.html. Importantly, references to 'Heat templates' describe the older way Magnum was used to deploy Kubernetes.

Magnum Cluster API Helm Driver docs

The newer way (which will be easier to maintain) is to use a Magnum driver for ClusterAPI. One approach is to deploy a 'management cluster' with ClusterAPI (somewhere; it doesn't actually need to be on Jetstream2 itself) and then use that management cluster to launch new workload kubernetes clusters. My big assumption is that this management cluster is something we as users of Jetstream2 don't actually have to deploy ourselves but is somehow provided for us.
See notes for a summary of lessons learned from this spike around deploying Kubernetes clusters on Jetstream2 using Magnum with the ClusterAPI driver. Before moving on to deploying JupyterHub, I think we can identify the following 'next steps' that we need to resolve: …
I've been pleased with the progress we've made on this iteration. I especially appreciate the mental model of Jetstream2/Openstack/Magnum/ClusterAPI as components we are building up. I think we should close this current issue, identify the specific blockers, and add those to the next iteration to resolve.

Possible next tasks: consistently deploy a kubernetes cluster with `openstack coe cluster create`.
@jmunroe, I have played around today with the cluster that magically appeared as ready over the weekend. I've put my findings at https://hackmd.io/DzgY3PW4TUOMAmpqvgOdZw?view#Experimenting-with-a-running-healthy-cluster. But to summarize:

SSH key pair access to Kubernetes control plane nodes

This worked (kind of). If I had created the cluster with the … So the fact that this doesn't work on control plane nodes that are part of clusters stuck in CREATE_IN_PROGRESS …

kubectl access

I still couldn't get access into the cluster with kubectl. For that matter, pinging the public IP of the cluster fails. There are firewall rules permitting access, but it still doesn't work. I saw something weird while investigating the network setup, which is that the loadbalancers that get created with the cluster appear as …
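For context, the load balancer and firewall state mentioned above can be inspected with commands along these lines (a sketch; the exact commands and object names used are not recorded here):

```bash
# Inspect the Octavia loadbalancers and security group rules created for the cluster
# (IDs and names are placeholders).
openstack loadbalancer list
openstack loadbalancer show <loadbalancer-id>
openstack security group rule list <cluster-security-group>
ping <cluster-api-floating-ip>
```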
I think we can infer that the first control plane node is being created but is not coming up fully (which is why it is killed and restarted after 60 minutes). I think this is what the … I wonder if I could just spin up an instance of the image … directly.
No great insights. It appears starting a single instance of …
Hello @jmunroe, @GeorgianaElena, and @sgibson91. I'm catching up with this issue and the related ACCESS support ticket. It looks like you have a good grasp of the process for creating clusters, and have had some success.
This assumption is indeed correct. There is a single Cluster API management cluster for all workload clusters created with Magnum/ClusterAPI.
Hello @GeorgianaElena & @jmunroe, as I wrote in the support ticket: …
Yep! That definitely fixed something: 10 minutes to get to CREATE_COMPLETE.
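As a rough way to time this, the cluster status can be polled with something like the following (cluster name is a placeholder):

```bash
# Re-check the Magnum cluster status every minute until it leaves CREATE_IN_PROGRESS
watch -n 60 "openstack coe cluster show test-cluster -c status -c health_status"
```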
@GeorgianaElena could you try creating some kubernetes clusters to confirm it works robustly?
My two clusters that had been creating since Monday have now completed. I managed to get the kubeconfig file and run kubectl against the cluster.
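A sketch of the standard Magnum way to fetch the kubeconfig and test access follows; the cluster name and paths are placeholders, not necessarily what was used here:

```bash
# Write the kubeconfig for the cluster into a local directory and point kubectl at it
mkdir -p ~/.kube/jetstream2
openstack coe cluster config test-cluster --dir ~/.kube/jetstream2
export KUBECONFIG=~/.kube/jetstream2/config
kubectl get nodes -o wide
kubectl get pods --all-namespaces
```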
I also verified that exporting the kubeconfig file and using it with the deployer also permits kubectl access: #5357 (comment)

@GeorgianaElena and I have verified that if we use the same Application Credentials then we can manage each other's kubernetes clusters. I've shared the credentials for the CIS250031 allocation with the 2i2c team through Bitwarden. I think that is "secure" but I'll take guidance if there is a better way of sharing those credentials between members of the engineering team. I think we already do something like this to allow us to use the …
I believe we can store the credentials encrypted just like we do with the actual kubeconfig and then tweak the deployer to use those credentials to authenticate before using the kubeconfig.
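A minimal sketch of what that could look like, assuming we reuse sops the way we do for other secrets; the file name, variable layout, and deployer integration are assumptions, not an agreed design:

```bash
# Hypothetical sops-encrypted env file holding the OpenStack application credential;
# the deployer would decrypt and source this before using the kubeconfig.
# (The auth URL should come from the allocation's openrc; this value is an assumption.)
cat > enc-jetstream2-openstack-creds.secret.env <<'EOF'
OS_AUTH_TYPE=v3applicationcredential
OS_AUTH_URL=https://js2.jetstream-cloud.org:5000/v3/
OS_APPLICATION_CREDENTIAL_ID=<id>
OS_APPLICATION_CREDENTIAL_SECRET=<secret>
EOF
sops --encrypt --in-place enc-jetstream2-openstack-creds.secret.env
```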
Yes, I created two clusters today which took about 10 min, so all good now 🎉. Thank you!
It appears that autoscaling and manual scaling of the kubernetes cluster are not working well together. If I deploy a cluster and omit the lines relating to autoscaling, then I can successfully upscale and downscale the kubernetes cluster manually using … (a sketch of the typical commands is below).
(Reference: zonca/zonca.dev#9 (comment))
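Manual scaling with Magnum is typically done with `openstack coe cluster resize`; a sketch (the cluster name and counts are placeholders, and this may not be the exact command referenced above):

```bash
# Manually scale the default worker nodegroup up and then back down
openstack coe cluster resize test-cluster 3
openstack coe cluster resize test-cluster 1
openstack coe cluster show test-cluster -c node_count -c status
```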
If it's one or the other, autoscaling is what needs to work. Basically, I want us to be able to trust that nodes will be able to come and go. I think a simple way to test autoscaling is to create a basic deployment object and try to increase the number of replicas it has; this should trigger new nodes. Then, if you reduce the replicas, it should clean up nodes. Repeat until you can see nodes come and go at least 3 times.
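A sketch of that test; the image, resource requests, and replica counts are arbitrary choices, not a prescribed procedure:

```bash
# Create a small deployment, give it real resource requests so extra replicas
# can't fit on the existing nodes, then scale it up and down and watch the nodes.
kubectl create deployment autoscale-test --image=nginx
kubectl set resources deployment autoscale-test --requests=cpu=1,memory=1Gi
kubectl scale deployment autoscale-test --replicas=20   # should trigger a scale-up
kubectl get nodes --watch                               # watch new nodes appear
kubectl scale deployment autoscale-test --replicas=1    # nodes should later be removed
```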
I've verified that autoscaling up and then back down has worked (following https://satishdotpatel.github.io/kubernetes-cluster-autoscaler-with-magnum-capi/):
Comment about openstack's nodegroups

I was watching the output of … during this experiment. I was expecting to see the node_count for the default-worker group go from 1 to 2 and then back down to 1; that was not observed.

My working assumption is that the fields … are not updated by the cluster autoscaler.
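The nodegroup state being watched here can be checked with something like the following; the cluster and nodegroup names are placeholders, and the exact command used above was not preserved:

```bash
# Show Magnum's view of the cluster's nodegroups and their node_count fields
openstack coe nodegroup list test-cluster
openstack coe nodegroup show test-cluster default-worker
```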
Yes, manual scaling & auto-scaling are mutually exclusive.

Documentation is now available at https://hackmd.io/DzgY3PW4TUOMAmpqvgOdZw?view#How-to-create-and-scale-a-cluster-on-Jetstream2.
Original issue description, for reference:

Resources to get us started: …

Definition of done:
- `openstack coe cluster create` 3 times in a row over 2 days
- `openstack` commands