k8s maintenance: k8s node pool upgrades on EKS clusters #4009

Closed · 14 tasks done · Tracked by #3955
consideRatio opened this issue Apr 30, 2024 · 13 comments · Fixed by #4143

Comments

@consideRatio (Contributor) commented Apr 30, 2024

#4007 upgraded all control planes to 1.29, but we need to bring the node pools to 1.29 as well. Documentation on how to do this was updated in #4099 and is available at https://infrastructure.2i2c.org/howto/upgrade-cluster/aws/; step 4 there, upgrading the control plane, is already done.
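For reference, the control-plane upgrade done in #4007 corresponds to something like the following eksctl invocation (the config file name is a placeholder, not recorded in this thread):

# Hypothetical sketch: upgrade the EKS control plane to the version set in the config file.
eksctl upgrade cluster --config-file=$CLUSTER_NAME.eksctl.yaml --approve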

Clusters with node groups to upgrade

  • 2i2c-aws-us
  • catalystproject-africa
  • earthscope
  • gridsst
  • jupyter-meets-the-earth
  • nasa-cryo
    • Waiting for old user server node group to drain so it can then be deleted
  • nasa-esdis
  • nasa-ghg
  • nasa-veda
  • openscapes
  • opensci
  • smithsonian
  • ubc-eoas
  • victor
@yuvipanda (Member)

Should be paired with someone else (TBD)

@consideRatio (Contributor, Author)

@sgibson91 agreed to do this work during this sprint. I'm around to discuss anything if requested.

@sgibson91 (Member)

Most clusters are done; only the following clusters remain, which still have some user servers up:

Cluster      Running singleuser servers
nasa-cryo    2
nasa-ghg     3

@sgibson91 (Member)

2 of the nasa-cryo user servers have been up for >2 days and 1 of the nasa-ghg user servers has been up for 36 hours. I suspect these might be abandoned, so maybe they're fine to kick off?

@consideRatio (Contributor, Author)

> 2 of the nasa-cryo user servers have been up for >2 days and 1 of the nasa-ghg user servers has been up for 36 hours. I suspect these might be abandoned, so maybe they're fine to kick off?

You could do a rolling upgrade without draining the node pool they are using: in the "taint and wait" step, push the changes up to that point and get a PR merged, leaving only a comment saying we still need to delete a node pool, something that can hopefully be done next week.
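A minimal sketch of that taint step, assuming the nodes carry eksctl's standard alpha.eksctl.io/nodegroup-name label (the node group name is a placeholder):

# Taint every node in the old node group so no new pods schedule there,
# while leaving the running user servers untouched (no drain, no evictions).
kubectl taint node -l alpha.eksctl.io/nodegroup-name=<old-nodegroup> manual-phaseout:NoSchedule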

@sgibson91 (Member)

I've done that for nasa-cryo. I was about to do it for nasa-ghg, but the public ssh key isn't in the repo and so eksctl commands failed 😕

@sgibson91 (Member)

Is there a way to reverse this command?

kubectl taint node manual-phaseout:NoSchedule -l alpha.eksctl.io/nodegroup-name=nb-r5-4xlarge

@consideRatio (Contributor, Author)

@sgibson91 ah hmmm okay!

Hmmmm, deleting a taint can be done with kubectl edit node <node name>; maybe there is a kubectl one-liner too, but I haven't found one yet.
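For reference, in the kubectl edit node view the taint applied above shows up as an entry under spec.taints, which you can delete and save to remove it. A sketch of what that block looks like:

spec:
  taints:
  - key: manual-phaseout
    effect: NoSchedule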

@sgibson91 (Member)

Thanks Erik. I used that command to remove the taint, and have opened #4122

@sgibson91 (Member)

Upgrading the core node group for nasa-ghg was also going to remove node group "nb-c5-4xlarge", which I believe comes from #4100, so I used --exclude="nb-c5-4xlarge" to leave it alone.
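A sketch of the kind of invocation this refers to, assuming the upgrade is driven from an eksctl config file and uses eksctl's nodegroup include/exclude filters (the exact subcommand and file name are not recorded in this thread):

# Hypothetical: delete node groups no longer in the config file, but leave
# nb-c5-4xlarge untouched via the --exclude filter.
eksctl delete nodegroup --config-file=nasa-ghg.eksctl.yaml --only-missing --exclude="nb-c5-4xlarge" --approve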

@sgibson91 (Member)

I've put in a reminder for myself on Tuesday to check if the old node groups have drained (Monday is a holiday in the UK and I'll be coming back from a weekend away)

@sgibson91 (Member) commented Jul 24, 2024

> Hmmmm, deleting a taint can be done with kubectl edit node <node name>; maybe there is a kubectl one-liner too, but I haven't found one yet.

@consideRatio I learned a one-liner! For example, to remove the node-role.kubernetes.io/control-plane taint from all nodes (I don't know why you'd want to do this, but it's the example that came up in my course):

kubectl taint nodes --all node-role.kubernetes.io/control-plane-

The trick is the extra - at the end, which is the syntax for removing a taint.

So I imagine a more targeted command would be

kubectl taint node <node-name> <taint-name>-
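Applied to the taint from earlier in this thread, that would presumably look like this (using the same label selector instead of a node name):

# Remove the manual-phaseout taint from all nodes in the node group.
kubectl taint node -l alpha.eksctl.io/nodegroup-name=nb-r5-4xlarge manual-phaseout:NoSchedule-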

@consideRatio (Contributor, Author)

Thank you for sharing this @sgibson91!!
