Use GKE's native NVidia driver installer for GPUs #3478

Merged 1 commit on Dec 1, 2023

yuvipanda
Member

  • Remove our custom GPU installer daemonset, as GKE now supports automatically doing it (like eksctl does).
  • Switch from installing 'latest' to using default driver, which is slightly older (version 470 with CUDA 11.4, vs version 530 with CUDA 12). There seems to be a bug with the latest driver causing the GPU to not be usable by non-root users, so let's stick to this until that is resolved.
  • Apply these changes to the LEAP hub already. m2lines is about to be decommissioned, so it is not necessary there.
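
One way to confirm the driver switch from inside a user server is to run `nvidia-smi` as the (non-root) notebook user and read the versions from its banner. The following is only a minimal sketch under stated assumptions: it assumes `nvidia-smi` is on the `PATH` of the user image, the helper names `parse_versions` and `check_gpu` are hypothetical and not part of this PR, and the exact version strings will depend on what GKE installs as its default driver.

```python
import os
import re
import shutil
import subprocess


def parse_versions(smi_output: str) -> dict:
    """Extract the driver and CUDA versions from the nvidia-smi banner text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", smi_output)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }


def check_gpu() -> dict:
    """Run nvidia-smi as the current user and report what it sees."""
    report = {"uid": os.getuid(), "usable": False, "driver": None, "cuda": None}
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tooling visible in this container at all.
        return report
    result = subprocess.run([exe], capture_output=True, text=True)
    if result.returncode == 0:
        # A zero exit code means this user could talk to the driver.
        report["usable"] = True
        report.update(parse_versions(result.stdout))
    return report


if __name__ == "__main__":
    print(check_gpu())
```

If `usable` is `False` while `uid` is non-zero, but the same check succeeds as root, that would be consistent with the non-root problem described above for the 'latest' driver.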

@yuvipanda yuvipanda requested a review from a team as a code owner November 30, 2023 20:31

github-actions bot commented Nov 30, 2023

Merging this PR will trigger the following deployment actions.

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| gcp | m2lines | Yes | Support helm chart has been modified | No | |
| gcp | hhmi | Yes | Support helm chart has been modified | No | |
| gcp | linked-earth | Yes | Support helm chart has been modified | No | |
| aws | ubc-eoas | Yes | Support helm chart has been modified | No | |
| gcp | 2i2c | Yes | Support helm chart has been modified | No | |
| aws | nasa-veda | Yes | Support helm chart has been modified | No | |
| aws | smithsonian | Yes | Support helm chart has been modified | No | |
| gcp | callysto | Yes | Support helm chart has been modified | No | |
| aws | carbonplan | Yes | Support helm chart has been modified | No | |
| aws | jupyter-meets-the-earth | Yes | Support helm chart has been modified | No | |
| gcp | cloudbank | Yes | Support helm chart has been modified | No | |
| gcp | catalystproject-latam | Yes | Support helm chart has been modified | No | |
| gcp | awi-ciroh | Yes | Support helm chart has been modified | No | |
| aws | nasa-ghg | Yes | Support helm chart has been modified | No | |
| aws | openscapes | Yes | Support helm chart has been modified | No | |
| gcp | qcl | Yes | Support helm chart has been modified | No | |
| aws | nasa-cryo | Yes | Support helm chart has been modified | No | |
| gcp | leap | Yes | Support helm chart has been modified | No | |
| aws | catalystproject-africa | Yes | Support helm chart has been modified | No | |
| gcp | pangeo-hubs | Yes | Support helm chart has been modified | No | |
| gcp | 2i2c-uk | Yes | Support helm chart has been modified | No | |
| aws | victor | Yes | Support helm chart has been modified | No | |
| aws | gridsst | Yes | Support helm chart has been modified | No | |
| aws | 2i2c-aws-us | Yes | Support helm chart has been modified | No | |
| gcp | meom-ige | Yes | Support helm chart has been modified | No | |
| kubeconfig | utoronto | Yes | Support helm chart has been modified | No | |

Production deployments

No production hub upgrades will be triggered

Member

@sgibson91 sgibson91 left a comment

🎉

@yuvipanda yuvipanda merged commit cb10f6d into 2i2c-org:master Dec 1, 2023
32 checks passed

github-actions bot commented Dec 1, 2023

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/7064404222

Status: Done 🎉