Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

terraform, azure and utoronto: an upgrade, misc to support it, and misc opportunistic details #3596

Merged
merged 8 commits into from
Jan 12, 2024

Conversation

consideRatio
Copy link
Contributor

@consideRatio consideRatio commented Jan 9, 2024

The terraform changes are applied already. I've done some things besides just upgrading the k8s cluster version that I believed were resonable to go for opportunistically.

Changes overviewed

  • a user node pool switch from Standard_E8s_v3 to _v5, which is like r3.2xlarge to r5.2xlarge in AWS. This price of these are the same.
  • a user node pool switch and a core node pool a prepared transition from use of persistent machine disks on user nodes to ephemeral machine disks (like in GKE also for example)
  • (reverted, autoscaling and 3 replicas remains) a pinning of core node pool count to 2 in order to avoid paying for an empty core node pool due to AKS / UToronto: calico-typha scaled to three replicas, forcing three core nodes #3592
  • added resolution of configuration in order to
    • enable upgrade of control plan and node pools separately
    • enable a non-disruptive transitions from misc machine types
  • misc fixme notes and comments with notable research behind them
  • removal of no longer relevant label (hub.jupyter.org/node-size)
  • an AKS transition towards referencing "notebook nodes" as "user nodes"
  • (added post-review) refactored use of multiple node pool variables to a single variable to reduce duplication of node pool variable definitions

This comment was relevant before 2i2c-org#3595, but not after.
@consideRatio consideRatio marked this pull request as ready for review January 9, 2024 17:18
@consideRatio consideRatio requested a review from a team as a code owner January 9, 2024 17:18
@consideRatio consideRatio changed the title terraform, azure and utoronto: an upgrade and misc to support it terraform, azure and utoronto: an upgrade, misc to support it, and misc opportunistic details Jan 9, 2024
terraform/azure/projects/utoronto.tfvars Outdated Show resolved Hide resolved
terraform/azure/variables.tf Outdated Show resolved Hide resolved
terraform/azure/variables.tf Outdated Show resolved Hide resolved
@consideRatio
Copy link
Contributor Author

consideRatio commented Jan 10, 2024

@yuvipanda thank you soo much for the quick an thorugh review!

  • (requested) I went back to autoscaling the core node pool and applied the change
  • (requested) I made name required for node pools
  • (not requested) I combined multiple node pool variables into a single variable to reduce the three ~identical variable definitions to one, and made a variable map into a list as the keys were entirely ignored anyhow
  • (not requested) I added and refined some comments

Copy link
Member

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @consideRatio! This mostly looks good to me - however, I'd like us to do one of two things:

  1. Have separate core_node_pool, notebook_nodepools and dask_nodepool, to match what we have on GCP
  2. Encode via type definition somehow that the only supported keys for the node_pools map are "core", "user" and "dask"

Currently, (2) is encoded in the code - if we use any other key, it will just silently fail.

I think (2) is preferable (here and in GCP too, although that's out of scope) but if that's not possible I'd like us to do (1) than leave it implicit.

I appreciate your cleanup efforts, @consideRatio!

@consideRatio
Copy link
Contributor Author

consideRatio commented Jan 11, 2024

Thank you for reviewing @yuvipanda!!

if we use any other key, it will just silently fail

I think I can come up with a way to make this fail loudly, but not via the type definition. I tried to do this myself but failed to find a way to do it without breaking the DRY principle and verbosly duplicating the type definition for any individual node pool. I don't want to break DRY in this case, I find it gets too messy and hard to maintain.

I'm aiming to meet your request for changes by failing loudly if:

  • node_pools variable includes keys besides core, user, and dask
  • node_pools value for core doesn't include exactly one node pool (since one and only one node pool maps to default_node_pool)

@consideRatio
Copy link
Contributor Author

Ahh googling for "terraform assertions" I ended up finding that there can be a "custom validation" block for variables like documented in https://developer.hashicorp.com/terraform/language/values/variables#custom-validation-rules - I'll go for that.

Copy link
Member

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome work, and TIL about terraform validation too. Thank you for digging into this, @consideRatio! LGTM now.

@consideRatio consideRatio merged commit 479b04b into 2i2c-org:master Jan 12, 2024
3 checks passed
@consideRatio
Copy link
Contributor Author

Thank you for reviewing @yuvipanda !!

@yuvipanda
Copy link
Member

@consideRatio thank you for patiently doing a couple rounds with me async on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done 🎉
Development

Successfully merging this pull request may close these issues.

2 participants