terraform-gke-kubeflow-cluster

A Terraform module for creating a GKE cluster to run Kubeflow on.

This module creates a GKE cluster similar to how the kfctl tool does, with a few changes:

  • adds a Cloud SQL instance to use for the metadata store/databases
  • creates a GCE Persistent Disk to use for the artifact store

This module was originally created by the ML Infrastructure team at Spotify to create and manage long-lived GKE clusters shared by many Kubeflow-using teams at Spotify. By contrast, the kfctl tool and the documentation around creating a cluster for Kubeflow tend to assume that individual clusters are quickly spun up and torn down by the engineers using Kubeflow. For more details on Spotify's centralized Kubeflow platform, see this talk from KubeCon North America 2019.

Usage

To use this within Terraform, add a module block like:

module "kubeflow-cluster" {
  source  = "spotify/kubeflow-cluster/gke"
  version = "0.0.1"
}

For more details, see https://registry.terraform.io/modules/spotify/kubeflow-cluster/gke/0.0.1

Module details

The terraform-gke-kubeflow-cluster module creates the following resources:

  • a GKE cluster (attached to a Shared VPC if the relevant parameters for networks/subnetworks are set)
  • a Cloud SQL instance to use for the metadata store/databases
  • a GCE Persistent Disk to use for Argo's artifact store
  • GCP service accounts for Kubeflow to use (distinct accounts per cluster):
    • an "admin" service account (used for IAP - which is not included in this module)
    • the "user" service account for Kubeflow pipelines to use
    • the VM service account used by the GKE cluster nodes themselves
  • IAM bindings for the above service accounts
  • Kubernetes secrets for:
    • cloudsql-instance-credentials for the cloudsql-proxy connected to the metadata SQL instance
    • admin-gcp-sa containing the "admin" GCP service account for Kubeflow
    • user-gcp-sa containing the "user" GCP service account for Kubeflow

Each "instantiation" of the module creates a new set of all of these resources. The intent of the module is to automate the entire setup of all of the GCP resources needed to run a Kubeflow cluster.
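As a rough sketch of what separate instantiations look like, the configuration below declares the module twice. The module names `kubeflow-cluster-a` and `kubeflow-cluster-b` are hypothetical, and the module's input variables are deliberately left out because they are not listed in this README; see the Terraform Registry page for the actual inputs.

```hcl
# Hypothetical example: two independent instantiations of the module.
# Each block results in its own GKE cluster, Cloud SQL instance,
# Persistent Disk, service accounts, IAM bindings, and Kubernetes secrets.

module "kubeflow-cluster-a" {
  source  = "spotify/kubeflow-cluster/gke"
  version = "0.0.1"

  # Input variables for the first cluster go here (omitted; see the registry docs).
}

module "kubeflow-cluster-b" {
  source  = "spotify/kubeflow-cluster/gke"
  version = "0.0.1"

  # Input variables for the second cluster go here (omitted; see the registry docs).
}
```

Because the service accounts are distinct per cluster, the two instantiations should not share credentials.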

This repo does not currently install the Kubeflow system components on the cluster; use kfctl or another tool for that.

Local development

Run the following commands from the root of the project:

  1. brew install tfenv -- install tfenv
  2. tfenv install -- install the version of Terraform specified in .terraform-version in source control
  3. terraform init -- set up Terraform providers

Note on master and node version values

The expected behavior of fuzzy versions for min_master_version and node_version is undocumented (GitHub issue). Empirically, the most recent available version that matches the fuzzy version is used. For example, node_version = "1.11" results in GKE nodes running 1.11.7-gke.6 if that is the most recent matching version.
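The difference between a fuzzy and an exact version can be sketched as follows. This assumes that min_master_version and node_version are passed as inputs to the module block under those names, which this README implies but does not state; the other inputs are omitted.

```hcl
module "kubeflow-cluster" {
  source  = "spotify/kubeflow-cluster/gke"
  version = "0.0.1"

  # Assumption: these are module inputs with these exact names.
  # Fuzzy versions: observed to resolve to the most recent matching
  # GKE version, so the result can change over time.
  min_master_version = "1.11"
  node_version       = "1.11"

  # Exact version: use a full version string instead if the nodes
  # should not move to a newer patch release.
  # node_version = "1.11.7-gke.6"
}
```

Since the fuzzy behavior is undocumented, an exact version string is the more predictable choice for long-lived clusters.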

Releasing new versions of the module

See https://www.terraform.io/docs/registry/modules/publish.html#releasing-new-versions

A webhook has been automatically added to the repo, and a new "release" becomes visible in the Terraform Registry whenever a tag that looks like a semantic version (e.g. "v1.2.3") is pushed. To cut a release, tag a commit and push the tag to GitHub with git push --tags.

Code of Conduct

This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.
