
Add terraform config for an AWS cluster #2467

Merged (18 commits, Apr 19, 2023)

Conversation

Member

@sgibson91 sgibson91 commented Dec 15, 2022

related to #2449

This PR adds terraform config to deploy a k8s cluster into an AWS account, and also adds the cluster-autoscaler helm chart as a dependency of the mybinder helm chart, since autoscaling is not something that comes out of the box with EKS.

Currently the terraform code assumes no ECR; an external registry such as quay.io will be used instead. A storage bucket could be added in the future to store the terraform state, and some input is still needed regarding which AWS account to use (it currently relies on environment variables).


The error below was avoided by removing the problematic config: by choosing an external container registry over ECR, there was no need to create the role. #2467 (comment)


Following https://github.com/hashicorp/learn-terraform-provision-eks-cluster and other bits and pieces, this is the terraform config I've come up with so far to deploy a cluster and supporting infrastructure on AWS.
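For reference, the config that tutorial produces is shaped roughly like the sketch below. The module sources are the real terraform-aws-modules ones, but the names, CIDRs, AZs, and instance types here are illustrative assumptions rather than the exact values in this PR:

```terraform
# Sketch following the learn-terraform-provision-eks-cluster tutorial.
# All names and sizing values below are placeholders.
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name            = "mybinder-vpc"
  cidr            = "10.0.0.0/16"
  azs             = ["us-west-2a", "us-west-2b", "us-west-2c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]

  enable_nat_gateway = true
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name = "mybinder"

  # Attach the cluster to the VPC created above.
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    core = {
      instance_types = ["m5.large"]
      min_size       = 1
      max_size       = 4
    }
  }
}
```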

Currently I get the following error when deploying this:

╷
│ Error: failed creating IAM Role (pangeo-mybinder-k8s-ecr-iam-role): MalformedPolicyDocument: Has prohibited field Resource
│ 	status code: 400, request id: 6e4da11d-7d10-4d2c-84c6-72cf624bca4d
│
│   with aws_iam_role.k8s_ecr_iam_role,
│   on main.tf line 140, in resource "aws_iam_role" "k8s_ecr_iam_role":
│  140: resource "aws_iam_role" "k8s_ecr_iam_role" {
│
╵

The policy in question was retrieved from this section of some as-yet-unmerged documentation, so it may simply be incorrect to use the Resource attribute here:

This allows for addition of aws-centric terraform files without conflicting with the gcp ones
Update title of gcp/README.md to indicate that it is GCP-specific config; Add terraform/README.md to describe multi-cloud layout of config
Config generated by following this tutorial: https://github.com/hashicorp/learn-terraform-provision-eks-cluster Makes use of eks and vpc terraform modules for AWS
Create an image repository in a container registry and output its URL. We do not need to create the container registry, since every AWS account comes with a private container registry by default.
Create an IAM user with permissions to access the cluster for use from CI/CD. Output its key for storage.
@sgibson91
Member Author

The two READMEs look like I've edited more than I did. What I did was:

  • Moved the original terraform/README.md to terraform/gcp/README.md and changed the title to indicate that it refers to GKE only
  • Created a new terraform/README.md file explaining the need for different folders/config for different cloud providers

"sts:AssumeRole"
],
"Effect": "Allow",
"Resource": "${aws_ecr_repository.image_repo.arn}*",
Contributor


Temporarily, does getting rid of all the "Resource" fields or converting their values to "*" make the config apply?

Member

@manics manics Dec 16, 2022


This is the assume role policy, so it says which entities (e.g. users, or other AWS services) are allowed to assume the role. E.g. the example in https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role#basic-example allows the role to be used as an EC2 instance role. The privileges for the role are specified separately, either using inline_policy, or in a separate aws_iam_policy linked to the role using aws_iam_policy_attachment.

If you want to use an IRSA role which you can link directly to a K8s service account ... it's more complicated, as you need a K8s OIDC provider to act as the entity that assumes the role. I've got an example somewhere which I can dig out if you want?
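To illustrate that split: the trust policy only names who may assume the role (Principal is required there, Resource is prohibited), while the role's privileges live in a separate policy, where a Resource field belongs. A sketch, reusing the role name from the error above; the EC2 principal and the specific ECR actions are illustrative assumptions:

```terraform
# Trust policy: WHO may assume the role. Principal required, Resource prohibited.
resource "aws_iam_role" "k8s_ecr_iam_role" {
  name = "pangeo-mybinder-k8s-ecr-iam-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" } # illustrative principal
    }]
  })
}

# Privileges: WHAT the role may do, with a Resource field, in a separate policy.
resource "aws_iam_policy" "ecr_push_pull" {
  name = "pangeo-mybinder-ecr-push-pull"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action   = ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer", "ecr:PutImage"]
      Effect   = "Allow"
      Resource = "${aws_ecr_repository.image_repo.arn}*"
    }]
  })
}

# Link the policy to the role.
resource "aws_iam_role_policy_attachment" "ecr_push_pull" {
  role       = aws_iam_role.k8s_ecr_iam_role.name
  policy_arn = aws_iam_policy.ecr_push_pull.arn
}
```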

Member Author

@sgibson91 sgibson91 Dec 16, 2022


  • Setting all "Resource" values to "*" produces the same error
  • Got a different error by removing all instances of "Resource"
│ Error: failed creating IAM Role (pangeo-mybinder-k8s-ecr-iam-role): MalformedPolicyDocument: Missing required field Principal
│ 	status code: 400, request id: be65c496-2b4c-4dbf-8676-8a2b8e29bf5e
│
│   with aws_iam_role.k8s_ecr_iam_role,
│   on main.tf line 140, in resource "aws_iam_role" "k8s_ecr_iam_role":
│  140: resource "aws_iam_role" "k8s_ecr_iam_role" {
│

Member Author


Sure, I don't know how to move this forward otherwise. Though I won't be doing any more work here until the new year.

We will not be using ECR: it requires an image repo to already exist
before an image can be pushed to it, which would require logic changes
in BinderHub itself. Hence we abandon this strategy in favour of
pushing to quay.io or similar.
@sgibson91
Member Author

@consideRatio and I debugged this a little today and ran up against a problem with ECR.

ECR as a container registry, compared to GCR/Docker Hub/quay.io, requires the creation of an image repository before we push to it. This goes against the binderhub software's assumption that you can push directly to a new repository name to create it.

This is documented by AWS in the ECR documentation.

Instead, we intend to use quay.io. Other federation members, such as Turing and GESIS, have used Docker Hub in the past, but we have started using quay.io after Docker Hub tightened its free tier. Both services are free and could provide us with a container registry. (We also did some digging and found that the original Pangeo AWS BinderHub was using Docker Hub too, not ECR.) We have created the mybinder-org organisation on quay.io and added some of the JupyterHub team members over there to store the images BinderHub produces.

This decision unblocks this PR since we no longer need to create a role that can push to/pull from an ECR. We will instead create a robot account on quay.io and provide the username and password in encrypted config files read by the BinderHub helm chart (in a follow-up PR).
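The follow-up config would look roughly like this in the BinderHub helm chart's values (a sketch only: the robot account name and image prefix are placeholders, and the exact keys should be checked against the BinderHub chart's schema):

```yaml
# Sketch of BinderHub helm chart values for an external quay.io registry.
# Username/password would come from the encrypted config files, not plain text.
registry:
  url: https://quay.io
  username: mybinder-org+robot        # placeholder robot account name
  password: "<from-encrypted-config>" # placeholder, injected at deploy time
config:
  BinderHub:
    use_registry: true
    image_prefix: quay.io/mybinder-org/binder-  # placeholder prefix
```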

@minrk
Member

minrk commented Mar 31, 2023

OVH has been having quite a bit of trouble with the registry: OVH's private container registry has a very small size limit and seems to have lots of performance problems. So I think perhaps OVH should try quay.io as well, and see how it goes.

@sgibson91
Member Author

I have just pushed the following commit, which sets up cluster-autoscaler as an optional dependency of the helm chart (because this is another thing we don't get out of the box with EKS).

I would've preferred to do this in a new PR, as this is starting to move out of terraform set-up and into helm chart config, but I thought having the PR reflect the most recent status was more important.
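For context, an optional helm chart dependency is declared roughly like this in the chart's Chart.yaml. The repository is the upstream cluster-autoscaler chart's; the version pin and condition key here are illustrative, not necessarily the ones in the commit:

```yaml
# Sketch: cluster-autoscaler as an optional dependency, gated by a values flag.
dependencies:
  - name: cluster-autoscaler
    version: 9.28.0   # illustrative pin
    repository: https://kubernetes.github.io/autoscaler
    condition: cluster-autoscaler.enabled
```

With this in place, the subchart is only installed when `cluster-autoscaler.enabled: true` is set in the deployment's values, so GKE-based deployments that don't need it are unaffected.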

@sgibson91 sgibson91 changed the title [WIP] Add terraform config for an AWS cluster Add terraform config for an AWS cluster Apr 18, 2023
@sgibson91 sgibson91 marked this pull request as ready for review April 18, 2023 15:45
@sgibson91
Member Author

In the interest of getting unblocked, I've marked this as ready for review

Contributor

@yuvipanda yuvipanda left a comment


Awesome work! Left some comments :)

Does terraform apply run cleanly now?


eks_managed_node_group_defaults = {
# Disabling and using externally provided security groups
create_security_group = false
Contributor


Can you help me understand why we are managing security groups ourselves instead of having the module manage them for us?

Member Author


I think it was to make certain that they were explicitly connected to the VPC I'd created, by passing its ID. There's nowhere else in the terraform config for the eks module that says "attach this EKS cluster to this VPC", so I'm uncertain whether that happens automatically and figured explicit is better than implicit.
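A sketch of that wiring (resource names are illustrative): the VPC ID and subnets are passed to the eks module explicitly, and node groups are pointed at an externally managed security group rather than one the module creates:

```terraform
# Illustrative sketch: explicit VPC attachment plus externally provided
# security groups (names are assumptions, not the PR's exact config).
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name = "mybinder"

  # Explicitly attach the cluster to our VPC and its subnets.
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_group_defaults = {
    # Disabling and using externally provided security groups
    create_security_group  = false
    vpc_security_group_ids = [aws_security_group.node_group_all.id]
  }
}
```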

sgibson91 and others added 2 commits April 19, 2023 09:17
Co-authored-by: Yuvi Panda <[email protected]>
We will be able to use AWS SSM to gain access to nodes instead, c.f.
https://infrastructure.2i2c.org/en/latest/howto/troubleshoot/ssh.html#aws
@sgibson91
Member Author

sgibson91 commented Apr 19, 2023

Does terraform apply run cleanly now?

@yuvipanda Yes, but that's because we dropped the problematic config, favouring a quay.io registry instead of an AWS ECR, c.f. #2467 (comment). If there's no ECR to push to, then there's no need to create the role that was causing terraform not to apply cleanly.

@yuvipanda
Contributor

Ah that's awesome! Using quay seems fine to me.

@sgibson91
Member Author

Update: I just clocked the phrase "credits have depleted" in this Discourse post about taking down the Pangeo AWS deployments https://discourse.pangeo.io/t/aws-pangeo-jupyterhubs-to-shut-down-friday-march-17/3228

So maybe we need to find somewhere else to test this?

@manics
Member

manics commented Apr 19, 2023

I've suggested some steps in #2556 (comment)

@sgibson91
Member Author

To be clear, this PR has been tested. Yuvi and I both have/had access to Scott's AWS account for the Pangeo Binder deployment, and this terraform code applies cleanly over there. The problem is that we can't continue to use that AWS account to develop and test now that the credits have been depleted.

So for this PR specifically, I think I would like it merged or somehow resolved relatively soon so that 1) it doesn't need to continually be rebased, as it touches other terraform config to move them into subfolders, and 2) when we get access to another AWS account, someone can open a new PR and iterate as needed, without having to manage my PR as well. And I say someone because I don't believe I have the capacity to technically lead this.

@manics
Member

manics commented Apr 19, 2023

@sgibson91 Thanks for clarifying! I couldn't tell from all the comments what the state of this was. I don't feel able to review this without access to a deployment, but I'm happy for @yuvipanda to make the decision, and I can take on the follow-up work when we have AWS access.

@yuvipanda
Contributor

@manics we can give you access!

@sgibson91
Member Author

Also, the PR only handles terraform, and a tiny bit of helm stuff regarding autoscaling. No deployment of BinderHub included here.

@manics
Member

manics commented Apr 19, 2023

@yuvipanda Thanks, but to avoid delaying this further, if you've already reviewed this I think we should merge it when you're happy with it.

@yuvipanda yuvipanda merged commit 04f2157 into jupyterhub:main Apr 19, 2023
@yuvipanda
Contributor

Great, done! We can iterate on this :)

Thanks for working on this, @sgibson91!

@sgibson91
Member Author

Thank you everybody for your input and patience!
