
Add terraform config for an AWS cluster #2467

Merged (18 commits, Apr 19, 2023)

Conversation

Member

@sgibson91 sgibson91 commented Dec 15, 2022

related to #2449

This PR adds terraform config to deploy a k8s cluster into an AWS account, and also adds the cluster-autoscaler helm chart as a dependency of the mybinder helm chart, since autoscaling is not something that comes out of the box with EKS.

Currently the terraform code assumes no ECR; an external registry such as quay.io will be used instead. A storage bucket could be added in the future to store the terraform state, and some input is still needed regarding which AWS account to use (it currently relies on environment variables).


The error below was avoided by removing the problematic config: by choosing an external container registry over ECR, there was no need to create the role. #2467 (comment)


Following https://github.com/hashicorp/learn-terraform-provision-eks-cluster and other bits and pieces, this is the terraform config I've come up with so far to deploy a cluster and supporting infrastructure on AWS.
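For reference, the config that tutorial produces is shaped roughly like the sketch below. The module sources are the real terraform-aws-modules ones, but the names, CIDRs, AZs, and instance types here are illustrative assumptions rather than the exact values in this PR:

```terraform
# Sketch following the learn-terraform-provision-eks-cluster tutorial.
# All names and sizing values below are placeholders.
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name            = "mybinder-vpc"
  cidr            = "10.0.0.0/16"
  azs             = ["us-west-2a", "us-west-2b", "us-west-2c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]

  enable_nat_gateway = true
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name = "mybinder"

  # Attach the cluster to the VPC created above.
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    core = {
      instance_types = ["m5.large"]
      min_size       = 1
      max_size       = 4
    }
  }
}
```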

Currently I get the following error when deploying this:

╷
│ Error: failed creating IAM Role (pangeo-mybinder-k8s-ecr-iam-role): MalformedPolicyDocument: Has prohibited field Resource
│ 	status code: 400, request id: 6e4da11d-7d10-4d2c-84c6-72cf624bca4d
│
│   with aws_iam_role.k8s_ecr_iam_role,
│   on main.tf line 140, in resource "aws_iam_role" "k8s_ecr_iam_role":
│  140: resource "aws_iam_role" "k8s_ecr_iam_role" {
│
╵

The policy in question was retrieved from this section of some as-yet-unmerged documentation, so it may simply be incorrect to use the Resource attribute here:

This allows for addition of aws-centric terraform files without conflicting with the gcp ones
Update title of gcp/README.md to indicate that it is GCP-specific config; Add terraform/README.md to describe multi-cloud layout of config
Config generated by following this tutorial: https://github.com/hashicorp/learn-terraform-provision-eks-cluster Makes use of eks and vpc terraform modules for AWS
Create an image repository in a container registry and output its URL. We do not need to create the container registry, since every AWS account comes with a private container registry by default.
Create an IAM user with permissions to access the cluster for use from CI/CD. Output its key for storage.
@sgibson91
Member Author

The two READMEs look like I've edited more than I did. What I did was:

  • Moved the original terraform/README.md to terraform/gcp/README.md and changed the title to indicate that it refers to GKE only
  • Created a new terraform/README.md file explaining the need for different folders/config for different cloud providers

"sts:AssumeRole"
],
"Effect": "Allow",
"Resource": "${aws_ecr_repository.image_repo.arn}*",
Contributor


Temporarily, does getting rid of all the "Resource" fields or converting their values to "*" make the config apply?

Member

@manics manics Dec 16, 2022


This is the assume role policy, so it says which entities (e.g. users, or other AWS services) are allowed to assume the role. E.g. the example in https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role#basic-example allows the role to be used as an EC2 instance role. The privileges for the role are specified separately, either using inline_policy, or in a separate aws_iam_policy linked to the role using aws_iam_policy_attachment.

If you want to use an IRSA role which you can link directly to a K8s service account ... it's more complicated, as you need a K8s OIDC provider to act as the entity that assumes the role. I've got an example somewhere which I can dig out if you want?
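To illustrate that split: the trust policy only names who may assume the role (Principal is required there, Resource is prohibited), while the role's privileges live in a separate policy, where a Resource field belongs. A sketch, reusing the role name from the error above; the EC2 principal and the specific ECR actions are illustrative assumptions:

```terraform
# Trust policy: WHO may assume the role. Principal required, Resource prohibited.
resource "aws_iam_role" "k8s_ecr_iam_role" {
  name = "pangeo-mybinder-k8s-ecr-iam-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" } # illustrative principal
    }]
  })
}

# Privileges: WHAT the role may do, with a Resource field, in a separate policy.
resource "aws_iam_policy" "ecr_push_pull" {
  name = "pangeo-mybinder-ecr-push-pull"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action   = ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer", "ecr:PutImage"]
      Effect   = "Allow"
      Resource = "${aws_ecr_repository.image_repo.arn}*"
    }]
  })
}

# Link the policy to the role.
resource "aws_iam_role_policy_attachment" "ecr_push_pull" {
  role       = aws_iam_role.k8s_ecr_iam_role.name
  policy_arn = aws_iam_policy.ecr_push_pull.arn
}
```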

Member Author

@sgibson91 sgibson91 Dec 16, 2022


  • Setting all "Resource" values to "*" produces the same error
  • Got a different error by removing all instances of "Resource"
│ Error: failed creating IAM Role (pangeo-mybinder-k8s-ecr-iam-role): MalformedPolicyDocument: Missing required field Principal
│ 	status code: 400, request id: be65c496-2b4c-4dbf-8676-8a2b8e29bf5e
│
│   with aws_iam_role.k8s_ecr_iam_role,
│   on main.tf line 140, in resource "aws_iam_role" "k8s_ecr_iam_role":
│  140: resource "aws_iam_role" "k8s_ecr_iam_role" {
│

Member Author


Sure, I don't know how to move this forward otherwise. Though I won't be doing any more work here until the new year.

We will not be using ECR: it requires an image repo to already exist
before an image can be pushed to it, which would require logic changes
in BinderHub itself. Hence we abandon this strategy in favour of
pushing to quay.io or similar.
@sgibson91
Member Author

@consideRatio and I debugged this a little today and ran up against a problem with ECR.

ECR as a container registry, compared to GCR/Docker Hub/quay.io, requires the creation of an image repository before we push to it. This goes against the binderhub software's assumption that you can push directly to a new repository name to create it.

This is documented by AWS in the ECR documentation.

Instead, we intend to use quay.io. Other federation members, such as Turing and GESIS, have used Docker Hub in the past, but we have started using quay.io after Docker Hub tightened its free tier. Both services are free and could provide us with a container registry. (We also did some digging and found that the original Pangeo AWS BinderHub was using Docker Hub too, not ECR.) We have created the mybinder-org organisation on quay.io and added some of the JupyterHub team members over there to store the images BinderHub produces.

This decision unblocks this PR since we no longer need to create a role that can push to/pull from an ECR. We will instead create a robot account on quay.io and provide the username and password in encrypted config files read by the BinderHub helm chart (in a follow-up PR).
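The follow-up config would look roughly like this in the BinderHub helm chart's values (a sketch only: the robot account name and image prefix are placeholders, and the exact keys should be checked against the BinderHub chart's schema):

```yaml
# Sketch of BinderHub helm chart values for an external quay.io registry.
# Username/password would come from the encrypted config files, not plain text.
registry:
  url: https://quay.io
  username: mybinder-org+robot        # placeholder robot account name
  password: "<from-encrypted-config>" # placeholder, injected at deploy time
config:
  BinderHub:
    use_registry: true
    image_prefix: quay.io/mybinder-org/binder-  # placeholder prefix
```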

@minrk
Member

minrk commented Mar 31, 2023

OVH has been having quite a bit of trouble with the registry: OVH's private container registry has a very small size limit and seems to have lots of performance problems. So I think perhaps OVH should try quay.io as well, and see how it goes.

@sgibson91
Member Author

I have just pushed the following commit, which sets up cluster-autoscaler as an optional dependency of the helm chart (because this is another thing we don't get out of the box with EKS).

I would've preferred to do this in a new PR, as this is starting to move out of terraform set-up and into helm chart config, but I thought having the PR reflect the most recent status was more important.
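For context, an optional helm chart dependency is declared roughly like this in the chart's Chart.yaml. The repository is the upstream cluster-autoscaler chart's; the version pin and condition key here are illustrative, not necessarily the ones in the commit:

```yaml
# Sketch: cluster-autoscaler as an optional dependency, gated by a values flag.
dependencies:
  - name: cluster-autoscaler
    version: 9.28.0   # illustrative pin
    repository: https://kubernetes.github.io/autoscaler
    condition: cluster-autoscaler.enabled
```

With this in place, the subchart is only installed when `cluster-autoscaler.enabled: true` is set in the deployment's values, so GKE-based deployments that don't need it are unaffected.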

@sgibson91 sgibson91 changed the title [WIP] Add terraform config for an AWS cluster Add terraform config for an AWS cluster Apr 18, 2023
@sgibson91 sgibson91 marked this pull request as ready for review April 18, 2023 15:45
@sgibson91
Member Author

In the interest of getting unblocked, I've marked this as ready for review

Contributor

@yuvipanda yuvipanda left a comment


Awesome work! Left some comments :)

Does terraform apply run cleanly now?


eks_managed_node_group_defaults = {
# Disabling and using externally provided security groups
create_security_group = false
Contributor


Can you help me understand why we are managing security groups ourselves instead of having the module manage them for us?

Member Author


I think it was to make certain that they were explicitly connected to the VPC I'd created, by passing its ID. There's nowhere else in the terraform config for the eks module that says "attach this EKS cluster to this VPC", so I'm uncertain whether that happens automatically and figured explicit is better than implicit.
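A sketch of that wiring (resource names are illustrative): the VPC ID and subnets are passed to the eks module explicitly, and node groups are pointed at an externally managed security group rather than one the module creates:

```terraform
# Illustrative sketch: explicit VPC attachment plus externally provided
# security groups (names are assumptions, not the PR's exact config).
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name = "mybinder"

  # Explicitly attach the cluster to our VPC and its subnets.
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_group_defaults = {
    # Disabling and using externally provided security groups
    create_security_group  = false
    vpc_security_group_ids = [aws_security_group.node_group_all.id]
  }
}
```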

sgibson91 and others added 2 commits April 19, 2023 09:17
Co-authored-by: Yuvi Panda <[email protected]>
We will be able to use AWS SSM to gain access to nodes instead, c.f.
https://infrastructure.2i2c.org/en/latest/howto/troubleshoot/ssh.html#aws
@sgibson91
Member Author

sgibson91 commented Apr 19, 2023

Does terraform apply run cleanly now?

@yuvipanda Yes, but that's because we dropped the problematic config, favouring a quay.io registry instead of an AWS ECR, c.f. #2467 (comment). If there's no ECR to push to, then there's no need to create the role that was causing terraform not to apply cleanly.

@yuvipanda
Contributor

Ah that's awesome! Using quay seems fine to me.

@sgibson91
Member Author

Update: I just clocked the phrase "credits have depleted" in this Discourse post about taking down the Pangeo AWS deployments https://discourse.pangeo.io/t/aws-pangeo-jupyterhubs-to-shut-down-friday-march-17/3228

So maybe we need to find somewhere else to test this?

@manics
Member

manics commented Apr 19, 2023

I've suggested some steps in #2556 (comment)

@sgibson91
Member Author

To be clear, this PR has been tested. Yuvi and I both have/had access to Scott's AWS account for the Pangeo Binder deployment, and this terraform code applies cleanly over there. The problem is that we can't continue to use that AWS account to develop and test now that the credits have been depleted.

So for this PR specifically, I think I would like it merged or somehow resolved relatively soon so that 1) it doesn't need to continually be rebased, as it touches other terraform config to move them into subfolders, and 2) when we get access to another AWS account, someone can open a new PR and iterate as needed, without having to manage my PR as well. And I say someone because I don't believe I have the capacity to technically lead this.

@manics
Member

manics commented Apr 19, 2023

@sgibson91 Thanks for clarifying! I couldn't tell from all the comments what the state of this was. I don't feel able to review this without access to a deployment, but I'm happy for @yuvipanda to make the decision, and I can take on the follow-up work when we have AWS access.

@yuvipanda
Contributor

@manics we can give you access!

@sgibson91
Member Author

Also, the PR only handles terraform, and a tiny bit of helm stuff regarding autoscaling. No deployment of BinderHub included here.

@manics
Member

manics commented Apr 19, 2023

@yuvipanda Thanks, but to avoid delaying this further, if you've already reviewed this I think we should merge it when you're happy with it.

@yuvipanda yuvipanda merged commit 04f2157 into jupyterhub:main Apr 19, 2023
@yuvipanda
Contributor

Great, done! We can iterate on this :)

Thanks for working on this, @sgibson91!

@sgibson91
Member Author

Thank you everybody for your input and patience!
