-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add terraform config for an AWS cluster #2467
Add terraform config for an AWS cluster #2467
Conversation
This allows for addition of aws-centric terraform files without conflicting with the gcp ones
Update title of gcp/README.md to indicate that it is GCP-specific config; Add terraform/README.md to describe multi-cloud layout of config
Config generated by following this tutorial: https://github.com/hashicorp/learn-terraform-provision-eks-cluster Makes use of eks and vpc terraform modules for AWS
Create an image repository in a container registry and output its URL. We do not need to create the container registry, since every AWS account comes with a private container registry by default.
Create an IAM user with permissions to access the cluster for use from CI/CD. Output its key for storage.
The two READMEs look like I've edited more than I did. What I did was:
|
terraform/aws/pangeo/main.tf
Outdated
"sts:AssumeRole" | ||
], | ||
"Effect": "Allow", | ||
"Resource": "${aws_ecr_repository.image_repo.arn}*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporarily, does getting rid of all the "Resource" fields or converting their values to "*" make the config apply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the assume role policy so it say which entities (e.g. users, or other AWS services) are allowed to assume the role. E.g. the example in https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role#basic-example allows the role to be used as an EC2 instance role. The privileges for the role are specified separately, either using inline_policy
, or in a separate aws_iam_policy
linked to the role using aws_iam_policy_attachment
.
If you want to use an IRSA role which you can link directly to a K8s service account ... it's more complicated as you need an K8s OIDC provider to act as the entity that assumes the role. I've got an example somewhere which I can dig out if you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Setting all "Resource" values to "*" produces the same error
- Got a different error by removing all instances of "Resource"
│ Error: failed creating IAM Role (pangeo-mybinder-k8s-ecr-iam-role): MalformedPolicyDocument: Missing required field Principal
│ status code: 400, request id: be65c496-2b4c-4dbf-8676-8a2b8e29bf5e
│
│ with aws_iam_role.k8s_ecr_iam_role,
│ on main.tf line 140, in resource "aws_iam_role" "k8s_ecr_iam_role":
│ 140: resource "aws_iam_role" "k8s_ecr_iam_role" {
│
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I don't know how to move this forward otherwise. Though I won't be doing any more work here until the new year.
We will not be using ECR, there is a requirement for an image repo to already exist before an image can be pushed to it. This will require logic changes in BinderHub itself. Hence we abandon this strategy instead for pushing to quay.io or similar.
@consideRatio and I debugged this a little today and ran up against a problem with ECR. ECR as a container registry, compared to GCR/DockerHub/quay.io, requires the creation of a image repository before we push to it. This goes against the binderhub software's assumptions that you can push directly to a new repository name to define it. This is documented by AWS as highlighted below: Instead, we intend to use quay.io. Other federation members, such as Turing and GESIS, have used dockerhub in the past, but we have started using quay.io after dockerhub tightened their free tiers a bit. Both services are free and could provide us with a container registry. (We also did some digging and found that the original Pangeo AWS BinderHub was using dockerhub too, not ECR.) We have created the mybinder-org organisation on quay.io and added some of the JupyterHub team members over there to store the images BinderHub produces. This decision unblocks this PR since we no longer need to create a role that can push to/pull from an ECR. We will instead create a robot account on quay.io and provide the username and password in encrypted config files read by the BinderHub helm chart (in a follow-up PR). |
…/sgibson91/mybinder.org-deploy into sgibson91-add-federation-member/aws-pangeo
…r/aws-pangeo Resolve a merge conflict
OVH has been having quite a bit of trouble with the registry because OVH's private container registry has a very small size limit and seems to have lots of performance problems, so I think perhaps oVH should try quay.io as well, and see how it goes. |
I have just pushed the following commit which set up cluster-autoscaler as an optional dependency of the helm chart (because this is another thing we don't get out of the box with EKS) I would've preferred to do this in a new PR, as this is starting to move out of terraform set-up and into helm chart config. But thought having the PR reflect the most recent status was more important. |
In the interest of getting unblocked, I've marked this as ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work! Left some comments :)
Does terraform apply run cleanly now?
|
||
eks_managed_node_group_defaults = { | ||
# Disabling and using externally provided security groups | ||
create_security_group = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me understand why we are managing security groups ourselves instead of having the module manage them for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think to make certain that they were explicitly connected to the VPC I'd created by passing the ID. There's nowhere else in the terraform config for the eks module that says "attach this eks cluster to this vpc" so I guess I'm uncertain if that happens automatically and figured explicit is better than implicit.
Co-authored-by: Yuvi Panda <[email protected]>
We will be able to use AWS SSH to gain access to nodes instead, c.f. https://infrastructure.2i2c.org/en/latest/howto/troubleshoot/ssh.html#aws
@yuvipanda Yes, but that's because we dropped the problematic config, favouring a quay.io registry instead of an AWS ECR, c.f. #2467 (comment) If there's no ECR to push to, then there's no need to create the role that was causing terraform to not apply cleanly. |
Ah that's awesome! Using quay seems fine to me. |
Update: I just clocked the phrase "credits have depleted" in this Discourse post about taking down the Pangeo AWS deployments https://discourse.pangeo.io/t/aws-pangeo-jupyterhubs-to-shut-down-friday-march-17/3228 So maybe we need to find somewhere else to test this? |
I've suggested some steps in #2556 (comment) |
To be clear, this PR has been tested. Yuvi and I both have/had access to Scott's AWS account for the Pangeo Binder deployment and this terraform code applies cleanly over there. The problem is that we can't continue to use that AWS account to develop and test now the credits have depleted. So for this PR specifically, I think I would like it merged or somehow resolved relatively soon so that 1) it doesn't need to continually be rebased, as it touches other terraform config to move them into subfolders, and 2) when we get access to another AWS account, someone can open a new PR and iterate as needed, without having to manage my PR as well. And I say someone because I don't believe I have the capacity to technically lead this. |
@sgibson91 Thanks for clarifying! I couldn't tell from all the comments what the state of this was. I don't feel able to review this without access to a deployment, but I'm happy for @yuvipanda to make the decision, and I can take on the follow-up work when we have AWS access. |
@manics we can give you access! |
Also, the PR only handles terraform, and a tiny bit of helm stuff regarding autoscaling. No deployment of BinderHub included here. |
@yuvipanda Thanks, but to avoid delaying this further, if you've already reviewed this I think we should merge it when you're happy with it. |
Great, done! We can iterate on this :) Thanks for working on this, @sgibson91! |
Thank you everybody for your input and patience! |
related to #2449
This PR adds terraform config to deploy a k8s cluster into an AWS account, and also adds the cluster-autoscaler helm chart as a dependency to the my binder helm chart as autoscaling is not something that comes out of the box with eks.
Currently the terraform code assumes no ECR and an external registry such as quay.io will be used instead. A storage bucket could be added in the future to store the terraform state. Also some input regarding which AWS account to use (currently relies on environment variables).
The below error was avoided by removing the problematic config. By choosing an external container registry over an ECR, there was no need to create the role. #2467 (comment)
Following https://github.com/hashicorp/learn-terraform-provision-eks-cluster and other bits and pieces, this is the terraform config I've come up with so far to deploy a cluster and supporting infrastructure on AWS.
Currently I get the following error when deploying this:
The policy in question was retrieved from this section of some as-yet-unmerged documentation, so it could be that it's just not correct to be using the Resource attribute here: