Allow enabling usage logs on GCS storage buckets & enable for LEAP #3407

Merged
merged 3 commits into from
Nov 13, 2023
48 changes: 41 additions & 7 deletions docs/howto/features/buckets.md
@@ -97,6 +97,40 @@ user_buckets = {
}
```

## Enable access logs for objects in a bucket

### GCP

We may want to know *what* objects in a bucket are actually being accessed,
and when. While there is no systematic way to ask 'when was this
object last accessed?', we can instead enable [usage logs](https://cloud.google.com/storage/docs/access-logs)
that give hub administrators access to the raw data.

Note that we currently cannot help hub admins process these
logs - that is *their* responsibility. We can only *enable* this logging.
Review comment (Contributor) on lines +109 to +110:
I think clarifying responsibility as done here as part of introducing this feature is super important - nice!!!


This can be enabled by setting the `usage_logs` parameter to `true` in
`user_buckets` for the appropriate bucket, and running `terraform apply`.

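Example:

```terraform
user_buckets = {
  "persistent": {
    # `delete_after` is a required attribute; null means objects are never auto-deleted
    "delete_after": null,
    "usage_logs": true
  }
}
```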

Once enabled, you can find out which bucket the access logs will be sent
to with `terraform output usage_log_bucket`. By default, the access logs are
deleted after 30 days, to avoid them costing too much money.

The logs are in CSV format, with the fields [documented here](https://cloud.google.com/storage/docs/access-logs#format).
We suggest that interested hub admins [download the logs](https://cloud.google.com/storage/docs/access-logs#downloading)
and parse them as they wish - this is not something that we can currently help
much with.
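
As a rough illustration of what "parse them as they wish" could look like, here is a minimal Python sketch that answers 'when was this object last read?' from a downloaded log file. It assumes the field names listed in the linked format documentation (`time_micros`, `cs_method`, `cs_bucket`, `cs_object`) and that each file starts with a header row; the sample rows, bucket name and the `last_access_by_object` helper are made up for this example and are not part of this PR.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Made-up sample data for illustration only. Real files would be downloaded
# from the bucket reported by `terraform output usage_log_bucket`.
HEADER = (
    '"time_micros","c_ip","c_ip_type","c_ip_region","cs_method","cs_uri",'
    '"sc_status","cs_bytes","sc_bytes","time_taken_micros","cs_host",'
    '"cs_referer","cs_user_agent","s_request_id","cs_operation",'
    '"cs_bucket","cs_object"'
)
ROWS = [
    '"1699862400000000","192.0.2.1","1","","GET","/example-bucket/data/a.zarr",'
    '"200","0","1024","5000","storage.googleapis.com","","example-agent",'
    '"req-1","GET_Object","example-bucket","data/a.zarr"',
    '"1699866000000000","192.0.2.2","1","","GET","/example-bucket/data/a.zarr",'
    '"200","0","1024","4000","storage.googleapis.com","","example-agent",'
    '"req-2","GET_Object","example-bucket","data/a.zarr"',
    '"1699869600000000","192.0.2.3","1","","PUT","/example-bucket/data/b.nc",'
    '"200","2048","0","9000","storage.googleapis.com","","example-agent",'
    '"req-3","PUT_Object","example-bucket","data/b.nc"',
]
SAMPLE_LOG = "\n".join([HEADER, *ROWS])


def last_access_by_object(log_text):
    """Return {(bucket, object): most recent GET time} from usage-log CSV text."""
    latest = {}
    for row in csv.DictReader(io.StringIO(log_text)):
        # Only count reads of a specific object; skip writes and bucket-level calls.
        if row["cs_method"] != "GET" or not row["cs_object"]:
            continue
        # time_micros is microseconds since the Unix epoch.
        when = EPOCH + timedelta(microseconds=int(row["time_micros"]))
        key = (row["cs_bucket"], row["cs_object"])
        if key not in latest or when > latest[key]:
            latest[key] = when
    return latest


if __name__ == "__main__":
    for (bucket, obj), when in sorted(last_access_by_object(SAMPLE_LOG).items()):
        print(f"{bucket}/{obj} {when.isoformat()}")
```

Because logs are deleted after 30 days by default, a summary like this only covers that window; admins who need longer history would have to keep their own copies.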

## Allowing authenticated access to buckets from outside the JupyterHub

### GCP
@@ -108,15 +142,15 @@ GCS buckets, it can be used to allow arbitrary users to write to the bucket!

1. With your `2i2c.org` google account, go to [Google Groups](https://groups.google.com) and create a new Google Group with the name
"<bucket-name>-writers", where "<bucket-name>" is the name of the bucket
we are going to grant write access to.

2. Grant "Group Owner" access to the community champion requesting this feature.
They will be able to add / remove users from the group as necessary, and
thus manage access without needing to involve 2i2c engineers.

3. In the `user_buckets` definition for the bucket in question, add the group
name to `extra_admin_members`:

```terraform
user_buckets = {
"persistent": {
@@ -127,16 +161,16 @@ GCS buckets, it can be used to allow arbitrary users to write to the bucket!
}
}
```

Apply this terraform change to create the appropriate permissions for members
of the group to have full read/write access to that GCS bucket.

4. We want the community champions to handle granting / revoking access to
this google group, as well as produce community specific documentation on
how to actually upload data here. We currently do not have a template of
how end users can use this, but something can be stolen from the
[documentation for LEAP users](https://leap-stc.github.io/leap-pangeo/jupyterhub.html#i-have-a-dataset-and-want-to-work-with-it-on-the-hub-how-do-i-upload-it)

## Granting access to cloud buckets in other cloud accounts / projects

Sometimes, users on a hub we manage need access to a storage bucket
53 changes: 53 additions & 0 deletions terraform/gcp/buckets.tf
@@ -11,6 +11,14 @@ resource "google_storage_bucket" "user_buckets" {
  // Set these values explicitly so they don't "change outside terraform"
  labels = {}

  dynamic "logging" {
    for_each = each.value.usage_logs ? [1] : []

    content {
      log_bucket = google_storage_bucket.usage_logs_bucket.name
    }
  }

  dynamic "lifecycle_rule" {
    for_each = each.value.delete_after != null ? [1] : []

@@ -26,6 +34,40 @@ resource "google_storage_bucket" "user_buckets" {
}
}

# Create GCS bucket that can store *usage* logs (access logs).
# Helpful to see what data is *actually* being used.
# https://cloud.google.com/storage/docs/access-logs
#
# We create this bucket unconditionally, because it costs nothing.
# It only costs if we actually enable this logging, which is done in
# per-bucket config.
#
# We only keep them for 30 days so they don't end up costing a
# ton of money
resource "google_storage_bucket" "usage_logs_bucket" {
  name     = "${var.prefix}-gcs-usages-logs"
  location = var.region
  project  = var.project_id

  labels = {}

  lifecycle_rule {
    condition {
      # `age` is a number of days, not a duration string
      age = 30
    }
    action {
      type = "Delete"
    }
  }
}

# Provide access to GCS infrastructure to write usage logs to this bucket
resource "google_storage_bucket_iam_member" "usage_logs_bucket_access" {
  # Must reference the usage logs bucket defined above
  bucket = google_storage_bucket.usage_logs_bucket.name
  member = "group:[email protected]"
  role   = "roles/storage.objectCreator"
}

locals {
# Nested for loop, thanks to https://www.daveperrett.com/articles/2021/08/19/nested-for-each-with-terraform/
bucket_admin_permissions = distinct(flatten([
@@ -95,3 +137,14 @@ output "buckets" {
the full name of all GCS buckets created conveniently.
EOT
}

output "usage_log_bucket" {
  value       = google_storage_bucket.usage_logs_bucket.name
  description = <<-EOT
  Name of GCS bucket containing GCS usage logs (when enabled).

  https://cloud.google.com/storage/docs/access-logs has more information
  on GCS usage logs. It has to be enabled on a per-bucket basis - see
  the documentation for the `user_buckets` variable for more information.
  EOT
}
14 changes: 10 additions & 4 deletions terraform/gcp/projects/leap.tfvars
@@ -30,29 +30,35 @@ filestore_capacity_gb = 2048
 user_buckets = {
   "scratch-staging" : {
     "delete_after" : 7,
-    "extra_admin_members" : []
+    "extra_admin_members" : [],
+    "usage_logs" : true,
   },
   "scratch" : {
     "delete_after" : 7,
-    "extra_admin_members" : []
+    "extra_admin_members" : [],
+    "usage_logs" : true,
   }
   # For https://github.com/2i2c-org/infrastructure/issues/1230#issuecomment-1278183441
   "persistent" : {
     "delete_after" : null,
-    "extra_admin_members" : ["group:[email protected]"]
+    "extra_admin_members" : ["group:[email protected]"],
+    "usage_logs" : true,
   },
   "persistent-staging" : {
     "delete_after" : null,
-    "extra_admin_members" : ["group:[email protected]"]
+    "extra_admin_members" : ["group:[email protected]"],
+    "usage_logs" : true,
   }
   # For https://github.com/2i2c-org/infrastructure/issues/1230#issuecomment-1278183441
   "persistent-ro" : {
     "delete_after" : null,
     "extra_admin_members" : ["group:[email protected]"],
+    "usage_logs" : true,
   },
   "persistent-ro-staging" : {
     "delete_after" : null,
     "extra_admin_members" : ["group:[email protected]"],
+    "usage_logs" : true,
   }
 }

17 changes: 15 additions & 2 deletions terraform/gcp/variables.tf
@@ -253,7 +253,12 @@ variable "enable_network_policy" {
 }

 variable "user_buckets" {
-  type = map(object({ delete_after : number, extra_admin_members : optional(list(string), []), public_access : optional(bool, false) }))
+  type = map(object({
+    delete_after : number,
+    extra_admin_members : optional(list(string), []),
+    public_access : optional(bool, false),
+    usage_logs : optional(bool, false)
+  }))
   default     = {}
   description = <<-EOT
   GCS Buckets to be created.
@@ -273,7 +278,15 @@ variable "user_buckets" {
   for the format this would be specified in.

   'public_access', if set to true, makes the bucket fully accessible to
-  the public internet, without any authentication.
+  the public internet, without any authentication. This must be used with
+  *extreme* care as it can explode network egress costs immensely, and
+  should only be allowed on projects where 2i2c is *not* paying the cloud
+  bill to be recouped later.
+
+  'usage_logs', if set to true, will write GCS usage logs detailing
+  access to objects in this GCS bucket. https://cloud.google.com/storage/docs/access-logs
+  has more details. The bucket these will be written to can be determined by
+  the `usage_log_bucket` terraform output variable.
   EOT
 }
