Merge pull request #3407 from yuvipanda/usage-logs
Allow enabling usage logs on GCS storage buckets & enable for LEAP
yuvipanda authored Nov 13, 2023
2 parents d19c9da + 69f498b commit 034e919
Showing 4 changed files with 119 additions and 13 deletions.
48 changes: 41 additions & 7 deletions docs/howto/features/buckets.md
@@ -97,6 +97,40 @@ user_buckets = {
}
```

## Enable access logs for objects in a bucket

### GCP

We may want to know *which* objects in a bucket are actually being accessed,
and when. While there is no systematic way to ask 'when was this
object last accessed?', we can enable [usage logs](https://cloud.google.com/storage/docs/access-logs)
that give hub administrators access to the raw data.

Note that we currently cannot help hub admins process these
logs - that is *their* responsibility. We can only *enable* this logging.

This can be enabled by setting the `usage_logs` parameter to `true` in `user_buckets`
for the appropriate bucket, and running `terraform apply`.

Example:

```terraform
user_buckets = {
"persistent": {
"usage_logs": true
}
}
```

Once enabled, you can find out which bucket the access logs will be sent
to with `terraform output usage_log_bucket`. By default, the access logs
are deleted after 30 days, to avoid them costing too much money.

The logs are in CSV format, with the fields [documented here](https://cloud.google.com/storage/docs/access-logs#format).
We suggest that interested hub admins [download the logs](https://cloud.google.com/storage/docs/access-logs#downloading)
and parse them as they wish - this is not something that we can currently help
much with.
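
As a rough illustration of the kind of parsing a hub admin might do, the sketch below counts requests per object and records the most recent access time. It assumes the downloaded log starts with a header row and includes the `time_micros` and `cs_object` columns described in the format documentation linked above. The function name `summarize_usage_log` and the sample rows are hypothetical, not part of any 2i2c tooling.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone


def summarize_usage_log(csv_text):
    """Return {object_name: (request_count, last_access_datetime_utc)}."""
    # Per object: [number of requests, latest timestamp in microseconds]
    summary = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        obj = row.get("cs_object", "")
        if not obj:
            continue  # skip rows that do not name an object
        entry = summary[obj]
        entry[0] += 1
        entry[1] = max(entry[1], int(row["time_micros"]))
    return {
        obj: (count, datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc))
        for obj, (count, micros) in summary.items()
    }


# Hypothetical sample with only a subset of the documented columns.
sample = (
    '"time_micros","cs_method","cs_bucket","cs_object"\n'
    '"1700000000000000","GET","example-bucket","data/a.zarr"\n'
    '"1700000060000000","GET","example-bucket","data/a.zarr"\n'
    '"1700000030000000","GET","example-bucket",""\n'
    '"1700000090000000","PUT","example-bucket","data/b.nc"\n'
)

result = summarize_usage_log(sample)
for name, (count, last_access) in sorted(result.items()):
    print(name, count, last_access.isoformat())
```

Real log files contain more columns than the sample; `csv.DictReader` simply ignores the ones this sketch does not read.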

## Allowing authenticated access to buckets from outside the JupyterHub

### GCP
@@ -108,15 +142,15 @@ GCS buckets, it can be used to allow arbitrary users to write to the bucket!

1. With your `2i2c.org` google account, go to [Google Groups](https://groups.google.com) and create a new Google Group with the name
"<bucket-name>-writers", where "<bucket-name>" is the name of the bucket
we are going to grant write access to.
we are going to grant write access to.

2. Grant "Group Owner" access to the community champion requesting this feature.
They will be able to add / remove users from the group as necessary, and
thus manage access without needing to involve 2i2c engineers.

3. In the `user_buckets` definition for the bucket in question, add the group
name as an `extra_admin_members`:

```terraform
user_buckets = {
"persistent": {
@@ -127,16 +161,16 @@ GCS buckets, it can be used to allow arbitrary users to write to the bucket!
}
}
```

Apply this terraform change to create the appropriate permissions for members
of the group to have full read/write access to that GCS bucket.

4. We want the community champions to handle granting / revoking access to
this google group, as well as produce community specific documentation on
how to actually upload data here. We currently do not have a template of
how end users can use this, but something can be stolen from the
how end users can use this, but something can be stolen from the
[documentation for LEAP users](https://leap-stc.github.io/leap-pangeo/jupyterhub.html#i-have-a-dataset-and-want-to-work-with-it-on-the-hub-how-do-i-upload-it)

## Granting access to cloud buckets in other cloud accounts / projects

Sometimes, users on a hub we manage need access to a storage bucket
53 changes: 53 additions & 0 deletions terraform/gcp/buckets.tf
@@ -11,6 +11,14 @@ resource "google_storage_bucket" "user_buckets" {
// Set these values explicitly so they don't "change outside terraform"
labels = {}

dynamic "logging" {
for_each = each.value.usage_logs ? [1] : []

content {
log_bucket = google_storage_bucket.usage_logs_bucket.name
}
}

dynamic "lifecycle_rule" {
for_each = each.value.delete_after != null ? [1] : []

@@ -26,6 +34,40 @@ resource "google_storage_bucket" "user_buckets" {
}
}

# Create GCS bucket that can store *usage* logs (access logs).
# Helpful to see what data is *actually* being used.
# https://cloud.google.com/storage/docs/access-logs
#
# We create this bucket unconditionally, because it costs nothing.
# It only costs if we actually enable this logging, which is done in
# per-bucket config.
#
# We only keep them for 30 days so they don't end up costing a
# ton of money
resource "google_storage_bucket" "usage_logs_bucket" {
name = "${var.prefix}-gcs-usages-logs"
location = var.region
project = var.project_id

labels = {}

lifecycle_rule {
condition {
age = 30
}
action {
type = "Delete"
}
}
}

# Provide access to GCS infrastructure to write usage logs to this bucket
resource "google_storage_bucket_iam_member" "usage_logs_bucket_access" {
bucket = google_storage_bucket.usage_logs_bucket.name
member = "group:[email protected]"
role = "roles/storage.objectCreator"
}

locals {
# Nested for loop, thanks to https://www.daveperrett.com/articles/2021/08/19/nested-for-each-with-terraform/
bucket_admin_permissions = distinct(flatten([
@@ -95,3 +137,14 @@ output "buckets" {
the full name of all GCS buckets created conveniently.
EOT
}

output "usage_log_bucket" {
value = google_storage_bucket.usage_logs_bucket.name
description = <<-EOT
Name of GCS bucket containing GCS usage logs (when enabled).
https://cloud.google.com/storage/docs/access-logs has more information
on GCS usage logs. It has to be enabled on a per-bucket basis - see
the documentation for the `user_buckets` variable for more information.
EOT
}
14 changes: 10 additions & 4 deletions terraform/gcp/projects/leap.tfvars
@@ -30,29 +30,35 @@ filestore_capacity_gb = 2048
user_buckets = {
"scratch-staging" : {
"delete_after" : 7,
"extra_admin_members" : []
"extra_admin_members" : [],
"usage_logs" : true,
},
"scratch" : {
"delete_after" : 7,
"extra_admin_members" : []
"extra_admin_members" : [],
"usage_logs" : true,
}
# For https://github.com/2i2c-org/infrastructure/issues/1230#issuecomment-1278183441
"persistent" : {
"delete_after" : null,
"extra_admin_members" : ["group:[email protected]"]
"extra_admin_members" : ["group:[email protected]"],
"usage_logs" : true,
},
"persistent-staging" : {
"delete_after" : null,
"extra_admin_members" : ["group:[email protected]"]
"extra_admin_members" : ["group:[email protected]"],
"usage_logs" : true,
}
# For https://github.com/2i2c-org/infrastructure/issues/1230#issuecomment-1278183441
"persistent-ro" : {
"delete_after" : null,
"extra_admin_members" : ["group:[email protected]"],
"usage_logs" : true,
},
"persistent-ro-staging" : {
"delete_after" : null,
"extra_admin_members" : ["group:[email protected]"],
"usage_logs" : true,
}
}

17 changes: 15 additions & 2 deletions terraform/gcp/variables.tf
@@ -253,7 +253,12 @@ variable "enable_network_policy" {
}

variable "user_buckets" {
type = map(object({ delete_after : number, extra_admin_members : optional(list(string), []), public_access : optional(bool, false) }))
type = map(object({
delete_after : number,
extra_admin_members : optional(list(string), []),
public_access : optional(bool, false),
usage_logs : optional(bool, false)
}))
default = {}
description = <<-EOT
GCS Buckets to be created.
@@ -273,7 +278,15 @@ variable "user_buckets" {
for the format this would be specified in.
'public_access', if set to true, makes the bucket fully accessible to
the public internet, without any authentication.
the public internet, without any authentication. This must be used with
*extreme* care as it can explode network egress costs immensely, and
should only be allowed on projects where 2i2c is *not* paying the cloud
bill to be recouped later.
'usage_logs', if set to true, will write GCS usage logs detailing
access to objects in this GCS bucket. https://cloud.google.com/storage/docs/access-logs
has more details. The bucket these will be written to can be determined by
the `usage_log_bucket` terraform output variable.
EOT
}
