From cb502fa66259f332f548bbca5ff3dc236bf58d4b Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Tue, 5 Mar 2024 21:53:01 -0800 Subject: [PATCH 01/36] Add topic documentation on cloud billing concepts Cloud billing is a vast area with many nuances. However, because of our usage patterns, we only need to understand a small-ish subset of it. This needs to be a shared understanding, so we all have a common base to reason from. Fixes https://github.com/2i2c-org/infrastructure/issues/3760 --- docs/index.md | 1 + docs/topic/billing/accounts.md | 155 +++++++++ docs/topic/billing/index.md | 11 + docs/topic/billing/things.md | 610 +++++++++++++++++++++++++++++++++ docs/topic/billing/tools.md | 58 ++++ 5 files changed, 835 insertions(+) create mode 100644 docs/topic/billing/accounts.md create mode 100644 docs/topic/billing/index.md create mode 100644 docs/topic/billing/things.md create mode 100644 docs/topic/billing/tools.md diff --git a/docs/index.md b/docs/index.md index c946599a54..5caba2f2c3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -88,6 +88,7 @@ Topic guides go more in-depth on a particular topic. :maxdepth: 2 topic/access-creds/index.md topic/infrastructure/index.md +topic/billing/index.md topic/monitoring-alerting/index.md topic/features.md topic/resource-allocation.md diff --git a/docs/topic/billing/accounts.md b/docs/topic/billing/accounts.md new file mode 100644 index 0000000000..8eb9066aff --- /dev/null +++ b/docs/topic/billing/accounts.md @@ -0,0 +1,155 @@ +# Billing accounts + +Cloud providers have different ways of attaching a source of money +(a credit card or cloud credits) to the resources we get charged for. + +## GCP + +A [Billing Account](https://cloud.google.com/billing/docs/how-to/manage-billing-account) +is the unit of billing in GCP. It controls: + +1. The **source** of funds - credit cards, cloud credits, other invoicing mechanisms +2. Detailed reports about how much money is being spent +3. Access control to determine who has access to view the reports, attach new projects to the billing account, etc +4. Configuration for where data about what we are being billed for should be + exported to. See the "Programmatic access to billing information" section + for more details. + +Multiple [Projects](https://cloud.google.com/storage/docs/projects) can +be attached to a single billing account, and projects can usually move +between billing accounts. All cloud resources (clusters, nodes, object storage, +etc) are always contained inside a Project. + +Billing accounts are generally not created often. Usually you would need to +create a new one for the following reasons: + +1. You are getting cloud credits from somewhere, and want a separate container + for it so you can track spend accurately. +2. You want to use a different credit card than what you use for your other + billing account. + +## AWS + +AWS billing is a little more haphazard than GCP's, and derived partially +from the well established monstrosity that is [LDAP](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol). + +We will not dig too deep into it, lest we wake some demons. However, since +we are primarily concerned with billing, understanding the following concepts +is more than enough. + +1. An [AWS Account](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_getting-started_concepts.html#account) + contains all the cloud resources one may create. +2. 
An [AWS Organization](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html) + is a way to *group* multiple AWS accounts together. One AWS Account in + any AWS Organization is deemed the [management account](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_getting-started_concepts.html#account). + The management account sets up access control, billing access, billing + data export, etc for the entire organization. + +Each AWS Account can have billing attached in one of two ways: + +1. Directly attached to the account, via a credit card or credits +2. If the account is part of an AWS Organization, it can use the billing + information of the *Management Account* of the organization it is a + part of. This is known as [consolidated billing](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/consolidated-billing.html) + +So billing information is *always* associated with an AWS Account, +*not* AWS Organization. This is a historical artifact - AWS Organizations +were only announced [in 2016](https://aws.amazon.com/about-aws/whats-new/2016/11/announcing-aws-organizations-now-in-preview/), +a good 9 years after AWS started. So if this feels a little hacked on, +it is because it is! + +A pattern many organizations (including 2i2c) follow is to have an AWS Account +that is completely empty and unused, except for being designated as the +management account. Billing information (and credits) for the entire +organization is attached to this account. This makes access control much +simpler, as the only people who get access to this AWS account are those +who need to handle billing. + +## Azure + +We currently have no experience or knowledge here, as all our Azure +customers handle billing themselves. This is not uncommon - most Azure +users want to use it because they already have a pre-existing *strong* +relationship with Microsoft, and their billing gets managed as part of +that. + +## Billing reports in the web console + +Cloud providers provide a billing console accessible via the web, often +with very helpful reports. While not automation friendly (see next section +for more information on automation), this is very helpful for doing ad-hoc +reporting as well as trying to optimize cloud costs. For dedicated clusters, +this may also be directly accessible to hub champions, allowing them to +explore their own usage. + +### GCP + +You can get a list of all billing accounts you have access to [here](https://console.cloud.google.com/billing) +(make sure you are logged in to the correct google account). Alternatively, +you can select 'Billing' from the left sidebar in the Google Cloud Console +after selecting the correct project. + +Once you have selected a billing account, you can access a reporting +interface by selecting 'Reports' on the left sidebar. This lets you group +costs in different ways, and explore that over different time periods. +Grouping and filtering by project and service (representing +what kind of cloud product is costing us what) are the *most useful* aspects +here. You can also download whatever you see as a CSV, although programmatic +access to most of the reports here is unfortunately limited. + +### AWS + +For AWS, "Billing and Cost Management" is accessed from specific AWS +accounts. If using consolidated billing, it must be accessed from the +'management account' for that particular organization (see previous section +for more information). 
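
If you are ever unsure which account in an organization is the management
account, it can be looked up programmatically. A minimal sketch using boto3,
assuming your credentials are allowed to call `organizations:DescribeOrganization`:

```python
# Sketch: look up the management account of the AWS Organization the
# current credentials belong to. boto3 still uses the older
# "MasterAccount" naming for these fields.
import boto3

org = boto3.client("organizations").describe_organization()["Organization"]
print("Management account id:   ", org["MasterAccountId"])
print("Management account email:", org["MasterAccountEmail"])
```
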
+ +You can access "Billing and Cost Management" by typing "Billing and Cost +Management" into the top search bar once you're logged into the correct +AWS account. AWS has a million ways for you to slice and dice these reports, +but the most helpful place to start is the "Cost Explorer Saved Reports". +This provides 'Monthly costs by linked account' and 'Monthly costs by +service'. Once you open those reports, you can further filter and group +as you wish. A CSV of reports can also be downloaded here. + +### Azure + +We have limited expertise here, as we have never actually had billing +access to any Azure account! + +## Programmatic access to billing information + +The APIs for getting access to billing data *somehow* seem to be the +least well developed parts of any cloud vendor's offerings. Usually +they take the form of a 'billing export' - you set up a *destination* +where information about billing is written, often once a day. And then +you query this location for billing information. This is in sharp contrast +to most other cloud APIs, where the cloud vendor has a *single source of +truth* you can just query. Not so for billing - there's no external API +access to the various tools they seem to be able to use to provide the +web UIs. This also means we can't get *retroactive* access to billing data - +if we don't explicitly set up export, we have **no** programmatic access +to billing information. And once we set up export, we will only have +access from that point onwards, not retrospectively. + +1. GCP asks you to set up [export to BigQuery](https://cloud.google.com/billing/docs/how-to/export-data-bigquery), + their sql-like big data product. In particular, we should prefer setting + up [detailed billing export](https://cloud.google.com/billing/docs/how-to/export-data-bigquery-tables/detailed-usage). + This is set up **once per billing account**, and **not once per project**. + And one billing account can export only to bigquery in one project, and + you can not easily move this table from one project to another. New + data is written into this once a day, at 5AM Pacific Time. + +2. AWS asks you to set up [export to an S3 bucket](https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html), + where CSV files are produced once a day. While we *could* then import + this [into AWS Athena](https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html) + and query it, directly dealing with S3 is likely to be simpler for + our use cases. This data is also updated daily, once enabled. + +3. Azure is unknown territory for us here, as we have only supported Azure + for communities that bring their own credits and do their own cost + management. + +Our automated systems for billing need to read from these 'exported' +data sources, and we need to develop processes for making sure that +we *do* have the billing data exports enabled correctly. \ No newline at end of file diff --git a/docs/topic/billing/index.md b/docs/topic/billing/index.md new file mode 100644 index 0000000000..1320fad675 --- /dev/null +++ b/docs/topic/billing/index.md @@ -0,0 +1,11 @@ +# Cloud Billing + +These topics provide us foundational common shared knowledge and +terminology to talk about cloud billing. 
+ +```{toctree} +:maxdepth: 2 +things +accounts +tools +``` \ No newline at end of file diff --git a/docs/topic/billing/things.md b/docs/topic/billing/things.md new file mode 100644 index 0000000000..440553a785 --- /dev/null +++ b/docs/topic/billing/things.md @@ -0,0 +1,610 @@ +# What exactly do cloud providers charge us for? + +There are a million ways to pay cloud providers money, and trying to +understand it all is [several](https://www.oreilly.com/library/view/cloud-finops/9781492054610/) +[books](https://www.porchlightbooks.com/product/cloud-cost-a-complete-guide---2020-edition--gerardus-blokdyk?variationCode=9780655917908) +[worth](https://www.abebooks.com/servlet/BookDetailsPL?bi=31001470420&dest=usa) [of](https://www.thriftbooks.com/w/cloud-data-centers-and-cost-modeling-a-complete-guide-to-planning-designing-and-building-a-cloud-data-center_caesar-wu_rajkumar-buyya/13992190/item/26408356/?srsltid=AfmBOootnd77xklpoo2MTy8n0np1b5oamDo5KgOg9dCD-0bKody2zEF14oU#idiq=26408356&edition=14835620) +[material](https://bookshop.org/p/books/reduce-cloud-computing-cost-101-ideas-to-save-millions-in-public-cloud-spending-abhinav-mittal/10266848). +However, a lot of JupyterHub (+ optionally Dask) clusters operate in +similar ways, so we can focus on a smaller subset of ways in which cloud +companies charge us. + +This document is designed not to be comprehensive, but to provide a +baseline understanding of *things in the cloud that cost money* so we can +have a baseline to reason from. + +## Nodes / Virtual Machines + +The unit of *compute* in the cloud is a virtual machine (a 'node' in +kubernetes terminology). It is assembled from the following +components: + +1. CPU (count and architecture) +2. Memory (just size) +3. Ephemeral disk space +4. Optional accelerators (like GPUs, TPUs, network accelerators, etc) + +Let's go through each of these components, and then talk about how they +get combined. + +### CPU + +CPU specification for a node is three components: + +1. Architecture ([x86_64 / amd64](https://en.wikipedia.org/wiki/X86-64) or [arm64 or aarch64](https://en.wikipedia.org/wiki/AArch64)) +2. Make (Intel vs AMD vs cloud provide made ARM processor, what year the CPU was released, etc) +3. Count (total number of processors in the node) + +The 'default' everyone assumes is x86_64 for architecture, and some sort of +recent intel processor for make. So if someone says 'I want 4 CPUs', we can +currently assume they just mean they want 4 x86_64 Intel CPUs. x86_64 intel +CPUs are also often the most expensive - AMD CPUs are cheaper, ARM CPUs are +cheaper still. So it's a possible place to make cost differences - *but*, +some software may not work correctly on different CPUs. While ARM support +has gotten a lot better recently, it's still not on par with x86_64 support, +particularly in scientific software - see [this issue](https://github.com/pangeo-data/pangeo-docker-images/issues/396) +for an example. AMD CPUs, being x86_64, are easier to switch to for cost +savings, but may still cause issues - see [this blog post](https://words.yuvi.in/post/pre-compiling-julia-docker/) +for how the specifics of what CPU are being used may affect some software. So +these cost savings are possible, but require work and testing. + +Count thus becomes the most common determiner of how much the CPU component +of a node costs for our usage pattern. + +### Memory + +Unlike CPU, memory is simpler. A GB is a GB! We pay for how much GB the +VM we request has, regardless of whether we use it or not. 
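
Together, CPU count and memory size determine the bulk of a node's compute
price, and both scale roughly linearly. A back-of-envelope sketch - the unit
prices below are made up for illustration, real ones vary by provider, region
and machine family:

```python
# Rough model: a node's compute price scales roughly linearly with its
# vCPU count and GB of memory. Unit prices are illustrative, not real.
HOURS_PER_MONTH = 730

def monthly_node_cost(vcpus, memory_gb,
                      usd_per_vcpu_hour=0.03,  # assumed unit price
                      usd_per_gb_hour=0.004):  # assumed unit price
    hourly = vcpus * usd_per_vcpu_hour + memory_gb * usd_per_gb_hour
    return hourly * HOURS_PER_MONTH

# A memory-heavy 4 CPU / 32GB node under these assumed prices:
print(f"${monthly_node_cost(4, 32):.2f} / month")  # -> $181.04 / month
```

Note that we pay for the full provisioned 32GB whether users fill it or not,
which is why right-sizing nodes matters so much for cost.
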
### (Ephemeral) Disk

Each node needs some disk space that lives only as long as the node is up,
to store:

1. The basic underlying OS that runs the node itself (Ubuntu, [COS](https://cloud.google.com/container-optimized-os/docs) or similar)
2. Storage for any docker images that need to be pulled and used on the node
3. Temporary storage for any changes made to the *running containers* in a
   pod. For example, if a user runs `pip install` or `conda install` in a
   user server, that will install additional packages into the container
   *temporarily*. This will use up space in the node's disk as long as the
   container is running.
4. Temporary storage for logs
5. Any [emptyDir](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir)
   volumes in use on the node. These volumes are deleted when the pod using
   them dies, so are temporary. But while they are in use, there needs to
   be 'some place' for this directory to live, and it is on the node's
   ephemeral disk
6. For *ephemeral hubs*, their home directories are also stored using the
   same mechanism as (3) - so they take up space on the ephemeral disk as
   well. Most hubs have a *persistent* home directory set up, detailed below.

These disks can be fairly small - 100GB at the largest, but we can get away
with far smaller disks most of the time. The performance of these disks
matters primarily during initial docker image pull time - the faster the
disk, the faster large docker images can be pulled. But it can be really
expensive to make the base disks super fast SSDs, so often a balanced
middling performance tier is more appropriate.

### Accelerators (GPU, etc)

Accelerators are a form of [co-processor](https://en.wikipedia.org/wiki/Coprocessor) -
dedicated hardware that is more limited in functionality than a general
purpose CPU, but much, *much* faster. [GPUs](https://en.wikipedia.org/wiki/Graphics_processing_unit)
are the most common ones in use today, but [TPUs](https://cloud.google.com/tpu?hl=en) and
[FPGAs](https://en.wikipedia.org/wiki/Field-programmable_gate_array) are also
available in the cloud. Depending on the cloud provider, these can either
be attached to *any* node type (GCP), or available only on a special set
of node types & sizes (AWS & Azure).

GPUs are the accelerators most commonly used with JupyterHubs, and are often the most
expensive as well! Accidentally leaving a GPU running for days is the
second-easiest way to get a huge AWS bill (you'll meet NAT / network egress, the primary culprit, later in this document). So they are usually segregated
out into their own node pool that is spawned only for servers that require
GPU use, and shut down afterwards.

### Combining resource sizes when making a node

While you can often make 'custom' node sizes, *most commonly* you pick
one from a predefined list offered by the cloud provider ([GCP](https://cloud.google.com/compute/docs/machine-resource),
[AWS](https://aws.amazon.com/ec2/instance-types/),
[Azure](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes)).
These machine types are primarily pre-defined sets of CPU (of a particular
generation, architecture and count), RAM and optionally
an accelerator (like GPU).

For the kinds of workloads we have, usually we pick something with a high
memory to CPU ratio, as notebooks (and dask workers) tend to be mostly
memory bound, not CPU bound.
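
As a concrete sketch of that selection logic - the names below are real GCP
machine types, but the hourly prices are illustrative, not actual quotes:

```python
# Sketch: pick the cheapest predefined machine type that satisfies a
# memory-heavy request. Names are real GCP machine types; prices are
# illustrative only.
MACHINE_TYPES = [
    # (name, vcpus, memory_gb, assumed_usd_per_hour)
    ("n2-standard-4", 4, 16, 0.19),
    ("n2-highmem-4", 4, 32, 0.26),
    ("n2-highmem-8", 8, 64, 0.52),
]

def cheapest_fit(min_vcpus, min_memory_gb):
    fits = [m for m in MACHINE_TYPES
            if m[1] >= min_vcpus and m[2] >= min_memory_gb]
    return min(fits, key=lambda m: m[3], default=None)

# A typical notebook-heavy request: few CPUs, lots of memory
print(cheapest_fit(min_vcpus=4, min_memory_gb=30))  # -> n2-highmem-4
```
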
### Autoscaling makes nodes temporary

Nodes in our workloads are *temporary* - they come and go with user
needs, and get replaced when version upgrades happen. They have
names like `gke-jup-test-default-pool-47300cca-wk01` - notice
the random string of characters towards the end. The [cluster autoscaler](https://github.com/kubernetes/autoscaler/)
is the primary reason for a new node to come into existence or disappear,
along with ocassional manual node version upgrades.

Since nodes are often the biggest cloud cost for communities, and the cluster
autoscaler is the biggest determiner of how / when nodes are used,
understanding its behavior is very important in understanding cloud costs.

Documentation: [GCP](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler), [AWS](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md), [Azure](https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler?tabs=azure-cli)

## Home directory

By default, we provide users with a persistent, POSIX compatible fileystem mounted
under `$HOME`. This persists across user server restarts, and is used to store
code (and sometimes data).

Because this has to persist regardless of wether the user is currently active or
not, cloud providers will charge us for it regardless of wether the user is actively
using it or not. Hence, choosing what cloud products we use here becomes crucial - if
we pick something that is a poor fit, storage costs can easily dwarf everything else.

The default from z2jh is to use [Dynamic Volume Provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/),
which sets up a *separate, dedicated* block storage device (think of it like a hard drive)
([EBS](https://aws.amazon.com/ebs/) on AWS, [Persistent Disk](https://cloud.google.com/persistent-disk?hl=en) on GCP and [Disk Storage](https://azure.microsoft.com/en-us/products/storage/disks) on Azure)
*per user*. This has the following advantages:

1. No single user can take over the entire drive or use up extremely large amounts of
   space, as the max size is enforced by the cloud provider.
2. Performance is pretty fast, since the disks are dedicated to a single user.
3. It's incredibly easy to set up, as kubernetes takes care of everything for us.

However, it has some massive disadvantages:

1. The variance in how much users use their persistent directory is *huge*. Our observation is that *most* users
   use only a little bit of storage, and a few users use a lot of storage. By allocating the *same* amount of disk
   space to each user, a lot of it is wasted, unused space. But, we **pay** for all this unused space! Sometimes this
   means we're paying for about 90% of disk space that's unused!

That single cost-related disadvantage is big enough that we **do not** use the default home directory set up
at all! Instead, we use some sort of *shared* storage - a *single* storage device is provisioned for *all* the
users on a hub, and then *subdirectories* are mounted via [NFS](https://en.wikipedia.org/wiki/Network_File_System).
This is more complicated and offers lower performance, but the *cost model* of a pooled disk is *so much cheaper*
that it is worth it.

In the past, we have done this by managing an NFS *server* manually. While cheaper on some cloud providers,
it requires specialized knowledge and maintenance. Instead, we use the following products on various
cloud providers:

1. 
[Elastic Filesystem](https://aws.amazon.com/efs/) (EFS) on AWS

   There is no upper limit on size, and we are charged *only for what we use*. This is the most suitable shared storage product for our needs.

2. [Google Filestore](https://cloud.google.com/filestore?hl=en) on GCP

   We have to specify a maximum storage capacity, **which can never be reduced**, only increased. We pay for the maximum storage capacity, regardless of what we use.

   The smallest filestore we can provision is 1TiB, which is overkill for many of our communities. But it's the best we can do on GCP.

3. [Azure Files](https://azure.microsoft.com/en-us/products/storage/files) on Azure

   We have to specify a maximum storage capacity, which can be adjusted up or down. We pay for the maximum storage capacity, regardless of what we use.

## Network Fees

Back in the old days, 'networking' implied literally running physical
cables between multiple computers, having an [RJ45 crimping tool](https://serverfault.com/questions/1048792/how-does-the-rj45-crimping-tool-work)
in your pocket, and being able to put physical tags on cables so you
could keep track of what the other end of the cable was plugged into. Changing
network configuration often meant literally unplugging physical wires from
one device and plugging them into another, or manually logging into specific
network devices and changing their configuration.

When doing things 'in the cloud', while all these physical tasks still
need to be performed, we the users of the cloud are not privy to any of that.
Instead, we operate in the world of [software defined networking](https://www.cloudflare.com/learning/network-layer/what-is-sdn/) (or SDN).
The fundamental idea here is that instead of having to physically connect
cables around, you can define network configuration in software somehow,
and it 'just works'. Without SDN, instead of just creating two VMs in the
same network from a web UI and expecting them to communicate with each other,
you would have to email support, and someone would have to go physically plug
in a cable from somewhere to somewhere else. Instead of being annoyed that
'new VM creation' takes 'as long as 10 minutes', you would be *happy* that
it only took 2 weeks.

Software Defined Networking is one of the core methodologies that make the
cloud as we know it possible.

Almost **every interaction in the cloud involves transmitting data over the network**.
Depending on the source and destination of this transmission, **we may get charged for it**,
so it's important to have a good mental model of how network costs work when
thinking about cloud costs.

While the overall topic of network costs in the cloud is complex, thankfully
we can get away with a simplified model when operating just JupyterHubs.

### Network Boundaries

There are two primary aspects that determine how much a network
transmission costs:

1. The total amount of data transmitted
2. The **network boundaries** it crosses

(1) is fairly intuitive - the more data that is transmitted, the
more it costs. Usually, this is measured in 'GiB of data transmitted',
rather than smaller units.

(2) is determined by the **network boundaries** a data packet has to
cross from source to destination. Network boundaries can be categorized
in a million ways. But given we primarily want to model network costs,
we can consider a boundary to be crossed when:

1. The source and destination are in different zones of the same region of the cloud provider
2. 
The source and destination are in different regions of the same cloud provider +3. One of the source or destination is *outside* the cloud provider + +If the source and destination are within the same zone, no network cost +boundaries are crossed and usually this costs no money. As a general rule, +when each boundary gets crossed, the cost of the data transmission +goes up. Within a cloud provider, transmission between regions is often +priced based on the distance between them - australia to +the US is more expensive than canada to the US. Data transmission from within a cloud provider to outside is the most expensive, and is also often +metered by geographical location. + +This also applies to data present in any *cloud services* as well, although +*sometimes* in those cases transmission is free even in a different zone in +the same region. For example, with [AWS S3 data transfer](https://aws.amazon.com/s3/pricing/) (under the 'Data Transfer' tab), accessing +S3 buckets **in the same region** is free regardless of what zone you are in. +But, with [GCP Filestore](https://cloud.google.com/filestore/pricing#network_pricing_for_traffic_between_the_client_and), you +*do* pay network fees if you are mounting it from a different zone even if +it is in the same region. So as a baseline, we should assume that any +traffic crossing network *zone* boundaries will be charged, unless explicitly +told otherwise. + +Primarily, this means we should design and place services as close to each +other as possible to reduce cost. + +### Ingress and Egress fees + +Cloud providers in general don't want you to leave them, and tend to +enforce this lock-in by charging an [extremely high network egress cost](https://blog.cloudflare.com/aws-egregious-egress). +Like, 'bankrupt your customer because they tried to move a bucket of data from +one cloud provider to another' high egress cost. Cloudflare estimates that +in US data centers, it's [about 80x](https://blog.cloudflare.com/aws-egregious-egress) what +it actually costs the big cloud providers. + +So when people are 'scared' of cloud costs, it's **most often** network +egress costs. It's very easy to make a bucket public, have a single other person +try to access it from another cloud provider, and get charged [a lot of money](https://2i2c.freshdesk.com/a/tickets/1105). + +While scary, it's also fairly easy for us to watch out for. +A core use of JupyterHub is to put the compute near where the data is, so +*most likely* we are simply not going to see any significant egress fees +at all, as we are accessing data in the same region / zone. Object storage +access is the primary point of contention here for the kind of infrastructure +we manage, so as long as we are careful about that we should be ok. More +information about managing this is present in the object storage section of +this guide. + + +## Object storage + +[Object Storage](https://en.wikipedia.org/wiki/Object_storage) is one of the +most important paradigm shifts of the 'cloud' revolution. Instead of data +being accessed as *files* via POSIX system calls (like `read`, `write`, etc), +data is stored as *objects* and accessed via HTTP methods over the network +(like `GET`, `PUT`, etc). This allows for *much* cheaper storage, and much +higher scalability at the cost of a little more complexity (as all data +access becomes a network call). 
While home directories may store users' code,
the data itself is either already stored in object storage, or is
going to be stored that way in the near future.

While object storage has been around since the 90s, it was heavily
popularized by the introduction of [AWS S3](https://aws.amazon.com/s3/) in
2006. A lot of modern object storage systems are thus described as 'S3
compatible', meaning that any code that can access data from S3 can also
access data from these systems automatically.

### Storage fees, operation fees and class

When storing data in object storage, you have to select a *storage class*
first. This determines both the *latency* of your data access as well as
cost, along with some additional constraints (like minimum storage duration)
for some classes. While this can be set on a *per object* basis, usually it
is set at the *bucket* level instead.

Once you have picked your storage class, the cost comes down to
the **amount of data stored** and the **number of operations performed** on
these objects. Amount of data stored is simple - you pay a $ amount per GB
of data stored per month.

There's a cost associated with each *operation* (GET, PUT, LIST,
etc) performed against these objects. Mutating and listing operations usually
cost an order of magnitude more than simple read operations. But overall the fees
are not that high - on AWS, it's $0.0004 for every 1,000 read requests
and $0.005 for every 1,000 write requests. Given the patterns in which most
JupyterHubs operate, this is not a big source of fees.

As a general rule across storage classes, the cheaper it is
to *store*, the more expensive it is to *access*. The following table of costs
from S3 illustrates this - and this is not *all* the classes available!

| Class | Storage Fee | Access Fee |
| - | - | - |
| Standard | $0.023 / GB | $0.0004 / 1,000 req |
| Standard - Infrequent Access | $0.0125 / GB | $0.001 / 1,000 req |
| Glacier Deep Archive | $0.00099 / GB | $(0.0004 + 0.10) / 1,000 req + $0.02 per GB |

Picking the right storage class based on usage pattern is hence important
to optimizing cost. This is generally true for most cloud providers.
A full guide to picking storage class is out of scope for this document
though.

Pricing link: [AWS S3](https://aws.amazon.com/s3/pricing/), [GCP GCS](https://cloud.google.com/storage/pricing), [Azure Blob Storage](https://azure.microsoft.com/en-us/pricing/details/storage/blobs/)

### Egress fees

As discussed in the network fees section, *egress* fees are how cloud
providers often lock you in. They want you to store all your data in them,
particularly in their object stores, and charge an inordinate amount of money
to take that data with you somewhere else. This is egregious enough now that
[regulators are noticing](https://www.cnbc.com/2023/10/05/amazon-and-microsofts-cloud-dominance-referred-for-uk-competition-probe.html),
and that is probably the long term solution.

In the meantime, the answer is fairly simple - just **never** expose your
object store buckets to the public internet! Treat them as 'internal' storage
only. JupyterHubs should be put in the same region as the
object store buckets they need to access, so this access can be free of cost,
even avoiding the cross-region charges.

If you *must* expose object store data to the public internet, consider
[getting an OSN Pod](https://www.openstoragenetwork.org/) (NSF funded). They
provide fully S3-compatible object stores with zero egress fees and fairly good
performance. 
And perhaps, if you live in a rich enough country, let your
regulator know this is an anti-competitive practice that needs to end.

## LoadBalancers

Each cluster uses a *single* kubernetes service of [type LoadBalancer](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer)
to get any HTTP(S) traffic into the cluster. This creates a single
"Load Balancer" in the cloud provider, providing us with a static IP (GCP & Azure) or CNAME (AWS) to direct our traffic into.

After traffic gets into
the cluster, it is routed to various places via kubernetes [ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/)
objects, backed by
the community-maintained [ingress-nginx](https://github.com/kubernetes/ingress-nginx)
provider (*not* the VC-backed [nginx-ingress](https://docs.nginx.com/nginx-ingress-controller/) provider). So users access JupyterHub,
Grafana, as well as any other service through this single endpoint.

Each Load Balancer costs money as long as it exists, and there is a per GB
processing charge as well. Since we don't really have a lot of data coming
*in* to the JupyterHub (as only user sessions are exposed via the browser),
the per GB charge usually doesn't add up to much (even uploading 1 terabyte of data (which will be very slow) will only cost between 5 - 14 USD).

1. [AWS ELB](https://aws.amazon.com/elasticloadbalancing/pricing/?nc=sn&loc=3)

   We currently use [Classic Load Balancers](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/introduction.html) as that is the kubernetes
   default. Pricing varies by region, but is mostly $0.025 per hour + $0.008
   per GB.

2. [GCP Cloud LoadBalancing](https://cloud.google.com/vpc/network-pricing#lb)

   Pricing varies by region, but is about $0.0327 per hour per load balancer
   (listed as 'forwarding rule' in the pricing page, as our load balancers only have one forwarding rule each) +
   $0.01046 per GB processed.

3. [Azure LoadBalancer](https://azure.microsoft.com/en-us/pricing/details/load-balancer/)

   Pricing varies by region, but is about $0.025 per hour per load balancer
   (listed as a 'rule' in the pricing page, as our load balancers only have one rule each) + $$0.005 per GB processed.

## Access to the external internet

Users on JupyterHubs generally expect to be able to access the external
internet, rather than just internal services. This requires one of two
solutions:

1. Each **node** of our kubernetes cluster gets an **external IP**, so
   each node has direct access to the external internet.
2. A single [Network Address Translation](https://en.wikipedia.org/wiki/Network_address_translation) (NAT)
   device is set up for the cluster, and all external traffic goes through
   this.

There are cost (and sometimes quota) considerations for either option.

### Public IP per node

If each node gets a public IP, any traffic *to* the external internet looks
like it is coming from the node. This has two cost considerations:

1. There is **no charge** for any data coming in on any of the cloud providers,
   as it is counted purely as 'data ingress'. So if a user is downloading
   terabytes of data from the public internet, it's almost completely free.
   Sending data out still costs money (as metered by 'data egress'), but
   since our usage pattern is more likely to download massive data than
   upload it, data egress doesn't amount to much in this context.

2. 
On some cloud providers, there is a public IP fee that adds to the
   cost of *each* node. This used to be free, but started costing money
   only recently (starting in about 2023).

   | Cloud Provider | Pricing | Link |
   | - | - | - |
   | GCP | $0.005 / hr for regular instances, $0.003 / hr for pre-emptible instances | [link](https://cloud.google.com/vpc/network-pricing#ipaddress) |
   | AWS | $0.005 / hr | [link](https://aws.amazon.com/vpc/pricing/) (see 'Public IPv4 Addresses' tab) |
   | Azure | $0.004 / hr | [link](https://azure.microsoft.com/en-us/pricing/details/ip-addresses/) (Dynamic IPv4 Address for "Basic (ARM)") |

This is the *default* mode of operation for our clusters, and it mostly works
ok! We pay for each node, but do not have to worry about users racking up big
bills because they downloaded a couple terabytes of data. Since the originating IP of each network request is the *node* the user pod is on,
this also helps with external services (such as on-prem NASA data
repositories) that throttle requests based on *originating IP* - all users
on a hub are less likely to be penalized due to one user exhausting this
quota.

There are two possible negatives when using this:

1. Each of our nodes is directly exposed to the public internet, which means
   theoretically our attack surface is larger. However, these are all
   cloud-managed kubernetes nodes without any real ports open to the public,
   so not a particularly large risk factor.
2. Some clouds have a separate quota system for public IPs, so this may
   limit the total size of our kubernetes cluster even if we have other
   quotas.

### Cloud NAT

The alternative to having one public IP per node is to use a [Network Address Translation](https://en.wikipedia.org/wiki/Network_address_translation)
(NAT) service. All external traffic from all nodes will go out via this
single NAT service, which usually provides a single (or a pool of) IP addresses
that this traffic looks to originate from. Our kubernetes nodes are not
exposed to the public internet at all, which theoretically adds another
layer of security as well.

![Dr. Who Meme about Cloud NAT](https://hackmd.io/_uploads/H1IYevYi6.png)

(one of the many memes about Cloud NAT being expensive, from [QuinnyPig](https://twitter.com/QuinnyPig/status/1357391731902341120). Many seem far more violent. See [this post](https://www.lastweekinaws.com/blog/the-aws-managed-nat-gateway-is-unpleasant-and-not-recommended/) for more information.)

However, it is the **single most expensive** thing one can do on any cloud
provider, and must be avoided at all costs. Instead of data *ingress*
being free, it becomes pretty incredibly expensive.

| Cloud Provider | Cost | Pricing Link |
| - | - | - |
| GCP | $(0.044 + 0.005) per hour + $0.045 per GiB | [link](https://cloud.google.com/nat/pricing) |
| AWS | $0.045 per hour + $0.045 per GiB | [link](https://aws.amazon.com/vpc/pricing/) |
| Azure | $0.045 per hour + $0.045 per GiB | [link](https://azure.microsoft.com/en-us/pricing/details/azure-nat-gateway/) |

Data *ingress* costs go from zero to $0.045 per GiB. So a user could download
a terabyte of data, and it costs about $46 instead of 0$. This can escalate
real quick, because this cost is *invisible* to the user.

The shared IP also causes issues with external services that throttle per IP.
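
To make that concrete, here is a back-of-envelope comparison of the two
setups for a single-node hub where one user downloads 1 TiB from the public
internet, using the approximate prices from the tables above:

```python
# Back-of-envelope: one node up for a month, one user downloading 1 TiB
# from the public internet. Prices are the approximate ones quoted in
# the tables above.
HOURS_PER_MONTH = 730
TIB_IN_GIB = 1024

# Public IP per node: ingress is free, we only pay the per-IP fee
public_ip_cost = HOURS_PER_MONTH * 0.005
print(f"Public IP per node: ${public_ip_cost:.2f}")  # -> $3.65

# Cloud NAT: we pay NAT hours *and* $0.045 for every GiB of ingress
nat_cost = HOURS_PER_MONTH * 0.045 + TIB_IN_GIB * 0.045
print(f"Cloud NAT:          ${nat_cost:.2f}")        # -> $78.93
```
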
+ +So in general **we should completely avoid using NATs**, unless forced to by +external constraints - primarily, some organizations might force a 'no public +IPs on nodes' policy. While this may be useful policy in cases where individual +users are creating individual VMs to do random things, it is counterproductive +in our set up - and only makes things expensive. + +## Other Persistent Disk Storage + +We use dedicated, always on block storage for the following parts of +the infrastructure: + +1. The hub database + + This stores the `sqlite3` database that the JupyterHub + process itself uses for storing user information, server information, etc. + Since some of our hubs rely on admins explicitly adding users to the + hub database via the hub admin, this disk is important to preserve. + + It is usually 1GiB in size, since that is the smallest disk we can get + from many of these cloud providers. However, on *Azure*, we also store + the hub logs on this disk for now - so the disk size is much larger. + + High performance is not a consideration here, so we usually try to use + the cheapest (hence slowest disk) possible. + +2. Prometheus disk + + We use [prometheus](https://prometheus.io/) to store time series data, + both for operational (debugging) as well as reporting (usage metrics) + purposes. We retain 1 year of data, and that requires some disk space. + We try to not use the slowest disk type, but this doesn't require the + fastest one either. + + Currently, most prometheus disks are bigger than they need to be. We + can slim them down with effort if needed. + +3. Grafana disk + + Grafana also uses a sqlite database the same way JupyterHub does. While + most of the dashboards we use are [managed in an external repo](https://github.com/jupyterhub/grafana-dashboards), + the database still holds user information as well as manually set up + dashboards. This does not need to be fast, and we also try to have the + smallest possible disk here (1GiB) + +4. Optional per-user disk for dedicated databases + + One of the features we offer is a database per user, set up as a sidecar. + Most databases do not want to use NFS as the backing disk for storing + their data, and need a dedicated block storage device. One is created + *for each user* (via [Kubernetes Dynamic Volume Provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/)), + and this can get quite expensive + (for reasons outlined in the "Home Directory Storage" section above). + +Pricing is primarily dependent on the expected disk performance (measured as throughput and IOPS) as well as the disk size. While there's a lot of +options available, we'll list what we most commonly use in this table. + +| Cloud Provider | Cost | Pricing Page | +| - | - | - | +| GCP | Standard (Slow) - $0.048 / GB / month, Balanced (Most common) - $0.12 / GB / month | [link](https://cloud.google.com/compute/disks-image-pricing#disk) | +| AWS | gp2 (most common) - $0.1 / GB / month | [link](https://aws.amazon.com/ebs/pricing/) | +| Azure | ??? | [link](https://azure.microsoft.com/en-us/pricing/details/managed-disks/) | + + +## Kubernetes Cluster management + +Cloud providers charge for maintaining the Kubernetes cluster itself, +separate from the nodes used to run various things in it. This is usually a *fixed* per hour cost per cluster. + +1. [AWS EKS](https://aws.amazon.com/eks/pricing/) + + Costs $0.10 per hour flat. + +2. 
[GCP GKE](https://cloud.google.com/kubernetes-engine/pricing#standard_edition)

   Is a bit messier - it costs \$0.10 per hour, but is structured in such a way
   that one single-zone (non highly available) cluster is free per GCP project.
   While this is a cost saving of about 75$ per month, in most cases we
   use *regional* (multi-zonal) clusters that are not free. This makes
   master upgrades much more painless, and reduces the overall amount of downtime
   the hub has. So our default here is to pay the $75 a month for the
   regional cluster.

   GKE also has an 'autopilot' edition that we have tried to use in the
   past and decided against using, for various restrictions it has on
   what kind of workloads it can run.

3. [Azure AKS](https://azure.microsoft.com/en-us/pricing/details/kubernetes-service/)

   Our clusters are currently on the *Free* tier, so they don't cost any money.
   Their documentation suggests this is not recommended for clusters with
   more than 10 nodes, but we've generally not had any problems with it
   so far.

## Image Registry

We primarily have people use [quay.io](https://quay.io) to store their
docker images, so usually it doesn't cost anything. However, when we have
dynamic image building enabled, we use the native container image registry
of the cloud provider we use. Usually they charge for storage based on size,
and we will have to eventually develop methods to clean unused images out
of the registry. Exposing built images publicly would subject us to big
network egress fees, so we usually do not make the built images available
anywhere outside of the region / zone in which they are built.

Pricing links: [GCP](https://cloud.google.com/artifact-registry/pricing), [AWS](https://aws.amazon.com/ecr/pricing/), [Azure](https://azure.microsoft.com/en-us/pricing/details/container-registry/)

## Logs

Logs are incredibly helpful when debugging issues, as they tell us exactly
what happened and when. The problem usually is that there are a *lot* of logs,
and they need to be stored & indexed in a way that is helpful for a human.
All the cloud providers offer some (possibly expensive) way to store and
index logs, although whether it is actually helpful for a human is up for
debate. The following resources produce logs constantly:

1. The Kubernetes control plane itself (all API requests, nodes joining and leaving, etc)
2. The various system processes running on each node ([kernel ring buffer](https://man7.org/linux/man-pages/man1/dmesg.1.html),
   container runtime, mount errors / logs, systemd logs, etc)
3. Kubernetes components on each node (`kubelet`, various container runtimes, etc)
4. All kubernetes pods (we access these with `kubectl logs` while they are active, but
   the logging system is required to store them for use after the pod has been deleted)
5. [Kubernetes Events](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/)
6. The Cluster Autoscaler (to help us understand autoscaling decisions)
7. Probably more. If you think 'this thing should have some logs', it probably does,
   if you can find it.

Google Cloud has [Stackdriver](https://cloud.google.com/products/operations?hl=en),
which is already pre-configured to collect logs from all our clusters.
AWS supports [CloudWatch](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html)
for logs, but it isn't enabled by default. Azure probably has at least one product
for logs.
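
As an example of what programmatic access can look like, here is a sketch
that pulls recent cluster autoscaler decisions out of Stackdriver with the
`google-cloud-logging` client - the project name is hypothetical, and the
filter assumes GKE's autoscaler visibility logs are enabled:

```python
# Sketch: read recent cluster-autoscaler decision logs from Google
# Cloud Logging (Stackdriver). The project name is hypothetical, and
# the filter assumes autoscaler visibility logging is turned on.
from google.cloud import logging

client = logging.Client(project="some-gcp-project")
log_filter = (
    'resource.type="k8s_cluster" AND '
    'logName:"cluster-autoscaler-visibility"'
)
for entry in client.list_entries(filter_=log_filter, max_results=10):
    print(entry.timestamp, entry.payload)
```
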
diff --git a/docs/topic/billing/tools.md b/docs/topic/billing/tools.md
new file mode 100644
index 0000000000..e0682b9dca
--- /dev/null
+++ b/docs/topic/billing/tools.md
@@ -0,0 +1,58 @@
# Cloud billing tools

There are a few tools we can use to make cloud billing easier to understand,
reason about and use.

## Using tags to separate costs

Cloud billing reports by default allow us to see how much each **service**
was charged for - for example, we could know that VMs cost us 589\$.
But wouldn't it be nice to know how much of that 589\$ was for core nodes,
how much was for user nodes, and how much for dask workers? Or
hypothetically, if we had multiple profiles that were accessible only to
certain subsets of users, which profiles had cost how much money?

This is actually possible if we use **cost tags**, which can be attached
to *most* cloud resources (**not** kubernetes resources). Once attached
to specific sets of resources, you can filter and group by them in the
web reporting UI or in programmatically in the billing export. This will
count *all* costs emnating from any resource tagged with that tag. For
example, if a node is tagged with a particular tag, the following separate
things will all have that tag associated:

1. The amount of memory allocated to that node
2. The amount of CPU allocated to the node
3. (If tagged correctly) The base disk allocated to that node
4. Any network costs accured by actions of processes on that node (subject
   to some complexity)

However, they have quite a few limitations as well:

1. They are attached to *cloud* resources, *not* kubernetes resources. By
   default, multiple users can be on a single node, so deeper granularity
   of cost attribution is not possible.
2. Some cloud resources are shared to reduce cost (home directories are a
   prime example), and detailed cost attribution there needs to be done
   via other means.
3. This does not apply retroactively, you must already have the tags set up
   for costs to be accured to them.

Despite these limitations, cost tagging is extremely important to get a good
understanding of what costs how much money. With a well-tagged system, we
can make reasoned trade-offs about cost rather than fiddling in the dark.

Resource links: [GCP Labels](https://cloud.google.com/compute/docs/labeling-resources), [AWS Cost Allocation Tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html), [Azure Tags](https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/enable-tag-inheritance)

## Budget alerts

All cloud providers allow us to set up alerts that fire whenever a particular
project is about to cost more than a specific amount, or is forecast to go
over a specific amount by the end of the month if current trends continue.
This can be *extremely helpful* in assuaging communities' fears of cost overruns,
but requires that we have a prediction for *what numbers* to set these budgets at,
as well as a plan for what to do when the alerts fire. Usually these alerts can be
set up manually in the UI or (preferably) via terraform. We currently don't
utilize these, but we really should!

More information: [GCP](https://cloud.google.com/billing/docs/how-to/budgets), [AWS](https://aws.amazon.com/aws-cost-management/aws-budgets/)
and [Azure](https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/cost-mgt-alerts-monitor-usage-spending).
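
As a sketch of what setting one of these up through the API could look like,
using the `google-cloud-billing-budgets` client - the billing account ID and
budget amount below are hypothetical, and in practice we would wrap this in
terraform:

```python
# Sketch: create a GCP budget that alerts at 50% of actual spend and at
# 90% of *forecasted* spend. The billing account ID is hypothetical.
from google.cloud.billing import budgets_v1

client = budgets_v1.BudgetServiceClient()
budget = budgets_v1.Budget(
    display_name="example-hub-monthly-budget",
    amount=budgets_v1.BudgetAmount(
        # google.type.Money: $500/month, expressed as a dict
        specified_amount={"currency_code": "USD", "units": 500}
    ),
    threshold_rules=[
        budgets_v1.ThresholdRule(threshold_percent=0.5),
        budgets_v1.ThresholdRule(
            threshold_percent=0.9,
            spend_basis=budgets_v1.ThresholdRule.Basis.FORECASTED_SPEND,
        ),
    ],
)
client.create_budget(
    parent="billingAccounts/012345-ABCDEF-678901",  # hypothetical
    budget=budget,
)
```
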
\ No newline at end of file From ed36bfc5973ae2ed89c4faecb45e843587feca71 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Wed, 6 Mar 2024 05:54:00 +0000 Subject: [PATCH 02/36] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/topic/billing/things.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/topic/billing/things.md b/docs/topic/billing/things.md index 440553a785..cecd46b66a 100644 --- a/docs/topic/billing/things.md +++ b/docs/topic/billing/things.md @@ -124,7 +124,7 @@ needs, and get replaced when version upgrades happen. They have names like `gke-jup-test-default-pool-47300cca-wk01` - notice the random string of characters towards the end. The [cluster autoscaler](https://github.com/kubernetes/autoscaler/) is the primary reason for a new node to come into existence or disappear, -along with ocassional manual node version upgrades. +along with occasional manual node version upgrades. Since nodes are often the biggest cloud cost for communities, and the cluster autoscaler is the biggest determiner of how / when nodes are used, @@ -134,7 +134,7 @@ Documentation: [GCP](https://cloud.google.com/kubernetes-engine/docs/concepts/cl ## Home directory -By default, we provide users with a persistent, POSIX compatible fileystem mounted +By default, we provide users with a persistent, POSIX compatible filesystem mounted under `$HOME`. This persists across user server restarts, and is used to store code (and sometimes data). From 18891b339bda27b83c3dcbcf1411568c7f401d0b Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Tue, 5 Mar 2024 21:55:41 -0800 Subject: [PATCH 03/36] Fix some typos --- docs/topic/billing/things.md | 4 ++-- docs/topic/billing/tools.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/topic/billing/things.md b/docs/topic/billing/things.md index cecd46b66a..67a5d9a781 100644 --- a/docs/topic/billing/things.md +++ b/docs/topic/billing/things.md @@ -138,8 +138,8 @@ By default, we provide users with a persistent, POSIX compatible filesystem moun under `$HOME`. This persists across user server restarts, and is used to store code (and sometimes data). -Because this has to persist regardless of wether the user is currently active or -not, cloud providers will charge us for it regardless of wether the user is actively +Because this has to persist regardless of whether the user is currently active or +not, cloud providers will charge us for it regardless of whether the user is actively using it or not. Hence, choosing what cloud products we use here becomes crucial - if we pick something that is a poor fit, storage costs can easily dwarf everything else. diff --git a/docs/topic/billing/tools.md b/docs/topic/billing/tools.md index e0682b9dca..8204596d55 100644 --- a/docs/topic/billing/tools.md +++ b/docs/topic/billing/tools.md @@ -23,7 +23,7 @@ things will all have that tag associated: 1. The amount of memory allocated to that node 2. The amount of CPU allocated to the node 3. (If tagged correctly) The base disk allocated to that node -4. Any network costs accured by actions of processes on that node (subject +4. Any network costs accrued by actions of processes on that node (subject to some complexity) However, they have quite a few limitations as well: @@ -35,7 +35,7 @@ However, they have quite a few limitations as well: prime example), and detailed cost attribution there needs to be done via other means. 3. 
This does not apply retroactively, you must already have the tags set up - for costs to be accured to them. + for costs to be accrued to them. Despite these limitations, cost tagging is extremely important to get a good understanding of what costs how much money. With a well tagged system, we From 692f4fd93f10edb592b33a034a5ad9c8560d2901 Mon Sep 17 00:00:00 2001 From: Yuvi Panda Date: Wed, 6 Mar 2024 09:03:19 -0800 Subject: [PATCH 04/36] Grammar, syntax and clarity fixes Co-authored-by: Jenny Wong Co-authored-by: Sarah Gibson <44771837+sgibson91@users.noreply.github.com> --- docs/topic/billing/accounts.md | 28 ++++++++++++++-------------- docs/topic/billing/things.md | 25 +++++++++++++++---------- docs/topic/billing/tools.md | 21 ++++++++++----------- 3 files changed, 39 insertions(+), 35 deletions(-) diff --git a/docs/topic/billing/accounts.md b/docs/topic/billing/accounts.md index 8eb9066aff..b2b82b277a 100644 --- a/docs/topic/billing/accounts.md +++ b/docs/topic/billing/accounts.md @@ -20,18 +20,18 @@ be attached to a single billing account, and projects can usually move between billing accounts. All cloud resources (clusters, nodes, object storage, etc) are always contained inside a Project. -Billing accounts are generally not created often. Usually you would need to +Billing accounts are generally not created often. Usually, you would need to create a new one for the following reasons: 1. You are getting cloud credits from somewhere, and want a separate container - for it so you can track spend accurately. + for it so you can track spending accurately. 2. You want to use a different credit card than what you use for your other billing account. ## AWS -AWS billing is a little more haphazard than GCP's, and derived partially -from the well established monstrosity that is [LDAP](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol). +AWS billing is a little more haphazard than GCP's and is derived partially +from the well-established monstrosity that is [LDAP](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol). We will not dig too deep into it, lest we wake some demons. However, since we are primarily concerned with billing, understanding the following concepts @@ -53,7 +53,7 @@ Each AWS Account can have billing attached in one of two ways: part of. This is known as [consolidated billing](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/consolidated-billing.html) So billing information is *always* associated with an AWS Account, -*not* AWS Organization. This is a historical artifact - AWS Organizations +*not* an AWS Organization. This is a historical artefact - AWS Organizations were only announced [in 2016](https://aws.amazon.com/about-aws/whats-new/2016/11/announcing-aws-organizations-now-in-preview/), a good 9 years after AWS started. So if this feels a little hacked on, it is because it is! @@ -70,13 +70,13 @@ who need to handle billing. We currently have no experience or knowledge here, as all our Azure customers handle billing themselves. This is not uncommon - most Azure users want to use it because they already have a pre-existing *strong* -relationship with Microsoft, and their billing gets managed as part of +relationship with Microsoft and their billing gets managed as part of that. ## Billing reports in the web console Cloud providers provide a billing console accessible via the web, often -with very helpful reports. While not automation friendly (see next section +with very helpful reports. 
While not automation-friendly (see next section for more information on automation), this is very helpful for doing ad-hoc reporting as well as trying to optimize cloud costs. For dedicated clusters, this may also be directly accessible to hub champions, allowing them to @@ -84,8 +84,8 @@ explore their own usage. ### GCP -You can get a list of all billing accounts you have access to [here](https://console.cloud.google.com/billing) -(make sure you are logged in to the correct google account). Alternatively, +You can get a list of all billing accounts you have access to via the [billing console](https://console.cloud.google.com/billing) +(make sure you are logged in to the correct Google account). Alternatively, you can select 'Billing' from the left sidebar in the Google Cloud Console after selecting the correct project. @@ -120,9 +120,9 @@ access to any Azure account! ## Programmatic access to billing information The APIs for getting access to billing data *somehow* seem to be the -least well developed parts of any cloud vendor's offerings. Usually +least well-developed parts of any cloud vendor's offerings. Usually, they take the form of a 'billing export' - you set up a *destination* -where information about billing is written, often once a day. And then +where information about billing is written, often once a day. Then you query this location for billing information. This is in sharp contrast to most other cloud APIs, where the cloud vendor has a *single source of truth* you can just query. Not so for billing - there's no external API @@ -136,9 +136,9 @@ access from that point onwards, not retrospectively. their sql-like big data product. In particular, we should prefer setting up [detailed billing export](https://cloud.google.com/billing/docs/how-to/export-data-bigquery-tables/detailed-usage). This is set up **once per billing account**, and **not once per project**. - And one billing account can export only to bigquery in one project, and + And one billing account can export only to BigQuery in one project, and you can not easily move this table from one project to another. New - data is written into this once a day, at 5AM Pacific Time. + data is written into this once a day, at 5 AM Pacific Time. 2. AWS asks you to set up [export to an S3 bucket](https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html), where CSV files are produced once a day. While we *could* then import @@ -151,5 +151,5 @@ access from that point onwards, not retrospectively. management. Our automated systems for billing need to read from these 'exported' -data sources, and we need to develop processes for making sure that +data sources, and we need to develop processes to make sure that we *do* have the billing data exports enabled correctly. \ No newline at end of file diff --git a/docs/topic/billing/things.md b/docs/topic/billing/things.md index 67a5d9a781..d6fe04c54f 100644 --- a/docs/topic/billing/things.md +++ b/docs/topic/billing/things.md @@ -1,10 +1,13 @@ # What exactly do cloud providers charge us for? 
 There are a million ways to pay cloud providers money, and trying to
-understand it all is [several](https://www.oreilly.com/library/view/cloud-finops/9781492054610/)
-[books](https://www.porchlightbooks.com/product/cloud-cost-a-complete-guide---2020-edition--gerardus-blokdyk?variationCode=9780655917908)
-[worth](https://www.abebooks.com/servlet/BookDetailsPL?bi=31001470420&dest=usa) [of](https://www.thriftbooks.com/w/cloud-data-centers-and-cost-modeling-a-complete-guide-to-planning-designing-and-building-a-cloud-data-center_caesar-wu_rajkumar-buyya/13992190/item/26408356/?srsltid=AfmBOootnd77xklpoo2MTy8n0np1b5oamDo5KgOg9dCD-0bKody2zEF14oU#idiq=26408356&edition=14835620)
-[material](https://bookshop.org/p/books/reduce-cloud-computing-cost-101-ideas-to-save-millions-in-public-cloud-spending-abhinav-mittal/10266848).
+understand it all is several books worth of material.[^1][^2][^3][^4][^5]
+
+[^1]: https://www.oreilly.com/library/view/cloud-finops/9781492054610/
+[^2]: https://www.porchlightbooks.com/product/cloud-cost-a-complete-guide---2020-edition--gerardus-blokdyk?variationCode=9780655917908
+[^3]: https://www.abebooks.com/servlet/BookDetailsPL?bi=31001470420&dest=usa
+[^4]: https://www.thriftbooks.com/w/cloud-data-centers-and-cost-modeling-a-complete-guide-to-planning-designing-and-building-a-cloud-data-center_caesar-wu_rajkumar-buyya/13992190/item/26408356/?srsltid=AfmBOootnd77xklpoo2MTy8n0np1b5oamDo5KgOg9dCD-0bKody2zEF14oU#idiq=26408356&edition=14835620
+[^5]: https://bookshop.org/p/books/reduce-cloud-computing-cost-101-ideas-to-save-millions-in-public-cloud-spending-abhinav-mittal/10266848

 However, a lot of JupyterHub (+ optionally Dask) clusters operate in similar ways,
 so we can focus on a smaller subset of ways in which cloud companies charge us.
@@ -375,7 +378,7 @@ Grafana, as well as any other service through this single endpoint.
 Each Load Balancer costs money as long as it exists, and there is a per GB
 processing charge as well. Since we don't really have a lot of data coming
 *in* to the JupyterHub (as only user sessions are exposed via the browser),
-the per GB charge usually doesn't add up to much (even uploading 1 terabyte of data (which will be very slow) will only cost between 5 - 14 USD).
+the per GB charge usually doesn't add up to much (even uploading 1 terabyte of data, which will be very slow, will only cost between 5 - 14 USD).

 1. [AWS ELB](https://aws.amazon.com/elasticloadbalancing/pricing/?nc=sn&loc=3)

@@ -392,7 +395,7 @@ the per GB charge usually doesn't add up to much (even uploading 1 terabyte of d
 3. [Azure LoadBalancer](https://azure.microsoft.com/en-us/pricing/details/load-balancer/)

     Pricing varies by region, but is about $0.025 per hour per load balancer
-    (listed as a 'rule' in the pricing page, as our load balancers only have one rule each) + $$0.005 per GB processed.
+    (listed as a 'rule' in the pricing page, as our load balancers only have one rule each) + $0.005 per GB processed.

 ## Access to the external internet

@@ -457,9 +460,11 @@ that this traffic looks to originate from. Our kubernetes nodes are not
 exposed to the public internet at all, which theoretically adds another
 layer of security as well.

-![Dr. Who Meme about Cloud NAT](https://hackmd.io/_uploads/H1IYevYi6.png)
+```{figure} https://hackmd.io/_uploads/H1IYevYi6.png
+:alt: Dr. Who Meme about Cloud NAT

-(one of the many memes about Cloud NAT being expensive, from [QuinnyPig](https://twitter.com/QuinnyPig/status/1357391731902341120). Many seem far more violent. 
See [this post](https://www.lastweekinaws.com/blog/the-aws-managed-nat-gateway-is-unpleasant-and-not-recommended/) for more information.)
Some cloud resources are shared to reduce cost (home directories are a @@ -38,7 +38,7 @@ However, they have quite a few limitations as well: for costs to be accrued to them. Despite these limitations, cost tagging is extremely important to get a good -understanding of what costs how much money. With a well tagged system, we +understanding of what and how much costs money. With a well-tagged system, we can make reasoned trade-offs about cost rather than fiddling in the dark. Resource links: [GCP Labels](https://cloud.google.com/compute/docs/labeling-resources), [AWS Cost Allocation Tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html), [Azure Tags](https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/enable-tag-inheritance) @@ -46,12 +46,11 @@ Resource links: [GCP Labels](https://cloud.google.com/compute/docs/labeling-reso ## Budget alerts All cloud providers allow us to set up alerts that fire whenever a particular -project is about to cost more than a specific amount, or is forecast to go -cost over a specific amount by end of the month if current trends continue. -This can be *extremely helpful* in assuaging communities of cost overruns, +project is about to cost more than a specific amount, or is forecast to go over a specific amount by the end of the month if current trends continue. +This can be *extremely helpful* in assuaging communities of cost overruns but requires we have a prediction for *what numbers* to set these budgets at, -as well as what to do when the alerts fire. Usually these alerts can be -set up manually in the UI or (preferably) via terraform. We currently don't +as well as what to do when the alerts fire. Usually, these alerts can be +set up manually in the UI or (preferably) via Terraform. We currently don't utilize these, but we really should! More information: [GCP](https://cloud.google.com/billing/docs/how-to/budgets), [AWS](https://aws.amazon.com/aws-cost-management/aws-budgets/) From 4de54e8623ba2bdb498ec57f9985b5686073ddab Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Wed, 6 Mar 2024 10:04:28 -0800 Subject: [PATCH 05/36] Rename things.md to be more descriptive --- docs/topic/billing/{things.md => chargeable-resources.md} | 0 docs/topic/billing/index.md | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/topic/billing/{things.md => chargeable-resources.md} (100%) diff --git a/docs/topic/billing/things.md b/docs/topic/billing/chargeable-resources.md similarity index 100% rename from docs/topic/billing/things.md rename to docs/topic/billing/chargeable-resources.md diff --git a/docs/topic/billing/index.md b/docs/topic/billing/index.md index 1320fad675..450d54f85a 100644 --- a/docs/topic/billing/index.md +++ b/docs/topic/billing/index.md @@ -5,7 +5,7 @@ terminology to talk about cloud billing. 
```{toctree} :maxdepth: 2 -things +chargeable-resources accounts tools ``` \ No newline at end of file From c2c14d4dc9e34afc4208324d4f92165c44ed7667 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Wed, 6 Mar 2024 10:17:19 -0800 Subject: [PATCH 06/36] Use reference style links instead of footnotes Makes the markdown easier while still bringing across the point about many books stronger --- docs/topic/billing/chargeable-resources.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index d6fe04c54f..ce3f80c3ff 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -1,13 +1,8 @@ # What exactly do cloud providers charge us for? There are a million ways to pay cloud providers money, and trying to -understand it all is several books worth of material.[^1][^2][^3][^4][^5] +understand it all is [several][book-1] [books][book-2] [worth][book-3] [of][book-4] [material][book-5]. -[^1]: https://www.oreilly.com/library/view/cloud-finops/9781492054610/ -[^2]: https://www.porchlightbooks.com/product/cloud-cost-a-complete-guide---2020-edition--gerardus-blokdyk?variationCode=9780655917908 -[^3]: https://www.abebooks.com/servlet/BookDetailsPL?bi=31001470420&dest=usa -[^4]: https://www.thriftbooks.com/w/cloud-data-centers-and-cost-modeling-a-complete-guide-to-planning-designing-and-building-a-cloud-data-center_caesar-wu_rajkumar-buyya/13992190/item/26408356/?srsltid=AfmBOootnd77xklpoo2MTy8n0np1b5oamDo5KgOg9dCD-0bKody2zEF14oU#idiq=26408356&edition=14835620 -[^5]: https://bookshop.org/p/books/reduce-cloud-computing-cost-101-ideas-to-save-millions-in-public-cloud-spending-abhinav-mittal/10266848 However, a lot of JupyterHub (+ optionally Dask) clusters operate in similar ways, so we can focus on a smaller subset of ways in which cloud companies charge us. @@ -613,3 +608,10 @@ which is automatically already pre-configured to collect logs from all our clust AWS supports [CloudWatch](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html) for logs, but they aren't enabled by default. Azure probably has at least one product for logs. 
+ + +[book-1]: https://www.oreilly.com/library/view/cloud-finops/9781492054610/ +[book-2]: https://www.porchlightbooks.com/product/cloud-cost-a-complete-guide---2020-edition--gerardus-blokdyk?variationCode=9780655917908 +[book-3]: https://www.abebooks.com/servlet/BookDetailsPL?bi=31001470420&dest=usa +[book-4]: https://www.thriftbooks.com/w/cloud-data-centers-and-cost-modeling-a-complete-guide-to-planning-designing-and-building-a-cloud-data-center_caesar-wu_rajkumar-buyya/13992190/item/26408356/?srsltid=AfmBOootnd77xklpoo2MTy8n0np1b5oamDo5KgOg9dCD-0bKody2zEF14oU#idiq=26408356&edition=14835620 +[book-5]: https://bookshop.org/p/books/reduce-cloud-computing-cost-101-ideas-to-save-millions-in-public-cloud-spending-abhinav-mittal/10266848 From aff61f4c5e1dcaf9ce6e161fc23fd60faf7d82a2 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Wed, 6 Mar 2024 10:22:59 -0800 Subject: [PATCH 07/36] Cross-link sections in the guide better --- docs/topic/billing/chargeable-resources.md | 24 +++++++++++++--------- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index ce3f80c3ff..072f58286f 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -74,7 +74,8 @@ to store: ephemeral disk 6. For *ephemeral hubs*, their home directories are also stored using the same mechanism as (3) - so they take up space on the ephemeral disk as - well. Most hubs have a *persistent* home directory set up, detailed below. + well. Most hubs have a *persistent* home directory set up, detailed + in the [Home directories section](topic:billing:resources:home). These disks can be fairly small - 100GB at the largest, but we can get away @@ -130,6 +131,7 @@ understanding its behavior is very important in understanding cloud costs. Documentation: [GCP](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler), [AWS](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md), [Azure](https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler?tabs=azure-cli) +(topic:billing:resources:home)= ## Home directory By default, we provide users with a persistent, POSIX compatible filesystem mounted @@ -184,7 +186,7 @@ cloud providers: We have to specify a maximum storage capacity, which can be adjusted up or down. We pay for the maximum storage capacity, regardless of what we use. - +(topic:billing:resources:network-fees)= ## Network Fees Back in the old days, 'networking' implied literally running physical @@ -279,10 +281,10 @@ A core use of JupyterHub is to put the compute near where the data is, so at all, as we are accessing data in the same region / zone. Object storage access is the primary point of contention here for the kind of infrastructure we manage, so as long as we are careful about that we should be ok. More -information about managing this is present in the object storage section of -this guide. - +information about managing this is present in the [object storage section](topic:billing:resources:object-storage) +of this guide. 
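+Once collected, these logs can also be queried programmatically, which is
+handy for ad-hoc debugging. As a rough illustration only - not our tooling,
+just a minimal sketch with the `google-cloud-logging` client, where the
+project ID and filter are illustrative placeholders:
+
+```python
+from google.cloud import logging
+
+# Placeholder project ID - substitute the GCP project backing a cluster.
+client = logging.Client(project="example-gcp-project")
+
+# Cloud Logging's filter syntax; fetch a few recent error-level container logs.
+log_filter = 'resource.type="k8s_container" severity>=ERROR'
+for entry in client.list_entries(filter_=log_filter, max_results=5):
+    print(entry.timestamp, entry.log_name)
+```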
+(topic:billing:resources:object-storage)= ## Object storage [Object Storage](https://en.wikipedia.org/wiki/Object_storage) is one of the @@ -338,11 +340,13 @@ Pricing link: [AWS S3](https://aws.amazon.com/s3/pricing/), [GCP GCS](https://cl ### Egress fees -As discussed in the network fees section, *egress* fees are how cloud +As discussed in the [network fees +section](topic:billing:resources:network-fees), *egress* fees are how cloud providers often lock you in. They want you to store all your data in them, -particularly in their object stores, and charge an inordinate amount of money -to take that data with you somewhere else. This is egregious enough now that -[regulators are noticing](https://www.cnbc.com/2023/10/05/amazon-and-microsofts-cloud-dominance-referred-for-uk-competition-probe.html), +particularly in their object stores, and charge an inordinate amount of money to +take that data with you somewhere else. This is egregious enough now that +[regulators are +noticing](https://www.cnbc.com/2023/10/05/amazon-and-microsofts-cloud-dominance-referred-for-uk-competition-probe.html), and that is probably the long term solution. In the meantime, the answer is fairly simple - just **never** expose your @@ -528,7 +532,7 @@ the infrastructure: their data, and need a dedicated block storage device. One is created *for each user* (via [Kubernetes Dynamic Volume Provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/)), and this can get quite expensive - (for reasons outlined in the "Home Directory Storage" section above). + (for reasons outlined in the ["Home Directory Storage" section](topic:billing:resources:home)). Pricing is primarily dependent on the expected disk performance (measured as throughput and IOPS) as well as the disk size. While there's a lot of options available, we'll list what we most commonly use in this table. From 9ff5a90f8115316b92a451fc236525fdc7c9b9d6 Mon Sep 17 00:00:00 2001 From: Erik Sundell Date: Wed, 6 Mar 2024 07:44:47 +0100 Subject: [PATCH 08/36] basehub: apply jupyter-server-proxy vulnerability patch dynamically --- helm-charts/basehub/values.yaml | 133 +++++++++++++++++++++++++++++++- 1 file changed, 131 insertions(+), 2 deletions(-) diff --git a/helm-charts/basehub/values.yaml b/helm-charts/basehub/values.yaml index 56e3ade80a..276537baa3 100644 --- a/helm-charts/basehub/values.yaml +++ b/helm-charts/basehub/values.yaml @@ -277,8 +277,10 @@ jupyterhub: mountPath: /home/jovyan/shared subPath: _shared cmd: - # Explicitly define this, as it's no longer set by z2jh - # https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/2449 + # Mitigate a vulnerability in jupyter-server-proxy version <4.1.1, see + # https://github.com/jupyterhub/jupyter-server-proxy/security/advisories/GHSA-w3vc-fx9p-wp4v + # for more details. + - /mnt/ghsa-w3vc-fx9p-wp4v/check-patch-run - jupyterhub-singleuser extraEnv: # notebook writes secure files that don't need to survive a @@ -291,6 +293,133 @@ jupyterhub: # Most people do not expect this, so let's match expectation SHELL: /bin/bash extraFiles: + ghsa-w3vc-fx9p-wp4v-check-patch-run: + mountPath: /mnt/ghsa-w3vc-fx9p-wp4v/check-patch-run + mode: 0755 + stringData: | + #!/usr/bin/env python3 + """ + This script is designed to check for and conditionally patch GHSA-w3vc-fx9p-wp4v + in user servers started by a JupyterHub. The script will execute any command + passed via arguments if provided, allowing it to wrap a user server startup call + to `jupyterhub-singleuser` for example. 
+ + Script assumptions: + - its run before the jupyter server starts + + Script adjustments: + - UPGRADE_IF_VULNERABLE + - ERROR_IF_VULNERABLE + + Read more at https://github.com/jupyterhub/jupyter-server-proxy/security/advisories/GHSA-w3vc-fx9p-wp4v. + """ + + import os + import subprocess + import sys + + # adjust these to meet vulnerability mitigation needs + UPGRADE_IF_VULNERABLE = True + ERROR_IF_VULNERABLE = True + + + def check_vuln(): + """ + Checks for the vulnerability by looking to see if __version__ is available, + it is since 3.2.3 and 4.1.1 that and those are the first patched versions. + """ + try: + import jupyter_server_proxy + + return False if hasattr(jupyter_server_proxy, "__version__") else True + except: + return False + + + def get_version_specifier(): + """ + Returns a pip version specifier for use with `--no-deps` meant to do as + little as possible besides patching the vulnerability and remaining + functional. + """ + old = ["jupyter-server-proxy>=3.2.3,<4"] + new = ["jupyter-server-proxy>=4.1.1,<5", "simpervisor>=1,<2"] + + try: + if sys.version_info < (3, 8): + return old + + from importlib.metadata import version + + jsp_version = version("jupyter-server-proxy") + if int(jsp_version.split(".")[0]) < 4: + return old + except: + pass + return new + + + def patch_vuln(): + """ + Attempts to patch the vulnerability by upgrading jupyter-server-proxy using + pip. + """ + # attempt upgrade via pip, takes ~4 seconds + proc = subprocess.run( + [sys.executable, "-m", "pip", "--version"], + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + ) + pip_available = proc.returncode == 0 + if pip_available: + proc = subprocess.run( + [sys.executable, "-m", "pip", "install", "--no-deps"] + + get_version_specifier() + ) + if proc.returncode == 0: + return True + + + def main(): + if check_vuln(): + warning_or_error = ( + "ERROR" if ERROR_IF_VULNERABLE and not UPGRADE_IF_VULNERABLE else "WARNING" + ) + print( + f"{warning_or_error}: jupyter-server-proxy __is vulnerable__ to GHSA-w3vc-fx9p-wp4v, see " + "https://github.com/jupyterhub/jupyter-server-proxy/security/advisories/GHSA-w3vc-fx9p-wp4v.", + flush=True, + ) + if warning_or_error == "ERROR": + sys.exit(1) + + if UPGRADE_IF_VULNERABLE: + print( + "INFO: Attempting to upgrade jupyter-server-proxy using pip...", + flush=True, + ) + if patch_vuln(): + print( + "INFO: Attempt to upgrade jupyter-server-proxy succeeded!", + flush=True, + ) + else: + warning_or_error = "ERROR" if ERROR_IF_VULNERABLE else "WARNING" + print( + f"{warning_or_error}: Attempt to upgrade jupyter-server-proxy failed!", + flush=True, + ) + if warning_or_error == "ERROR": + sys.exit(1) + + if len(sys.argv) >= 2: + print("INFO: Executing provided command", flush=True) + os.execvp(sys.argv[1], sys.argv[1:]) + else: + print("INFO: No command to execute provided", flush=True) + + + main() ipython_kernel_config.json: mountPath: /usr/local/etc/ipython/ipython_kernel_config.json data: From c1194601e7b48193b2f3c90ee2b00c67f1fd7e91 Mon Sep 17 00:00:00 2001 From: Erik Sundell Date: Thu, 7 Mar 2024 10:11:01 +0100 Subject: [PATCH 09/36] basehub: opt out erroring if vulnerable initially --- helm-charts/basehub/values.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/helm-charts/basehub/values.yaml b/helm-charts/basehub/values.yaml index 276537baa3..6e52d12d01 100644 --- a/helm-charts/basehub/values.yaml +++ b/helm-charts/basehub/values.yaml @@ -320,7 +320,7 @@ jupyterhub: # adjust these to meet vulnerability mitigation needs 
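+      For example, this chart's singleuser `cmd` uses it to wrap the normal
+      server startup:
+
+          /mnt/ghsa-w3vc-fx9p-wp4v/check-patch-run jupyterhub-singleuser
+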
UPGRADE_IF_VULNERABLE = True - ERROR_IF_VULNERABLE = True + ERROR_IF_VULNERABLE = False def check_vuln(): From dbfd90ebe95819bedf3a955f64d4133af04eda95 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Fri, 8 Mar 2024 11:56:13 -0800 Subject: [PATCH 10/36] Consistently use $ in front of numbers to indicate price --- docs/topic/billing/chargeable-resources.md | 2 +- docs/topic/billing/tools.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index 072f58286f..32697967fd 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -377,7 +377,7 @@ Grafana, as well as any other service through this single endpoint. Each Load Balancer costs money as long as it exists, and there is a per GB processing charge as well. Since we don't really have a lot of data coming *in* to the JupyterHub (as only user sessions are exposed via the browser), -the per GB charge usually doesn't add up to much (even uploading 1 terabyte of data, which will be very slow, will only cost between 5 - 14 USD). +the per GB charge usually doesn't add up to much (even uploading 1 terabyte of data, which will be very slow, will only cost between $5 - $14). 1. [AWS ELB](https://aws.amazon.com/elasticloadbalancing/pricing/?nc=sn&loc=3) diff --git a/docs/topic/billing/tools.md b/docs/topic/billing/tools.md index deefe6d6e8..72b8a7a092 100644 --- a/docs/topic/billing/tools.md +++ b/docs/topic/billing/tools.md @@ -6,8 +6,8 @@ reason about and use. ## Using tags to separate costs Cloud billing reports by default allow us to see how much each **service** -was charged for - for example, we could know that VMs cost us 589\$. -But wouldn't it be nice to know how much of that 589\$ was for core nodes, +was charged for - for example, we could know that VMs cost us $589. +But wouldn't it be nice to know how much of that $589 was for core nodes, how much was for user nodes, and how much for dask workers? Or hypothetically, if we had multiple profiles that were accessible only to certain subsets of users, which profiles would cost how much money? From d277922273a5869f341a5f0d19eaf519296d433a Mon Sep 17 00:00:00 2001 From: Erik Sundell Date: Wed, 13 Mar 2024 13:25:48 +0100 Subject: [PATCH 11/36] Sync with official patch script with Sarah's docstring fix --- helm-charts/basehub/values.yaml | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/helm-charts/basehub/values.yaml b/helm-charts/basehub/values.yaml index 6e52d12d01..7891a2ef3d 100644 --- a/helm-charts/basehub/values.yaml +++ b/helm-charts/basehub/values.yaml @@ -304,13 +304,15 @@ jupyterhub: passed via arguments if provided, allowing it to wrap a user server startup call to `jupyterhub-singleuser` for example. - Script assumptions: - - its run before the jupyter server starts - Script adjustments: - UPGRADE_IF_VULNERABLE - ERROR_IF_VULNERABLE + Script patching assumptions: + - script is run before the jupyter server starts + - pip is available + - pip has sufficient filesystem permissions to upgrade jupyter-server-proxy + Read more at https://github.com/jupyterhub/jupyter-server-proxy/security/advisories/GHSA-w3vc-fx9p-wp4v. """ @@ -325,8 +327,8 @@ jupyterhub: def check_vuln(): """ - Checks for the vulnerability by looking to see if __version__ is available, - it is since 3.2.3 and 4.1.1 that and those are the first patched versions. 
+ Checks for the vulnerability by looking to see if __version__ is available + as it coincides with the patched versions (3.2.3 and 4.1.1). """ try: import jupyter_server_proxy From 43c0850162096e979b1fac8ed70522be3d1eaaa6 Mon Sep 17 00:00:00 2001 From: Yuvi Panda Date: Wed, 13 Mar 2024 17:38:17 -0700 Subject: [PATCH 12/36] Edit for clarity Co-authored-by: Erik Sundell --- docs/topic/billing/chargeable-resources.md | 22 ++++++++++------------ docs/topic/billing/tools.md | 4 ++-- 2 files changed, 12 insertions(+), 14 deletions(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index 32697967fd..85db10c1d8 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -19,7 +19,7 @@ components: 1. CPU (count and architecture) 2. Memory (just size) -3. Ephemeral disk space +3. Boot disk (size and type, ephemeral) 4. Optional accelerators (like GPUs, TPUs, network accelerators, etc) Let's go through each of these components, and then talk about how they @@ -54,7 +54,7 @@ of a node costs for our usage pattern. Unlike CPU, memory is simpler. A GB is a GB! We pay for how much GB the VM we request has, regardless of whether we use it or not. -### (Ephemeral) Disk +### Boot disk (ephemeral) Each node needs some disk space that lives only as long as the node is up, to store: @@ -80,7 +80,7 @@ to store: These disks can be fairly small - 100GB at the largest, but we can get away with far smaller disks most of the time. The performance of these disks -primarily matters *only* during initial docker image pull time - faster the +*primarily* matters during initial docker image pull time - faster the disk, faster it is to pull large docker images. But it can be really expensive to make the base disks super fast SSDs, so often a balanced middling performance tier is more appropriate. @@ -92,9 +92,7 @@ dedicated hardware that is more limited in functionality than a general purpose CPU, but much, *much* faster. [GPU](https://en.wikipedia.org/wiki/Graphics_processing_unit) are the most common ones in use today, but there are also [TPU](https://cloud.google.com/tpu?hl=en)s, [FGPA](https://en.wikipedia.org/wiki/Field-programmable_gate_array)s -available in the cloud. Depending on the cloud provider, these can either -be attached to *any* node type (GCP), or available only on a special set -of node types & sizes (AWS & Azure). +available in the cloud. Accelerators can often only be attached to a subset of a cloud provider's available node types, node sizes, and cloud zones. GPUs are the most commonly used with JupyterHubs, and are often the most expensive as well! Leaving a GPU running for days accidentally is the @@ -239,7 +237,7 @@ we can consider a boundary to be crossed when: 1. The source and destination are in different zones of the same region of the cloud provider 2. The source and destination are in different regions of the same cloud provider -3. One of the source or destination is *outside* the cloud provider +3. Either the source or destination is *outside* the cloud provider If the source and destination are within the same zone, no network cost boundaries are crossed and usually this costs no money. As a general rule, @@ -295,7 +293,7 @@ data is stored as *objects* and accessed via HTTP methods over the network higher scalability at the cost of a little more complexity (as all data access becomes a network call). 
 While home directories may store users' code, the data is either
For example, if a node is tagged with a particular tag, the following separate -things will be assoiated with the tag: +things will be associated with the tag: 1. The amount of memory allocated to that node 2. The amount of CPU allocated to the node From c51906e0c0a415618af5c1d2d0d9594a6fef4e1f Mon Sep 17 00:00:00 2001 From: Erik Sundell Date: Thu, 14 Mar 2024 23:19:32 +0100 Subject: [PATCH 14/36] Let patch_vuln function return False instead of None Co-authored-by: Georgiana --- helm-charts/basehub/values.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/helm-charts/basehub/values.yaml b/helm-charts/basehub/values.yaml index 7891a2ef3d..d8ee75ef90 100644 --- a/helm-charts/basehub/values.yaml +++ b/helm-charts/basehub/values.yaml @@ -380,6 +380,7 @@ jupyterhub: ) if proc.returncode == 0: return True + return False def main(): From 5f825da2bad9179ea63f9ecc1fba64287272275a Mon Sep 17 00:00:00 2001 From: Silva Alejandro Ismael Date: Sat, 16 Mar 2024 15:49:42 -0300 Subject: [PATCH 15/36] catalyst-africa: add wits --- .../catalystproject-africa/cluster.yaml | 8 +++++ .../enc-wits.secret.values.yaml | 20 ++++++++++++ .../catalystproject-africa/wits.values.yaml | 31 +++++++++++++++++++ 3 files changed, 59 insertions(+) create mode 100644 config/clusters/catalystproject-africa/enc-wits.secret.values.yaml create mode 100644 config/clusters/catalystproject-africa/wits.values.yaml diff --git a/config/clusters/catalystproject-africa/cluster.yaml b/config/clusters/catalystproject-africa/cluster.yaml index 3276b0b1b4..4bd86a4c8e 100644 --- a/config/clusters/catalystproject-africa/cluster.yaml +++ b/config/clusters/catalystproject-africa/cluster.yaml @@ -42,3 +42,11 @@ hubs: - common.values.yaml - uvri.values.yaml - enc-uvri.secret.values.yaml + - name: wits + display_name: "Catalyst Project, Africa - WITS" + domain: wits.af.catalystproject.2i2c.cloud + helm_chart: basehub + helm_chart_values_files: + - common.values.yaml + - wits.values.yaml + - enc-wits.secret.values.yaml diff --git a/config/clusters/catalystproject-africa/enc-wits.secret.values.yaml b/config/clusters/catalystproject-africa/enc-wits.secret.values.yaml new file mode 100644 index 0000000000..323cfacb4f --- /dev/null +++ b/config/clusters/catalystproject-africa/enc-wits.secret.values.yaml @@ -0,0 +1,20 @@ +jupyterhub: + hub: + config: + GitHubOAuthenticator: + client_id: ENC[AES256_GCM,data:pJQFrWrenwY7jHiFBGTVCMZLu5w=,iv:MLrLiqdz5iHDsKwjsw77zU9kULxdmIOmOQgjh++Cook=,tag:64gcyzERRwSKbJkplC9ukw==,type:str] + client_secret: ENC[AES256_GCM,data:Bl/t9kN0Pynlg6rTzR2PZ8OawzxYCpmYb1qXvz4M4lbQjPQht8xyuA==,iv:dwot4RZoaMRKNdF/BHytae7Vj6FVleifQcpa+XFiHaU=,tag:vQVA1olAuQ8Dk7uoGVE1Bg==,type:str] +sops: + kms: [] + gcp_kms: + - resource_id: projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs + created_at: "2024-03-16T18:43:27Z" + enc: CiUA4OM7eEOOAxlm3aaJ0/2+Bli6lMIms5g1F1McTvczHLD5E2JXEkkAXoW3JjnvcvLqZazPKNcY4tnlJ0iQEDal1Ij1TrN/eoLguy7dvBhuBctyeYbgaHBjOfQMIOq2T+UAaEF3Kqs7gM542nJjC66Q + azure_kv: [] + hc_vault: [] + age: [] + lastmodified: "2024-03-16T18:43:28Z" + mac: ENC[AES256_GCM,data:VDU1vuVnL8SJXHdnlYGW4RWGZHfA/VKI3UpdH8v8sMynHTJsNa/4jsuWMapx9/3mEwJJ4gxsZFTkQ4I2U8SH4O3oIM1SZTIyAjn8ZfKk4UQrUXYiqGR/lwJVP7XBNpIj4XFDkjMCt7K/3YwgmHiorX8mLopCjsHhNO6WAUqGCDw=,iv:Go+Y30GoGswvkFNVb3rizc/DXB8iikSs8lYBqMCm8zY=,tag:83a8GVibHY+y4Pp5OqfnOg==,type:str] + pgp: [] + unencrypted_suffix: _unencrypted + version: 3.8.1 diff --git a/config/clusters/catalystproject-africa/wits.values.yaml 
b/config/clusters/catalystproject-africa/wits.values.yaml new file mode 100644 index 0000000000..c692d88a69 --- /dev/null +++ b/config/clusters/catalystproject-africa/wits.values.yaml @@ -0,0 +1,31 @@ +jupyterhub: + ingress: + hosts: [wits.af.catalystproject.2i2c.cloud] + tls: + - hosts: [wits.af.catalystproject.2i2c.cloud] + secretName: https-auto-tls + custom: + 2i2c: + add_staff_user_ids_to_admin_users: true + add_staff_user_ids_of_type: "github" + jupyterhubConfigurator: + enabled: false + homepage: + templateVars: + org: + name: Catalyst Project, Africa - WITS + url: https://catalystproject.cloud/ + logo_url: https://catalystproject.cloud/_images/catalyst-icon-dark.png + hub: + config: + JupyterHub: + authenticator_class: github + GitHubOAuthenticator: + oauth_callback_url: https://wits.af.catalystproject.2i2c.cloud/hub/oauth_callback + allowed_organizations: + - CatalystProject-Hubs:wits + scope: + - read:org + Authenticator: + admin_users: + - gentlelab2016 From a209741908b726d245f3fb501c7d10f178b0d0c4 Mon Sep 17 00:00:00 2001 From: Silva Alejandro Ismael Date: Sat, 16 Mar 2024 16:00:07 -0300 Subject: [PATCH 16/36] catalyst-africa: add kush --- .../catalystproject-africa/cluster.yaml | 8 +++++ .../enc-kush.secret.values.yaml | 20 ++++++++++++ .../catalystproject-africa/kush.values.yaml | 31 +++++++++++++++++++ 3 files changed, 59 insertions(+) create mode 100644 config/clusters/catalystproject-africa/enc-kush.secret.values.yaml create mode 100644 config/clusters/catalystproject-africa/kush.values.yaml diff --git a/config/clusters/catalystproject-africa/cluster.yaml b/config/clusters/catalystproject-africa/cluster.yaml index 4bd86a4c8e..39a5dc1b73 100644 --- a/config/clusters/catalystproject-africa/cluster.yaml +++ b/config/clusters/catalystproject-africa/cluster.yaml @@ -50,3 +50,11 @@ hubs: - common.values.yaml - wits.values.yaml - enc-wits.secret.values.yaml + - name: kush + display_name: "Catalyst Project, Africa - KUSH" + domain: kush.af.catalystproject.2i2c.cloud + helm_chart: basehub + helm_chart_values_files: + - common.values.yaml + - kush.values.yaml + - enc-kush.secret.values.yaml diff --git a/config/clusters/catalystproject-africa/enc-kush.secret.values.yaml b/config/clusters/catalystproject-africa/enc-kush.secret.values.yaml new file mode 100644 index 0000000000..f93a59fec4 --- /dev/null +++ b/config/clusters/catalystproject-africa/enc-kush.secret.values.yaml @@ -0,0 +1,20 @@ +jupyterhub: + hub: + config: + GitHubOAuthenticator: + client_id: ENC[AES256_GCM,data:Gkmq+33RgGILWWMxNBBSanflwwE=,iv:hps2UpDP6V8lppuKb9RnIWLTJ0PvtpTZiRrlxpzBkJ4=,tag:aIgWTM58Co8jnS1XY+Prog==,type:str] + client_secret: ENC[AES256_GCM,data:qhC0EQqw/xMn6MoLX/Qg5490EKs1gAn2vMjbFbyjkz8MLkIyD7IPeg==,iv:qtcya1NLgQqrIVvpQjk7nD6pYopoihApH56nG+BwGds=,tag:SJXcutZaLIG0aJuTh2O9MA==,type:str] +sops: + kms: [] + gcp_kms: + - resource_id: projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs + created_at: "2024-03-16T18:56:57Z" + enc: CiUA4OM7eC6pPiGwZPvMCnBtZyxMHkHbqj83JS4/nUqeByc90ku6EkkAXoW3JmO+LiuyXhEwZw2g6VZx+pclkjt791ly42bEOuJ54S21YpSROCZbn2UtwQ6X5K2BcPlUjiGs+/YmFcXQn0SkxtP5T1/Q + azure_kv: [] + hc_vault: [] + age: [] + lastmodified: "2024-03-16T18:56:59Z" + mac: ENC[AES256_GCM,data:DvnKusaKQFCAWMPWEWPNA9EfBQQ1pBJ82hiNvGn/DIBwlHRf1E4QBV+ZiajPSdLGMSl+Fi1aPPsLYwmK7Kn/oH5zERWnzNFFMwbdobUetrz/AOExKHWztPKqj9Ckf/k5trN3Jc4B9bpvsQuj9jLP9kVVQX7L4Sv4x6OSH2U45BI=,iv:oY/fSrjgTgl7HkaoblGjfH64GYRh254oBzEoaf5o/jc=,tag:VPPCYnzbl4xa2TTTXkef6A==,type:str] + pgp: [] + unencrypted_suffix: 
_unencrypted + version: 3.8.1 diff --git a/config/clusters/catalystproject-africa/kush.values.yaml b/config/clusters/catalystproject-africa/kush.values.yaml new file mode 100644 index 0000000000..0233a2d787 --- /dev/null +++ b/config/clusters/catalystproject-africa/kush.values.yaml @@ -0,0 +1,31 @@ +jupyterhub: + ingress: + hosts: [kush.af.catalystproject.2i2c.cloud] + tls: + - hosts: [kush.af.catalystproject.2i2c.cloud] + secretName: https-auto-tls + custom: + 2i2c: + add_staff_user_ids_to_admin_users: true + add_staff_user_ids_of_type: "github" + jupyterhubConfigurator: + enabled: false + homepage: + templateVars: + org: + name: Catalyst Project, Africa - KUSH + url: https://catalystproject.cloud/ + logo_url: https://catalystproject.cloud/_images/catalyst-icon-dark.png + hub: + config: + JupyterHub: + authenticator_class: github + GitHubOAuthenticator: + oauth_callback_url: https://kush.af.catalystproject.2i2c.cloud/hub/oauth_callback + allowed_organizations: + - CatalystProject-Hubs:kush + scope: + - read:org + Authenticator: + admin_users: + - Fadlelmola From 516a99a91ef95c78e67a78de5627b8d35677129d Mon Sep 17 00:00:00 2001 From: Silva Alejandro Ismael Date: Sat, 16 Mar 2024 16:13:16 -0300 Subject: [PATCH 17/36] catalyst-africa: add molerhealth --- .../catalystproject-africa/cluster.yaml | 8 +++++ .../enc-molerhealth.secret.values.yaml | 20 ++++++++++++ .../molerhealth.values.yaml | 31 +++++++++++++++++++ 3 files changed, 59 insertions(+) create mode 100644 config/clusters/catalystproject-africa/enc-molerhealth.secret.values.yaml create mode 100644 config/clusters/catalystproject-africa/molerhealth.values.yaml diff --git a/config/clusters/catalystproject-africa/cluster.yaml b/config/clusters/catalystproject-africa/cluster.yaml index 39a5dc1b73..9e180e9047 100644 --- a/config/clusters/catalystproject-africa/cluster.yaml +++ b/config/clusters/catalystproject-africa/cluster.yaml @@ -58,3 +58,11 @@ hubs: - common.values.yaml - kush.values.yaml - enc-kush.secret.values.yaml + - name: molerhealth + display_name: "Catalyst Project, Africa - MolerHealth " + domain: molerhealth.af.catalystproject.2i2c.cloud + helm_chart: basehub + helm_chart_values_files: + - common.values.yaml + - molerhealth.values.yaml + - enc-molerhealth.secret.values.yaml diff --git a/config/clusters/catalystproject-africa/enc-molerhealth.secret.values.yaml b/config/clusters/catalystproject-africa/enc-molerhealth.secret.values.yaml new file mode 100644 index 0000000000..684d3a3d68 --- /dev/null +++ b/config/clusters/catalystproject-africa/enc-molerhealth.secret.values.yaml @@ -0,0 +1,20 @@ +jupyterhub: + hub: + config: + GitHubOAuthenticator: + client_id: ENC[AES256_GCM,data:9LqPZVvDDjL9yTYm7vyPTowA+uM=,iv:7q7PhCj5WuS2/UEqjKmeqY+xliv865Q3OUO78DI0ETo=,tag:dIUB4Jo3LMjb6eEvKxii0A==,type:str] + client_secret: ENC[AES256_GCM,data:y/fXvq5IlCVsP8KY4ix1/KXoaH9eJw9k3fwGbczcheHzANC1IlT6kw==,iv:CDokEvuSGOKKM7f9lUyu3kqPw3FFtkR7TGbG4QH0RIM=,tag:JXfQLCDBZ3L9h74Ld5qYtw==,type:str] +sops: + kms: [] + gcp_kms: + - resource_id: projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs + created_at: "2024-03-16T19:10:31Z" + enc: CiUA4OM7eH/kl/O91dxTRQFjbMaOGogUIJKSl5pn8Y1u9ffhNVOzEkkAXoW3JlzJ8d8nmYdsLHn5MrYSm8ikApQ415l4Zvchcom2dCGTgT3U92U/4Jk+qu8DUO1BNEH5ZyZxvAmOSWwJRvAB11c4XcJH + azure_kv: [] + hc_vault: [] + age: [] + lastmodified: "2024-03-16T19:10:32Z" + mac: 
ENC[AES256_GCM,data:CSjKR1oQP/Ua21HOEGGc6l1vFN41/hgTSOu7WrANAy88PzffHW+Q2M5+AASHE0udrO4E4+TKKQ7R2h1xFyGil/b3U0545CRT1imtgX7Ixin/vqLjFT/f6hGfvyQ9UM7FfDsTWbaeZylGPekuTenmsI4gz1Uwrxz1pCjSP0wJo5k=,iv:JDBCbX6740RJsvo4A06/hEhXNg+QOupkRRihsaOjfpg=,tag:XmoSINsm1cxJKP3ipySD5A==,type:str] + pgp: [] + unencrypted_suffix: _unencrypted + version: 3.8.1 diff --git a/config/clusters/catalystproject-africa/molerhealth.values.yaml b/config/clusters/catalystproject-africa/molerhealth.values.yaml new file mode 100644 index 0000000000..77928f9fd4 --- /dev/null +++ b/config/clusters/catalystproject-africa/molerhealth.values.yaml @@ -0,0 +1,31 @@ +jupyterhub: + ingress: + hosts: [molerhealth.af.catalystproject.2i2c.cloud] + tls: + - hosts: [molerhealth.af.catalystproject.2i2c.cloud] + secretName: https-auto-tls + custom: + 2i2c: + add_staff_user_ids_to_admin_users: true + add_staff_user_ids_of_type: "github" + jupyterhubConfigurator: + enabled: false + homepage: + templateVars: + org: + name: Catalyst Project, Africa - MolerHealth + url: https://catalystproject.cloud/ + logo_url: https://catalystproject.cloud/_images/catalyst-icon-dark.png + hub: + config: + JupyterHub: + authenticator_class: github + GitHubOAuthenticator: + oauth_callback_url: https://molerhealth.af.catalystproject.2i2c.cloud/hub/oauth_callback + allowed_organizations: + - CatalystProject-Hubs:molerhealth + scope: + - read:org + Authenticator: + admin_users: + - Monsurat-Onabajo From 74f21e3e7670de357ece99b2ff817c2869966cc5 Mon Sep 17 00:00:00 2001 From: Brian Freitag Date: Mon, 18 Mar 2024 11:06:11 -0500 Subject: [PATCH 18/36] Remove ieee-grss workshop github team --- config/clusters/nasa-veda/common.values.yaml | 1 - 1 file changed, 1 deletion(-) diff --git a/config/clusters/nasa-veda/common.values.yaml b/config/clusters/nasa-veda/common.values.yaml index dd8c241c07..44dc2eb07a 100644 --- a/config/clusters/nasa-veda/common.values.yaml +++ b/config/clusters/nasa-veda/common.values.yaml @@ -48,7 +48,6 @@ basehub: - CYGNSS-VEDA:cygnss-iwg - veda-analytics-access:maap-biomass-team - Earth-Information-System:eis-fire - - nasa-veda-workshops:ieee-grss-webinar-mar-2024 scope: - read:org Authenticator: From 15b751e1b7ae25b2af4266f9f60cac544df842a3 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Mon, 18 Mar 2024 14:29:06 -0700 Subject: [PATCH 19/36] Add note about EU DMA --- docs/topic/billing/chargeable-resources.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index 85db10c1d8..e33e11a6db 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -267,7 +267,11 @@ enforce this lock-in by charging an [extremely high network egress cost](https:/ Like, 'bankrupt your customer because they tried to move a bucket of data from one cloud provider to another' high egress cost. Cloudflare estimates that in US data centers, it's [about 80x](https://blog.cloudflare.com/aws-egregious-egress) what -it actually costs the big cloud providers. +it actually costs the big cloud providers. The fundamental anti-competitive nature +how egress fees can hold your data 'hostage' in a cloud provider is finally causing +regulators to crack down, but only under limited circumstances ( +[AWS](https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-internet-when-moving-out-of-aws/), +[GCP](https://cloud.google.com/blog/products/networking/eliminating-data-transfer-fees-when-migrating-off-google-cloud)). 
So when people are 'scared' of cloud costs, it's **most often** network egress costs. It's very easy to make a bucket public, have a single other person From 773e9db5d42f2e04a5a86f599d8b12cb0d525dc7 Mon Sep 17 00:00:00 2001 From: Yuvi Panda Date: Mon, 18 Mar 2024 14:35:26 -0700 Subject: [PATCH 20/36] Fix typo Co-authored-by: Georgiana --- docs/topic/billing/chargeable-resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index e33e11a6db..4b18c94864 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -383,7 +383,7 @@ the per GB charge usually doesn't add up to much (even uploading 1 terabyte of d 1. [AWS ELB](https://aws.amazon.com/elasticloadbalancing/pricing/?nc=sn&loc=3) - We currently us [Classic Load Balancers](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/introduction.html) as that is the kubernetes + We currently use [Classic Load Balancers](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/introduction.html) as that is the kubernetes default. Pricing varies by region, but is mostly $0.025 per hour + $0.008 per GB. From ef2af58d1bb6b40af152a1c5aeca9df21f93b217 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Mon, 18 Mar 2024 14:44:37 -0700 Subject: [PATCH 21/36] Clarify that GPUs are exclusively on their own pool --- docs/topic/billing/chargeable-resources.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index 4b18c94864..0fa7970b13 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -97,8 +97,10 @@ available in the cloud. Accelerators can often only be attached to a subset of a GPUs are the most commonly used with JupyterHubs, and are often the most expensive as well! Leaving a GPU running for days accidentally is the second easiest way to get a huge AWS bill (you'll meet NAT / network egress, the primary culprit, later in this document). So they are usually segregated -out into their own node pool that is spawned only for servers that require -GPU use, and shut down after. +out into their own node pool that is _exclusively_ for use by servers that need +GPUs - a [kubernetes taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) +is used to exclude all other user servers from those nodes. The cluster autoscaler +can stop the GPU node as soon as there are are actual GPU users, thus saving cost. ### Combining resource sizes when making a node From b6fc9663f6c25e6a5073f8802d2ef5efb9128bdb Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Mon, 18 Mar 2024 14:46:37 -0700 Subject: [PATCH 22/36] Clarify one drive per user --- docs/topic/billing/chargeable-resources.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index 0fa7970b13..46c15d267c 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -148,8 +148,10 @@ which sets up a *separate, dedicated* block storage device (think of it like a h (([EBS](https://aws.amazon.com/ebs/) on AWS, [Persistent Disk](https://cloud.google.com/persistent-disk?hl=en) on GCP and [Disk Storage](https://azure.microsoft.com/en-us/products/storage/disks) on Azure)) *per user*. 
This has the following advantages: -1. No single user can take over the entire drive or use up extremely large amounts of - space, as the max size is enforced by the cloud provider. +1. Since there is one drive per user, a user filling up their disk will only affect them - + there will be no effect on other users. This removes a failure mode associated with one + shared drive for all users - a single user can use up so much disk space that no other + users' server can start! 2. Performance is pretty fast, since the disks are dedicated to a single user. 3. It's incredibly easy to set up, as kubernetes takes care of everything for us. From 9fe545a32675f3e2931849f67dac8b31b38da33b Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Mon, 18 Mar 2024 14:49:20 -0700 Subject: [PATCH 23/36] Split billing resources out to their own page --- docs/topic/billing/accounts.md | 81 --------------------------------- docs/topic/billing/index.md | 1 + docs/topic/billing/reports.md | 82 ++++++++++++++++++++++++++++++++++ 3 files changed, 83 insertions(+), 81 deletions(-) create mode 100644 docs/topic/billing/reports.md diff --git a/docs/topic/billing/accounts.md b/docs/topic/billing/accounts.md index b2b82b277a..10f5517547 100644 --- a/docs/topic/billing/accounts.md +++ b/docs/topic/billing/accounts.md @@ -72,84 +72,3 @@ customers handle billing themselves. This is not uncommon - most Azure users want to use it because they already have a pre-existing *strong* relationship with Microsoft and their billing gets managed as part of that. - -## Billing reports in the web console - -Cloud providers provide a billing console accessible via the web, often -with very helpful reports. While not automation-friendly (see next section -for more information on automation), this is very helpful for doing ad-hoc -reporting as well as trying to optimize cloud costs. For dedicated clusters, -this may also be directly accessible to hub champions, allowing them to -explore their own usage. - -### GCP - -You can get a list of all billing accounts you have access to via the [billing console](https://console.cloud.google.com/billing) -(make sure you are logged in to the correct Google account). Alternatively, -you can select 'Billing' from the left sidebar in the Google Cloud Console -after selecting the correct project. - -Once you have selected a billing account, you can access a reporting -interface by selecting 'Reports' on the left sidebar. This lets you group -costs in different ways, and explore that over different time periods. -Grouping and filtering by project and service (representing -what kind of cloud product is costing us what) are the *most useful* aspects -here. You can also download whatever you see as a CSV, although programmatic -access to most of the reports here is unfortunately limited. - -### AWS - -For AWS, "Billing and Cost Management" is accessed from specific AWS -accounts. If using consolidated billing, it must be accessed from the -'management account' for that particular organization (see previous section -for more information). - -You can access "Billing and Cost Management" by typing "Billing and Cost -Management" into the top search bar once you're logged into the correct -AWS account. AWS has a million ways for you to slice and dice these reports, -but the most helpful place to start is the "Cost Explorer Saved Reports". -This provides 'Monthly costs by linked account' and 'Monthly costs by -service'. Once you open those reports, you can further filter and group -as you wish. 
A CSV of reports can also be downloaded here. - -### Azure - -We have limited expertise here, as we have never actually had billing -access to any Azure account! - -## Programmatic access to billing information - -The APIs for getting access to billing data *somehow* seem to be the -least well-developed parts of any cloud vendor's offerings. Usually, -they take the form of a 'billing export' - you set up a *destination* -where information about billing is written, often once a day. Then -you query this location for billing information. This is in sharp contrast -to most other cloud APIs, where the cloud vendor has a *single source of -truth* you can just query. Not so for billing - there's no external API -access to the various tools they seem to be able to use to provide the -web UIs. This also means we can't get *retroactive* access to billing data - -if we don't explicitly set up export, we have **no** programmatic access -to billing information. And once we set up export, we will only have -access from that point onwards, not retrospectively. - -1. GCP asks you to set up [export to BigQuery](https://cloud.google.com/billing/docs/how-to/export-data-bigquery), - their sql-like big data product. In particular, we should prefer setting - up [detailed billing export](https://cloud.google.com/billing/docs/how-to/export-data-bigquery-tables/detailed-usage). - This is set up **once per billing account**, and **not once per project**. - And one billing account can export only to BigQuery in one project, and - you can not easily move this table from one project to another. New - data is written into this once a day, at 5 AM Pacific Time. - -2. AWS asks you to set up [export to an S3 bucket](https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html), - where CSV files are produced once a day. While we *could* then import - this [into AWS Athena](https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html) - and query it, directly dealing with S3 is likely to be simpler for - our use cases. This data is also updated daily, once enabled. - -3. Azure is unknown territory for us here, as we have only supported Azure - for communities that bring their own credits and do their own cost - management. - -Our automated systems for billing need to read from these 'exported' -data sources, and we need to develop processes to make sure that -we *do* have the billing data exports enabled correctly. \ No newline at end of file diff --git a/docs/topic/billing/index.md b/docs/topic/billing/index.md index 450d54f85a..58e96ec390 100644 --- a/docs/topic/billing/index.md +++ b/docs/topic/billing/index.md @@ -7,5 +7,6 @@ terminology to talk about cloud billing. :maxdepth: 2 chargeable-resources accounts +reports tools ``` \ No newline at end of file diff --git a/docs/topic/billing/reports.md b/docs/topic/billing/reports.md new file mode 100644 index 0000000000..ef9eb854ee --- /dev/null +++ b/docs/topic/billing/reports.md @@ -0,0 +1,82 @@ +# Billing reports + +## Billing reports in the web console + +Cloud providers provide a billing console accessible via the web, often +with very helpful reports. While not automation-friendly (see next section +for more information on automation), this is very helpful for doing ad-hoc +reporting as well as trying to optimize cloud costs. For dedicated clusters, +this may also be directly accessible to hub champions, allowing them to +explore their own usage. 
+ +### GCP + +You can get a list of all billing accounts you have access to via the [billing console](https://console.cloud.google.com/billing) +(make sure you are logged in to the correct Google account). Alternatively, +you can select 'Billing' from the left sidebar in the Google Cloud Console +after selecting the correct project. + +Once you have selected a billing account, you can access a reporting +interface by selecting 'Reports' on the left sidebar. This lets you group +costs in different ways, and explore that over different time periods. +Grouping and filtering by project and service (representing +what kind of cloud product is costing us what) are the *most useful* aspects +here. You can also download whatever you see as a CSV, although programmatic +access to most of the reports here is unfortunately limited. + +### AWS + +For AWS, "Billing and Cost Management" is accessed from specific AWS +accounts. If using consolidated billing, it must be accessed from the +'management account' for that particular organization (see previous section +for more information). + +You can access "Billing and Cost Management" by typing "Billing and Cost +Management" into the top search bar once you're logged into the correct +AWS account. AWS has a million ways for you to slice and dice these reports, +but the most helpful place to start is the "Cost Explorer Saved Reports". +This provides 'Monthly costs by linked account' and 'Monthly costs by +service'. Once you open those reports, you can further filter and group +as you wish. A CSV of reports can also be downloaded here. + +### Azure + +We have limited expertise here, as we have never actually had billing +access to any Azure account! + +## Programmatic access to billing information + +The APIs for getting access to billing data *somehow* seem to be the +least well-developed parts of any cloud vendor's offerings. Usually, +they take the form of a 'billing export' - you set up a *destination* +where information about billing is written, often once a day. Then +you query this location for billing information. This is in sharp contrast +to most other cloud APIs, where the cloud vendor has a *single source of +truth* you can just query. Not so for billing - there's no external API +access to the various tools they seem to be able to use to provide the +web UIs. This also means we can't get *retroactive* access to billing data - +if we don't explicitly set up export, we have **no** programmatic access +to billing information. And once we set up export, we will only have +access from that point onwards, not retrospectively. + +1. GCP asks you to set up [export to BigQuery](https://cloud.google.com/billing/docs/how-to/export-data-bigquery), + their sql-like big data product. In particular, we should prefer setting + up [detailed billing export](https://cloud.google.com/billing/docs/how-to/export-data-bigquery-tables/detailed-usage). + This is set up **once per billing account**, and **not once per project**. + And one billing account can export only to BigQuery in one project, and + you can not easily move this table from one project to another. New + data is written into this once a day, at 5 AM Pacific Time. + +2. AWS asks you to set up [export to an S3 bucket](https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html), + where CSV files are produced once a day. 
While we *could* then import
+   this [into AWS Athena](https://docs.aws.amazon.com/cur/latest/userguide/cur-query-athena.html)
+   and query it, directly dealing with S3 is likely to be simpler for
+   our use cases. This data is also updated daily, once enabled.
+
+3. Azure is unknown territory for us here, as we have only supported Azure
+   for communities that bring their own credits and do their own cost
+   management.
+
+Our automated systems for billing need to read from these 'exported'
+data sources, and we need to develop processes to make sure that
+we *do* have the billing data exports enabled correctly.
\ No newline at end of file

From 0dda71458a1ad40353830f5238c96225c81ebf9a Mon Sep 17 00:00:00 2001
From: Yuvi Panda
Date: Mon, 18 Mar 2024 14:55:26 -0700
Subject: [PATCH 24/36] Clarify GKE wording

Co-authored-by: Erik Sundell
---
 docs/topic/billing/chargeable-resources.md | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md
index 46c15d267c..a0795038bf 100644
--- a/docs/topic/billing/chargeable-resources.md
+++ b/docs/topic/billing/chargeable-resources.md
@@ -561,13 +561,9 @@ separate from the nodes used to run various things in it. This is usually a *fix
 2. [GCP GKE](https://cloud.google.com/kubernetes-engine/pricing#standard_edition)
 
-   Is a bit messier - it costs $0.10 per hour, but is structured in such a way
-   that one single-zone (non highly available) cluster is free per GCP project.
-   While this is a cost saving of about 75$ per month, in most cases we
-   use *regional* (multi-zonal) clusters that are not free. This makes
-   master upgrades much more painfree, and reduces overall amount of downtime
-   the hub has. So our default here is to pay the 75$ a month for the
-   regional cluster.
+   Costs $0.10 per hour flat.
+
+   There is a discount of about 75$ per month allowing one *zonal* (single zone, non highly available) cluster per GCP billing account continuously for free. This discount is of little value to us and communities since we favor the *regional* (multi-zone, highly available) clusters for less disruptive kubernetes control plane upgrades, and since we don't manage that many separate billing accounts.
 
    GKE also has an 'autopilot' edition that we have tried to use in the
    past and decided against using, for various restrictions it has on

From 318db9c16e99a9978e611f104089f8b4235b612a Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Mon, 18 Mar 2024 14:55:44 -0700
Subject: [PATCH 25/36] Cleanup long line

---
 docs/topic/billing/chargeable-resources.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md
index a0795038bf..b9daa57940 100644
--- a/docs/topic/billing/chargeable-resources.md
+++ b/docs/topic/billing/chargeable-resources.md
@@ -562,8 +562,13 @@
 2. [GCP GKE](https://cloud.google.com/kubernetes-engine/pricing#standard_edition)
 
    Costs $0.10 per hour flat.
-
-   There is a discount of about 75$ per month allowing one *zonal* (single zone, non highly available) cluster per GCP billing account continuously for free. This discount is of little value to us and communities since we favor the *regional* (multi-zone, highly available) clusters for less disruptive kubernetes control plane upgrades, and since we don't manage that many separate billing accounts.
+
+   There is a discount of about 75$ per month allowing one *zonal* (single zone,
+   non highly available) cluster per GCP billing account continuously for free.
+   This discount is of little value to us and communities since we favor
+   the *regional* (multi-zone, highly available) clusters for less disruptive
+   kubernetes control plane upgrades, and since we don't manage that many
+   separate billing accounts.
 
    GKE also has an 'autopilot' edition that we have tried to use in the
    past and decided against using, for various restrictions it has on

From 74ca80ce2a7ad591191930a6dcf468c54fbbac1e Mon Sep 17 00:00:00 2001
From: sean-morris
Date: Mon, 18 Mar 2024 15:43:29 -0700
Subject: [PATCH 26/36] [CloudBank] SRJC Admins and Microsoft IDP

---
 config/clusters/cloudbank/srjc.values.yaml | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/config/clusters/cloudbank/srjc.values.yaml b/config/clusters/cloudbank/srjc.values.yaml
index dd2228325f..5cf8873368 100644
--- a/config/clusters/cloudbank/srjc.values.yaml
+++ b/config/clusters/cloudbank/srjc.values.yaml
@@ -36,6 +36,11 @@ jupyterhub:
         CILogonOAuthenticator:
           oauth_callback_url: https://srjc.cloudbank.2i2c.cloud/hub/oauth_callback
           allowed_idps:
+            http://login.microsoftonline.com/common/oauth2/v2.0/authorize:
+              username_derivation:
+                username_claim: "email"
+              allowed_domains:
+                - "santarosa.edu"
             http://google.com/accounts/o8/id:
               default: true
               username_derivation:
@@ -47,4 +52,6 @@
       - ericvd@berkeley.edu
       - sean.smorris@berkeley.edu
       - mmckeever@santarosa.edu
-      - mjmckeever496@gmail.com
+      - ylin@santarosa.edu
+      - hwong@santarosa.edu
+      - dharden@santarosa.edu

From ede6494f500c4ad0f35ef090d68f8da4fd08ad5b Mon Sep 17 00:00:00 2001
From: sean-morris
Date: Mon, 18 Mar 2024 16:42:08 -0700
Subject: [PATCH 27/36] [CloudBank] SBCC remove inCommon

---
 config/clusters/cloudbank/sbcc-dev.values.yaml | 4 ----
 config/clusters/cloudbank/sbcc.values.yaml     | 4 ----
 2 files changed, 8 deletions(-)

diff --git a/config/clusters/cloudbank/sbcc-dev.values.yaml b/config/clusters/cloudbank/sbcc-dev.values.yaml
index 70bfb1d21a..664e62179e 100644
--- a/config/clusters/cloudbank/sbcc-dev.values.yaml
+++ b/config/clusters/cloudbank/sbcc-dev.values.yaml
@@ -30,10 +30,6 @@ jupyterhub:
         CILogonOAuthenticator:
           oauth_callback_url: "https://sbcc-dev.cloudbank.2i2c.cloud/hub/oauth_callback"
           allowed_idps:
-            https://idp.sbcc.edu/idp/shibboleth:
-              default: true
-              username_derivation:
-                username_claim: "email"
             http://google.com/accounts/o8/id:
               username_derivation:
                 username_claim: "email"
diff --git a/config/clusters/cloudbank/sbcc.values.yaml b/config/clusters/cloudbank/sbcc.values.yaml
index 7edbf3a6ca..c63f644e60 100644
--- a/config/clusters/cloudbank/sbcc.values.yaml
+++ b/config/clusters/cloudbank/sbcc.values.yaml
@@ -30,10 +30,6 @@ jupyterhub:
         CILogonOAuthenticator:
           oauth_callback_url: "https://sbcc.cloudbank.2i2c.cloud/hub/oauth_callback"
           allowed_idps:
-            https://idp.sbcc.edu/idp/shibboleth:
-              default: true
-              username_derivation:
-                username_claim: "email"
             http://google.com/accounts/o8/id:
               username_derivation:
                 username_claim: "email"

From a0ef479d44fe7179369b969add983a2c2ec0dd43 Mon Sep 17 00:00:00 2001
From: Yuvi Panda
Date: Mon, 18 Mar 2024 18:18:14 -0700
Subject: [PATCH 28/36] Clarify wording

Co-authored-by: Erik Sundell
---
 docs/topic/billing/tools.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/topic/billing/tools.md b/docs/topic/billing/tools.md
index 67252d43d8..04b6568525 100644
--- a/docs/topic/billing/tools.md
+++ b/docs/topic/billing/tools.md @@ -20,9 +20,9 @@ count *all* costs emanating from any resource tagged with that tag. For example, if a node is tagged with a particular tag, the following separate things will be associated with the tag: -1. The amount of memory allocated to that node -2. The amount of CPU allocated to the node -3. (If tagged correctly) The base disk allocated to that node +1. The node's memory +2. The node's CPU +3. (If tagged correctly) The node's boot disk 4. Any network costs accrued by actions of processes on that node (subject to some complexity) From 145be54d3a720f4d8ed2aec5ffeefc1e1383805b Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Mon, 18 Mar 2024 18:41:52 -0700 Subject: [PATCH 29/36] Address feedback from Bri Lind Co-authored-by: Bri <104585874+BriannaLind@users.noreply.github.com> --- docs/topic/billing/chargeable-resources.md | 51 +++++++++++++--------- 1 file changed, 31 insertions(+), 20 deletions(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index b9daa57940..79710697a7 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -13,12 +13,13 @@ have a baseline to reason from. ## Nodes / Virtual Machines -The unit of *compute* in the cloud is a virtual machine (a 'node' in +The unit of *compute* in the cloud is a virtual machine (a +[node](https://kubernetes.io/docs/concepts/architecture/nodes/) in kubernetes terminology). It is assembled from the following components: 1. CPU (count and architecture) -2. Memory (just size) +2. Memory (only capacity) 3. Boot disk (size and type, ephemeral) 4. Optional accelerators (like GPUs, TPUs, network accelerators, etc) @@ -35,7 +36,7 @@ CPU specification for a node is three components: The 'default' everyone assumes is x86_64 for architecture, and some sort of recent intel processor for make. So if someone says 'I want 4 CPUs', we can -currently assume they just mean they want 4 x86_64 Intel CPUs. x86_64 intel +currently assume they mean they want 4 x86_64 Intel CPUs. x86_64 intel CPUs are also often the most expensive - AMD CPUs are cheaper, ARM CPUs are cheaper still. So it's a possible place to make cost differences - *but*, some software may not work correctly on different CPUs. While ARM support @@ -51,15 +52,17 @@ of a node costs for our usage pattern. ### Memory -Unlike CPU, memory is simpler. A GB is a GB! We pay for how much GB the -VM we request has, regardless of whether we use it or not. +Unlike CPU, memory is simpler - capacity is what matters. A GB of memory is a GB +of memory! We pay for how much memory capacity the VM we request has, regardless +of whether we use it or not. ### Boot disk (ephemeral) Each node needs some disk space that lives only as long as the node is up, to store: -1. The basic underlying OS that runs the node itself (Ubuntu, [COS](https://cloud.google.com/container-optimized-os/docs) or similar) +1. The basic underlying operating system that runs the node itself (Ubuntu, + [COS](https://cloud.google.com/container-optimized-os/docs) or similar) 2. Storage for any docker images that need to be pulled and used on the node 3. Temporary storage for any changes made to the *running containers* in a pod. For example, if a user runs `pip install` or `conda install` in a @@ -143,9 +146,13 @@ not, cloud providers will charge us for it regardless of whether the user is act using it or not. 
Hence, choosing what cloud products we use here becomes crucial - if we pick
 something that is a poor fit, storage costs can easily dwarf everything else.
 
-The default from z2jh is to use [Dynamic Volume Provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/),
-which sets up a *separate, dedicated* block storage device (think of it like a hard drive)
-(([EBS](https://aws.amazon.com/ebs/) on AWS, [Persistent Disk](https://cloud.google.com/persistent-disk?hl=en) on GCP and [Disk Storage](https://azure.microsoft.com/en-us/products/storage/disks) on Azure))
+The default from [z2jh](https://z2jh.jupyter.org/en/stable/) is to use [Dynamic
+Volume
+Provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/),
+which sets up a *separate, dedicated* block storage device (think of it like a
+hard drive) ([EBS](https://aws.amazon.com/ebs/) on AWS, [Persistent
+Disk](https://cloud.google.com/persistent-disk?hl=en) on GCP and [Disk
+Storage](https://azure.microsoft.com/en-us/products/storage/disks) on Azure)
 *per user*. This has the following advantages:
 
 1. Since there is one drive per user, a user filling up their disk will only affect them -
@@ -204,7 +211,7 @@ need to be performed, we the users of the cloud are not privy to any of that.
 Instead, we operate in the world of [software defined networking](https://www.cloudflare.com/learning/network-layer/what-is-sdn/)
 (or SDN). The fundamental idea here is that instead of having to physically
 connect cables around, you can define network configuration in software somehow,
-and it 'just works'. Without SDN, instead of just creating two VMs in the
+and it 'just works'. Without SDN, instead of simply creating two VMs in the
 same network from a web UI and expecting them to communicate with each other,
 you would have to email support, and someone will have to go physically plug
 in a cable from somewhere to somewhere else. Instead of being annoyed that
@@ -219,8 +226,8 @@ Depending on the source and destination of this transmission, **we may get charg
 so it's important to have a good mental model of how network costs work when thinking
 about cloud costs.
 
-While the overall topic of network costs in the cloud is complex, thankfully
-we can get away with a simplified model when operating just JupyterHubs.
+While the overall topic of network costs in the cloud is complex, if we are only concerned
+with operating JupyterHubs, we can get away with a simplified model.
 
 ### Network Boundaries
 
@@ -286,9 +293,12 @@ A core use of JupyterHub is to put the compute near where the data is, so
 *most likely* we are simply not going to see any significant egress fees at
 all, as we are accessing data in the same region / zone. Object storage
 access is the primary point of contention here for the kind of infrastructure
-we manage, so as long as we are careful about that we should be ok. More
-information about managing this is present in the [object storage section](topic:billing:resources:object-storage)
-of this guide.
+we manage, so as long as we are careful about that we should be ok. If you have data
+in more than one region / zone, set up a hub in each region / zone you have data in -
+move the compute to where the data is, rather than the other way around.
+
+More information about managing this is present in the [object storage
+section](topic:billing:resources:object-storage) of this guide.
 
 (topic:billing:resources:object-storage)=
 ## Object storage
@@ -355,7 +365,7 @@ take that data with you somewhere else.
 This is egregious enough now that [regulators are noticing](https://www.cnbc.com/2023/10/05/amazon-and-microsofts-cloud-dominance-referred-for-uk-competition-probe.html),
 and that is probably the long term solution.
 
-In the meantime, the answer is fairly simple - just **never** expose your
+In the meantime, the answer is fairly simple - **never** expose your
 object store buckets to the public internet! Treat them as 'internal'
 storage only. JupyterHubs should be put in the same region as the object
 store buckets they need to access, so this access can be free of cost,
@@ -405,7 +415,7 @@ the per GB charge usually doesn't add up to much (even uploading 1 terabyte of d
 ## Access to the external internet
 
 Users on JupyterHubs generally expect to be able to access the external
-internet, rather than just internal services. This requires one of two
+internet, rather than only internal services. This requires one of two
 solutions:
 
 1. Each **node** of our kubernetes cluster gets an **external IP**, so
@@ -471,9 +481,10 @@ layer of security as well.
 
 One of the many memes about Cloud NAT being expensive, from [QuinnyPig](https://twitter.com/QuinnyPig/status/1357391731902341120). Many seem far more violent. See [this post](https://www.lastweekinaws.com/blog/the-aws-managed-nat-gateway-is-unpleasant-and-not-recommended/) for more information.
 ```
-However, it is the **single most expensive** thing one can do on any cloud
-provider, and must be avoided at all costs. Instead of data *ingress*
-being free, it becomes pretty incredibly expensive.
+However, using a cloud NAT for outbound internet access is the **single most
+expensive** thing one can do on any cloud provider, and must be avoided at all
+costs. Instead of data *ingress* being free, it becomes incredibly
+expensive.
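+
+To get a feel for the scale before looking at exact rates, here is a quick
+back-of-envelope sketch in Python - the per-GB rate used is an illustrative
+assumption, so substitute the real number from the pricing table below:
+
+```python
+# Back-of-envelope NAT gateway data processing cost.
+# 0.045 USD/GB is an assumed illustrative rate - check your provider's pricing.
+RATE_PER_GB = 0.045
+
+def monthly_nat_cost(gb_processed: float) -> float:
+    """Cost of pushing `gb_processed` GB through a cloud NAT in a month."""
+    return gb_processed * RATE_PER_GB
+
+# A hub whose users pull 10 TB a month through the NAT:
+print(f"${monthly_nat_cost(10_000):,.2f}")  # -> $450.00
+```
+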
| Cloud Provider | Cost | Pricing Link | | - | - | - | From dd787c0a17c3061bda80d224ea16593c6d0733ae Mon Sep 17 00:00:00 2001 From: Erik Sundell Date: Tue, 19 Mar 2024 08:53:28 +0100 Subject: [PATCH 30/36] Always check for latest rocker/binder image as tags get rebuilt --- .../clusters/2i2c-aws-us/showcase.values.yaml | 2 + .../catalystproject-africa/common.values.yaml | 1 + .../catalystproject-latam/common.values.yaml | 1 + config/clusters/earthscope/common.values.yaml | 1 + config/clusters/hhmi/common.values.yaml | 1 + config/clusters/nasa-ghg/common.values.yaml | 1 + config/clusters/nasa-veda/common.values.yaml | 1 + .../clusters/opensci/sciencecore.values.yaml | 1 + config/clusters/opensci/staging.values.yaml | 1 + .../clusters/smithsonian/common.values.yaml | 1 + docs/howto/features/rocker.md | 43 ++++++++++--------- 11 files changed, 33 insertions(+), 21 deletions(-) diff --git a/config/clusters/2i2c-aws-us/showcase.values.yaml b/config/clusters/2i2c-aws-us/showcase.values.yaml index aba420a672..5c3c69d86d 100644 --- a/config/clusters/2i2c-aws-us/showcase.values.yaml +++ b/config/clusters/2i2c-aws-us/showcase.values.yaml @@ -100,6 +100,7 @@ basehub: slug: rocker-geospatial kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch into RStudio after the user logs in default_url: /rstudio # Ensures container working dir is homedir @@ -159,6 +160,7 @@ basehub: slug: rocker-geospatial kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch into RStudio after the user logs in default_url: /rstudio # Ensures container working dir is homedir diff --git a/config/clusters/catalystproject-africa/common.values.yaml b/config/clusters/catalystproject-africa/common.values.yaml index f5e468b905..d59776bd4a 100644 --- a/config/clusters/catalystproject-africa/common.values.yaml +++ b/config/clusters/catalystproject-africa/common.values.yaml @@ -64,6 +64,7 @@ jupyterhub: slug: rocker kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always default_url: /rstudio working_dir: /home/rstudio # Ensures container working dir is homedir node_selector: diff --git a/config/clusters/catalystproject-latam/common.values.yaml b/config/clusters/catalystproject-latam/common.values.yaml index c0f5cd07d8..a6f86b9b7b 100644 --- a/config/clusters/catalystproject-latam/common.values.yaml +++ b/config/clusters/catalystproject-latam/common.values.yaml @@ -50,6 +50,7 @@ jupyterhub: slug: rocker kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always default_url: /rstudio working_dir: /home/rstudio # Ensures container working dir is homedir profile_options: *profile_options diff --git a/config/clusters/earthscope/common.values.yaml b/config/clusters/earthscope/common.values.yaml index 29b2c8b9d1..7ae4ccdede 100644 --- a/config/clusters/earthscope/common.values.yaml +++ b/config/clusters/earthscope/common.values.yaml @@ -180,6 +180,7 @@ basehub: slug: rocker-geospatial kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch into RStudio after the user logs in default_url: /rstudio # Ensures container working dir is homedir diff --git a/config/clusters/hhmi/common.values.yaml b/config/clusters/hhmi/common.values.yaml index 71db9fc216..7eeaa3d344 100644 --- a/config/clusters/hhmi/common.values.yaml +++ b/config/clusters/hhmi/common.values.yaml @@ -165,6 +165,7 @@ basehub: slug: rocker kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch RStudio after the user logs in default_url: 
/rstudio # Ensures container working dir is homedir diff --git a/config/clusters/nasa-ghg/common.values.yaml b/config/clusters/nasa-ghg/common.values.yaml index 169ef7a1e4..dafeaf9633 100644 --- a/config/clusters/nasa-ghg/common.values.yaml +++ b/config/clusters/nasa-ghg/common.values.yaml @@ -140,6 +140,7 @@ basehub: description: R environment with many geospatial libraries pre-installed kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch RStudio after the user logs in default_url: /rstudio # Ensures container working dir is homedir diff --git a/config/clusters/nasa-veda/common.values.yaml b/config/clusters/nasa-veda/common.values.yaml index 44dc2eb07a..9d82aa4e3e 100644 --- a/config/clusters/nasa-veda/common.values.yaml +++ b/config/clusters/nasa-veda/common.values.yaml @@ -177,6 +177,7 @@ basehub: description: R environment with many geospatial libraries pre-installed kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch RStudio after the user logs in default_url: /rstudio # Ensures container working dir is homedir diff --git a/config/clusters/opensci/sciencecore.values.yaml b/config/clusters/opensci/sciencecore.values.yaml index 60688b0691..446ecb418b 100644 --- a/config/clusters/opensci/sciencecore.values.yaml +++ b/config/clusters/opensci/sciencecore.values.yaml @@ -56,6 +56,7 @@ jupyterhub: slug: geospatial kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch into RStudio after the user logs in default_url: /rstudio # Ensures container working dir is homedir diff --git a/config/clusters/opensci/staging.values.yaml b/config/clusters/opensci/staging.values.yaml index 4fc5552a9b..b84f9a548b 100644 --- a/config/clusters/opensci/staging.values.yaml +++ b/config/clusters/opensci/staging.values.yaml @@ -56,6 +56,7 @@ jupyterhub: slug: geospatial kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch into RStudio after the user logs in default_url: /rstudio # Ensures container working dir is homedir diff --git a/config/clusters/smithsonian/common.values.yaml b/config/clusters/smithsonian/common.values.yaml index b313591b39..22bdc29821 100644 --- a/config/clusters/smithsonian/common.values.yaml +++ b/config/clusters/smithsonian/common.values.yaml @@ -87,6 +87,7 @@ basehub: slug: geospatial kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always working_dir: /home/rstudio default_url: /rstudio scipy: diff --git a/docs/howto/features/rocker.md b/docs/howto/features/rocker.md index b27777e0bf..2c120a81d9 100644 --- a/docs/howto/features/rocker.md +++ b/docs/howto/features/rocker.md @@ -96,27 +96,28 @@ When using profileLists, it should probably look like this: jupyterhub: singleuser: profileList: - - display_name: "Some Profile Name" - profile_options: - image: - display_name: Image - choices: - geospatial: - display_name: Rocker Geospatial - default: true - slug: geospatial - kubespawner_override: - image: rocker/binder:4.3 - # Launch into RStudio after the user logs in - default_url: /rstudio - # Ensures container working dir is homedir - # https://github.com/2i2c-org/infrastructure/issues/2559 - working_dir: /home/rstudio - scipy: - display_name: Jupyter SciPy Notebook - slug: scipy - kubespawner_override: - image: jupyter/scipy-notebook:2023-06-26 + - display_name: "Some Profile Name" + profile_options: + image: + display_name: Image + choices: + geospatial: + display_name: Rocker Geospatial + default: true + slug: geospatial + kubespawner_override: + 
image: rocker/binder:4.3 + image_pull_policy: Always + # Launch into RStudio after the user logs in + default_url: /rstudio + # Ensures container working dir is homedir + # https://github.com/2i2c-org/infrastructure/issues/2559 + working_dir: /home/rstudio + scipy: + display_name: Jupyter SciPy Notebook + slug: scipy + kubespawner_override: + image: jupyter/scipy-notebook:2023-06-26 ``` ## User installed packages From 7f27082d864877cf6162e910042fd53b92f5bb2e Mon Sep 17 00:00:00 2001 From: Erik Sundell Date: Tue, 19 Mar 2024 09:06:49 +0100 Subject: [PATCH 31/36] Use quay.io for jupyter/docker-stacks project's images --- config/clusters/2i2c-aws-us/showcase.values.yaml | 8 ++++---- config/clusters/2i2c/imagebuilding-demo.values.yaml | 3 ++- config/clusters/earthscope/common.values.yaml | 2 +- .../clusters/jupyter-meets-the-earth/common.values.yaml | 2 +- config/clusters/opensci/sciencecore.values.yaml | 2 +- config/clusters/opensci/staging.values.yaml | 2 +- config/clusters/smithsonian/common.values.yaml | 2 +- docs/howto/features/imagebuilding.md | 2 +- docs/howto/features/rocker.md | 2 +- helm-charts/basehub/values.yaml | 4 ++-- 10 files changed, 15 insertions(+), 14 deletions(-) diff --git a/config/clusters/2i2c-aws-us/showcase.values.yaml b/config/clusters/2i2c-aws-us/showcase.values.yaml index 5c3c69d86d..46e1990f3b 100644 --- a/config/clusters/2i2c-aws-us/showcase.values.yaml +++ b/config/clusters/2i2c-aws-us/showcase.values.yaml @@ -89,12 +89,12 @@ basehub: display_name: Jupyter SciPy slug: jupyter-scipy kubespawner_override: - image: jupyter/scipy-notebook:2023-06-27 + image: quay.io/jupyter/scipy-notebook:2024-03-18 jupyter-datascience: display_name: Jupyter DataScience slug: jupyter-datascience kubespawner_override: - image: jupyter/datascience-notebook:2023-06-27 + image: quay.io/jupyter/datascience-notebook:2024-03-18 rocker-geospatial: display_name: Rocker Geospatial slug: rocker-geospatial @@ -149,12 +149,12 @@ basehub: display_name: Jupyter SciPy slug: jupyter-scipy kubespawner_override: - image: jupyter/scipy-notebook:2023-06-27 + image: quay.io/jupyter/scipy-notebook:2024-03-18 jupyter-datascience: display_name: Jupyter DataScience slug: jupyter-datascience kubespawner_override: - image: jupyter/datascience-notebook:2023-06-27 + image: quay.io/jupyter/datascience-notebook:2024-03-18 rocker-geospatial: display_name: Rocker Geospatial slug: rocker-geospatial diff --git a/config/clusters/2i2c/imagebuilding-demo.values.yaml b/config/clusters/2i2c/imagebuilding-demo.values.yaml index e6ac20b8e6..1b256d8ea6 100644 --- a/config/clusters/2i2c/imagebuilding-demo.values.yaml +++ b/config/clusters/2i2c/imagebuilding-demo.values.yaml @@ -57,6 +57,7 @@ jupyterhub: slug: geospatial kubespawner_override: image: rocker/binder:4.3 + image_pull_policy: Always # Launch into RStudio after the user logs in default_url: /rstudio # Ensures container working dir is homedir @@ -66,7 +67,7 @@ jupyterhub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: jupyter/scipy-notebook:2023-06-26 + image: quay.io/jupyter/scipy-notebook:2024-03-18 resources: display_name: Resource Allocation choices: diff --git a/config/clusters/earthscope/common.values.yaml b/config/clusters/earthscope/common.values.yaml index 7ae4ccdede..603ef7cac7 100644 --- a/config/clusters/earthscope/common.values.yaml +++ b/config/clusters/earthscope/common.values.yaml @@ -174,7 +174,7 @@ basehub: display_name: Jupyter slug: jupyter-scipy kubespawner_override: - image: jupyter/scipy-notebook:2023-06-27 + 
image: quay.io/jupyter/scipy-notebook:2024-03-18 rocker-geospatial: display_name: RStudio slug: rocker-geospatial diff --git a/config/clusters/jupyter-meets-the-earth/common.values.yaml b/config/clusters/jupyter-meets-the-earth/common.values.yaml index e45f1810ab..4dd81c8ece 100644 --- a/config/clusters/jupyter-meets-the-earth/common.values.yaml +++ b/config/clusters/jupyter-meets-the-earth/common.values.yaml @@ -128,7 +128,7 @@ basehub: display_name: Jupyter DockerStacks Julia Notebook slug: "julia" kubespawner_override: - image: "jupyter/julia-notebook:2023-07-05" + image: "quay.io/jupyter/julia-notebook:2024-03-18" - display_name: "4th of Medium: 1-4 CPU, 4-16 GB" description: "A shared machine." profile_options: *profile_options diff --git a/config/clusters/opensci/sciencecore.values.yaml b/config/clusters/opensci/sciencecore.values.yaml index 446ecb418b..4358928b2c 100644 --- a/config/clusters/opensci/sciencecore.values.yaml +++ b/config/clusters/opensci/sciencecore.values.yaml @@ -66,7 +66,7 @@ jupyterhub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: jupyter/scipy-notebook:2023-06-26 + image: quay.io/jupyter/scipy-notebook:2024-03-18 resources: display_name: Resource Allocation choices: diff --git a/config/clusters/opensci/staging.values.yaml b/config/clusters/opensci/staging.values.yaml index b84f9a548b..85da925f9b 100644 --- a/config/clusters/opensci/staging.values.yaml +++ b/config/clusters/opensci/staging.values.yaml @@ -66,7 +66,7 @@ jupyterhub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: jupyter/scipy-notebook:2023-06-26 + image: quay.io/jupyter/scipy-notebook:2024-03-18 resources: display_name: Resource Allocation choices: diff --git a/config/clusters/smithsonian/common.values.yaml b/config/clusters/smithsonian/common.values.yaml index 22bdc29821..cd60608525 100644 --- a/config/clusters/smithsonian/common.values.yaml +++ b/config/clusters/smithsonian/common.values.yaml @@ -94,7 +94,7 @@ basehub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: "jupyter/scipy-notebook:2023-09-04" + image: "quay.io/jupyter/scipy-notebook:2024-03-18" pangeo: display_name: Pangeo Notebook slug: pangeo diff --git a/docs/howto/features/imagebuilding.md b/docs/howto/features/imagebuilding.md index c1c9ad8bb0..1465b315df 100644 --- a/docs/howto/features/imagebuilding.md +++ b/docs/howto/features/imagebuilding.md @@ -66,7 +66,7 @@ jupyterhub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: jupyter/scipy-notebook:2023-06-26 + image: quay.io/jupyter/scipy-notebook:2024-03-18 ``` ## Setup the image registry diff --git a/docs/howto/features/rocker.md b/docs/howto/features/rocker.md index 2c120a81d9..aae395c121 100644 --- a/docs/howto/features/rocker.md +++ b/docs/howto/features/rocker.md @@ -117,7 +117,7 @@ jupyterhub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: jupyter/scipy-notebook:2023-06-26 + image: quay.io/jupyter/scipy-notebook:2024-03-18 ``` ## User installed packages diff --git a/helm-charts/basehub/values.yaml b/helm-charts/basehub/values.yaml index 56e3ade80a..33def114a3 100644 --- a/helm-charts/basehub/values.yaml +++ b/helm-charts/basehub/values.yaml @@ -352,8 +352,8 @@ jupyterhub: startTimeout: 600 # 10 mins, node startup + image pulling sometimes takes more than the default 5min defaultUrl: /tree image: - name: jupyter/scipy-notebook - tag: "2023-06-19" + name: quay.io/jupyter/scipy-notebook + tag: "2024-03-18" storage: 
type: static static: From dc976a4f63b62463da7c7cb52d34eac3a17e9335 Mon Sep 17 00:00:00 2001 From: Erik Sundell Date: Tue, 19 Mar 2024 09:29:15 +0100 Subject: [PATCH 32/36] Avoid possibly breaking change, use fixme notes --- config/clusters/earthscope/common.values.yaml | 3 ++- config/clusters/jupyter-meets-the-earth/common.values.yaml | 3 ++- config/clusters/opensci/sciencecore.values.yaml | 3 ++- config/clusters/opensci/staging.values.yaml | 3 ++- config/clusters/smithsonian/common.values.yaml | 3 ++- 5 files changed, 10 insertions(+), 5 deletions(-) diff --git a/config/clusters/earthscope/common.values.yaml b/config/clusters/earthscope/common.values.yaml index 603ef7cac7..ff61a460ec 100644 --- a/config/clusters/earthscope/common.values.yaml +++ b/config/clusters/earthscope/common.values.yaml @@ -174,7 +174,8 @@ basehub: display_name: Jupyter slug: jupyter-scipy kubespawner_override: - image: quay.io/jupyter/scipy-notebook:2024-03-18 + # FIXME: use quay.io/ for tags after 2023-10-20 + image: jupyter/scipy-notebook:2023-06-27 rocker-geospatial: display_name: RStudio slug: rocker-geospatial diff --git a/config/clusters/jupyter-meets-the-earth/common.values.yaml b/config/clusters/jupyter-meets-the-earth/common.values.yaml index 4dd81c8ece..c5c53073f7 100644 --- a/config/clusters/jupyter-meets-the-earth/common.values.yaml +++ b/config/clusters/jupyter-meets-the-earth/common.values.yaml @@ -128,7 +128,8 @@ basehub: display_name: Jupyter DockerStacks Julia Notebook slug: "julia" kubespawner_override: - image: "quay.io/jupyter/julia-notebook:2024-03-18" + # FIXME: use quay.io/ for tags after 2023-10-20 + image: "jupyter/julia-notebook:2023-07-05" - display_name: "4th of Medium: 1-4 CPU, 4-16 GB" description: "A shared machine." profile_options: *profile_options diff --git a/config/clusters/opensci/sciencecore.values.yaml b/config/clusters/opensci/sciencecore.values.yaml index 4358928b2c..fe7d541feb 100644 --- a/config/clusters/opensci/sciencecore.values.yaml +++ b/config/clusters/opensci/sciencecore.values.yaml @@ -66,7 +66,8 @@ jupyterhub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: quay.io/jupyter/scipy-notebook:2024-03-18 + # FIXME: use quay.io/ for tags after 2023-10-20 + image: jupyter/scipy-notebook:2023-06-26 resources: display_name: Resource Allocation choices: diff --git a/config/clusters/opensci/staging.values.yaml b/config/clusters/opensci/staging.values.yaml index 85da925f9b..dc440a8d1d 100644 --- a/config/clusters/opensci/staging.values.yaml +++ b/config/clusters/opensci/staging.values.yaml @@ -66,7 +66,8 @@ jupyterhub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: quay.io/jupyter/scipy-notebook:2024-03-18 + # FIXME: use quay.io/ for tags after 2023-10-20 + image: jupyter/scipy-notebook:2023-06-26 resources: display_name: Resource Allocation choices: diff --git a/config/clusters/smithsonian/common.values.yaml b/config/clusters/smithsonian/common.values.yaml index cd60608525..07f200e906 100644 --- a/config/clusters/smithsonian/common.values.yaml +++ b/config/clusters/smithsonian/common.values.yaml @@ -94,7 +94,8 @@ basehub: display_name: Jupyter SciPy Notebook slug: scipy kubespawner_override: - image: "quay.io/jupyter/scipy-notebook:2024-03-18" + # FIXME: use quay.io/ for tags after 2023-10-20 + image: "jupyter/scipy-notebook:2023-09-04" pangeo: display_name: Pangeo Notebook slug: pangeo From 9efa03cbf71b97c937a6172da012bff3cd75c651 Mon Sep 17 00:00:00 2001 From: Brian Freitag Date: Tue, 19 Mar 2024 13:44:45 -0500 
Subject: [PATCH 33/36] added access to sentinel-cogs bucket --- terraform/aws/projects/nasa-veda.tfvars | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/terraform/aws/projects/nasa-veda.tfvars b/terraform/aws/projects/nasa-veda.tfvars index 8baaa0ae79..781621c33e 100644 --- a/terraform/aws/projects/nasa-veda.tfvars +++ b/terraform/aws/projects/nasa-veda.tfvars @@ -63,7 +63,9 @@ hub_cloud_permissions = { "arn:aws:s3:::sdap-dev-zarr", "arn:aws:s3:::sdap-dev-zarr/*", "arn:aws:s3:::usgs-landsat", - "arn:aws:s3:::usgs-landsat/*" + "arn:aws:s3:::usgs-landsat/*", + "arn:aws:s3:::sentinel-cogs", + "arn:aws:s3:::sentinel-cogs/*", ] }, { From 1a232535d47da618975728672048686a616a6f6f Mon Sep 17 00:00:00 2001 From: Yuvi Panda Date: Tue, 19 Mar 2024 22:16:03 -0700 Subject: [PATCH 34/36] Fix typos Co-authored-by: Damian Avila --- docs/topic/billing/chargeable-resources.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/topic/billing/chargeable-resources.md b/docs/topic/billing/chargeable-resources.md index 79710697a7..985fa85fde 100644 --- a/docs/topic/billing/chargeable-resources.md +++ b/docs/topic/billing/chargeable-resources.md @@ -121,7 +121,7 @@ memory bound, not CPU bound. ### Autoscaling makes nodes temporary -Nodes in our workloads are are *temporary* - they come and go with user +Nodes in our workloads are *temporary* - they come and go with user needs, and get replaced when version upgrades happen. They have names like `gke-jup-test-default-pool-47300cca-wk01` - notice the random string of characters towards the end. The [cluster autoscaler](https://github.com/kubernetes/autoscaler/) @@ -255,7 +255,7 @@ boundaries are crossed and usually this costs no money. As a general rule, when each boundary gets crossed, the cost of the data transmission goes up. Within a cloud provider, transmission between regions is often priced based on the distance between them - australia to -the US is more expensive than canada to the US. Data transmission from within a cloud provider to outside is the most expensive, and is also often +the US is more expensive than Canada to the US. Data transmission from within a cloud provider to outside is the most expensive, and is also often metered by geographical location. This also applies to data present in any *cloud services* as well, although From 6a873143e21a8fbf7b0ac54bdd0c10aed6c4b578 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Tue, 19 Mar 2024 22:29:35 -0700 Subject: [PATCH 35/36] Clarify what 'here' refers to --- docs/topic/billing/reports.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/topic/billing/reports.md b/docs/topic/billing/reports.md index ef9eb854ee..791170424a 100644 --- a/docs/topic/billing/reports.md +++ b/docs/topic/billing/reports.md @@ -37,7 +37,7 @@ AWS account. AWS has a million ways for you to slice and dice these reports, but the most helpful place to start is the "Cost Explorer Saved Reports". This provides 'Monthly costs by linked account' and 'Monthly costs by service'. Once you open those reports, you can further filter and group -as you wish. A CSV of reports can also be downloaded here. +as you wish. A CSV of reports can also be downloaded from this page. 
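+
+Once downloaded, such a CSV is easy to slice further offline. A minimal
+sketch with pandas - the filename and column names here are assumptions
+(they vary by report type), so check the header row of your actual export:
+
+```python
+# Group a downloaded Cost Explorer CSV by service.
+# "Service" and "Total costs ($)" are assumed column names - inspect your CSV first.
+import pandas as pd
+
+df = pd.read_csv("costs.csv")  # hypothetical filename
+per_service = df.groupby("Service")["Total costs ($)"].sum()
+print(per_service.sort_values(ascending=False).head(10))
+```
+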
### Azure From a24d02e553a9edd38e094d3e660b5497e8363a11 Mon Sep 17 00:00:00 2001 From: Sarah Gibson <44771837+sgibson91@users.noreply.github.com> Date: Wed, 20 Mar 2024 11:00:55 +0000 Subject: [PATCH 36/36] Remove trailing comma Co-authored-by: Erik Sundell --- terraform/aws/projects/nasa-veda.tfvars | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/terraform/aws/projects/nasa-veda.tfvars b/terraform/aws/projects/nasa-veda.tfvars index 781621c33e..fae2cfff2f 100644 --- a/terraform/aws/projects/nasa-veda.tfvars +++ b/terraform/aws/projects/nasa-veda.tfvars @@ -65,7 +65,7 @@ hub_cloud_permissions = { "arn:aws:s3:::usgs-landsat", "arn:aws:s3:::usgs-landsat/*", "arn:aws:s3:::sentinel-cogs", - "arn:aws:s3:::sentinel-cogs/*", + "arn:aws:s3:::sentinel-cogs/*" ] }, {