
[New Hub] BICAN (MIT Brain) #3827

Closed
yuvipanda opened this issue Mar 21, 2024 · 9 comments

yuvipanda commented Mar 21, 2024

Copied over from https://github.com/2i2c-org/meta/issues/913

Process Note

I'm using this as a way to try to rejig our new hub request process. See https://github.com/2i2c-org/meta/issues/897 (particularly https://github.com/2i2c-org/meta/issues/897#issuecomment-2010984904) for more information.

The Miro board at https://miro.com/app/board/uXjVNjUP3iQ=/ describes the various 'phases' of new hub turn-up. Each phase will be marked as "READY" or "NOT READY" depending on whether all the information needed for it is available. Each section should also link to an appropriate runbook.

There will be customizations after this is all set up, but this is a pathway towards a standardized hub turn-up.

Phase 1: Account setup (READY)

This is applicable in cases where a dedicated cluster is being set up. The following table lists the information needed before this phase can start.

| Question | Answer |
| --- | --- |
| Cloud Provider | AWS |
| Will 2i2c pay for cloud costs? | Yes |
| Name of cloud account | bican |

Appropriate runbook: https://infrastructure.2i2c.org/hub-deployment-guide/cloud-accounts/new-aws-account/

Phase 2: Cluster setup (READY)

This assumes all engineers have access to the new account and will be able to set up the cluster + support, without any new hubs being set up yet.

| Question | Answer |
| --- | --- |
| Region / Zone of the cluster | us-east-2 |
| Name of cluster | bican |
| Is GPU required? | yes |

Appropriate runbooks:
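
Not a runbook, but for orientation, here's a minimal sketch of the eksctl cluster definition these answers imply. The node group names and sizes are illustrative assumptions; the actual definition is generated from our templates:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: bican
  region: us-east-2
nodeGroups:
  # Core node group; size here is an assumption for illustration
  - name: core-a
    instanceType: r5.xlarge
    minSize: 1
    maxSize: 6
  # GPU is required, so at least one GPU node group (scales to zero)
  - name: gpu-g4dn-xlarge
    instanceType: g4dn.xlarge
    minSize: 0
    maxSize: 4
```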

Phase 3 : Hub setup (READY)

There are going to be a number of hubs, and this phase starts specifying them.

Hub 1: Staging

Phase 3.1: Initial setup

| Question | Answer | Notes |
| --- | --- | --- |
| Name of the hub | staging | |
| Dask gateway? | no | |
| Splash image | https://static.wixstatic.com/media/c57c96_a9e46e43008349c8b65dcacc3ceaba35~mv2.png/v1/fill/w_402,h_150,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/_BICAN-full-logo-final.png | |
| URL | https://www.portal.brain-bican.org/ | |
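
For reference, a sketch of roughly how these answers could land in the hub's values file, assuming the usual basehub homepage template variables:

```yaml
# Sketch only, assuming the standard homepage templateVars keys.
jupyterhub:
  custom:
    homepage:
      templateVars:
        org:
          name: BICAN
          logo_url: https://static.wixstatic.com/media/c57c96_a9e46e43008349c8b65dcacc3ceaba35~mv2.png/v1/fill/w_402,h_150,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/_BICAN-full-logo-final.png
          url: https://www.portal.brain-bican.org/
```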

Phase 3.2: Authentication

| Question | Answer |
| --- | --- |
| Authentication Mechanism | GitHub (via GitHubOAuthenticator) |
| Org based access? | No |
| Admin Users | @kabilar, @aaronkanzer, @asmacdo, @satra, @djarecka |
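
A sketch of the authenticator config these answers imply (the OAuth app client ID/secret live in the encrypted config and are omitted here):

```yaml
# Sketch only: usernames are the GitHub handles from the table above;
# client id/secret are configured separately in the encrypted config.
jupyterhub:
  hub:
    config:
      JupyterHub:
        authenticator_class: github
      Authenticator:
        admin_users:
          - kabilar
          - aaronkanzer
          - asmacdo
          - satra
          - djarecka
```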

Phase 3.3: Object storage access

| Question | Answer | Notes |
| --- | --- | --- |
| Scratch bucket enabled? | Yes | |
| Persistent bucket enabled? | no | |
| Requestor pays requests to external buckets allowed? | no | |
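
The bucket itself gets provisioned via our Terraform; on the hub side this usually surfaces as an environment variable. A sketch, where the bucket name is an assumption rather than a decision:

```yaml
jupyterhub:
  singleuser:
    extraEnv:
      # Hypothetical bucket name; $(JUPYTERHUB_USER) expands to the
      # user's name so each user gets their own prefix. No persistent
      # bucket, and requester-pays access stays disabled.
      SCRATCH_BUCKET: s3://bican-scratch-staging/$(JUPYTERHUB_USER)
```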

Phase 3.4: Profile List

This was derived from looking at https://github.com/dandi/dandi-hub/blob/dandi/config.yaml.j2#L138-L210 and adapting it to match our standards.

Environments
| Display Name | Description | Overrides | Resource Allocation Choices |
| --- | --- | --- | --- |
| DANDI (CPU) | Default DANDI image with JupyterLab | `image: dandiarchive/dandihub:latest`<br>`image_pull_policy: Always` | CPU (see below) |
| DANDI Matlab (CPU) | DANDI image with MATLAB. Requires you to bring your own license. | `image: dandiarchive/dandihub:latest-matlab`<br>`image_pull_policy: Always` | CPU |
| DANDI (GPU) | DANDI image with JupyterLab and GPU support | `image: dandiarchive/dandihub:latest-gpu`<br>`image_pull_policy: Always`<br>`extra_resource_limits:`<br>`  nvidia.com/gpu: 1` | GPU |
| DANDI Matlab (GPU) | DANDI Matlab image with GPU support. Requires you to bring your own license. | `image: dandiarchive/dandihub:latest-gpu-matlab`<br>`image_pull_policy: Always`<br>`extra_resource_limits:`<br>`  nvidia.com/gpu: 1` | GPU |
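
A sketch of how the first row of this table could map onto a KubeSpawner profileList entry, with the resource allocation choices slotted in from the generated lists below:

```yaml
# Sketch: structure follows KubeSpawner's profileList/profile_options;
# the full choices dict comes from the generated allocations below.
jupyterhub:
  singleuser:
    profileList:
      - display_name: DANDI (CPU)
        description: Default DANDI image with JupyterLab
        default: true
        kubespawner_override:
          image: dandiarchive/dandihub:latest
          image_pull_policy: Always
        profile_options:
          resource_allocation:
            display_name: Resource Allocation
            choices:
              mem_3_7:
                display_name: 3.7 GB RAM, up to 3.7 CPUs
                default: true
                kubespawner_override:
                  mem_guarantee: 3982682624
                  mem_limit: 3982682624
                  cpu_guarantee: 0.46875
                  cpu_limit: 3.75
                  node_selector:
                    node.kubernetes.io/instance-type: r5.xlarge
              # ...remaining mem_* choices from the generated list below
```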
Resource Allocations
CPU

Generated by `deployer generate resource-allocation choices r5.xlarge --num-allocations 4`

```yaml
mem_3_7:
  display_name: 3.7 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 3982682624
    mem_limit: 3982682624
    cpu_guarantee: 0.46875
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
  default: true
mem_7_4:
  display_name: 7.4 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 7965365248
    mem_limit: 7965365248
    cpu_guarantee: 0.9375
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_14_8:
  display_name: 14.8 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 15930730496
    mem_limit: 15930730496
    cpu_guarantee: 1.875
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_29_7:
  display_name: 29.7 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 31861460992
    mem_limit: 31861460992
    cpu_guarantee: 3.75
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_60_6:
  display_name: 60.6 GB RAM, upto 15.7 CPUs
  kubespawner_override:
    mem_guarantee: 65094813696
    mem_limit: 65094813696
    cpu_guarantee: 7.86
    cpu_limit: 15.72
    node_selector:
      node.kubernetes.io/instance-type: r5.4xlarge
mem_121_2:
  display_name: 121.2 GB RAM, upto 15.7 CPUs
  kubespawner_override:
    mem_guarantee: 130189627392
    mem_limit: 130189627392
    cpu_guarantee: 15.72
    cpu_limit: 15.72
    node_selector:
      node.kubernetes.io/instance-type: r5.4xlarge
mem_244_9:
  display_name: 244.9 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 263005526016
    mem_limit: 263005526016
    cpu_guarantee: 31.8
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
mem_489_9:
  display_name: 489.9 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 526011052032
    mem_limit: 526011052032
    cpu_guarantee: 63.6
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
```
GPU

Manually set up, but should be autogenerated

```yaml
gpu_1:
  display_name: 1 T4 GPU, ~4 CPUs, ~16GB of RAM
  kubespawner_override:
    mem_guarantee: 14G
    mem_limit: 16G
    cpu_guarantee: 3
    cpu_limit: 4
    node_selector:
      node.kubernetes.io/instance-type: g4dn.xlarge
  default: true
gpu_2:
  display_name: 1 T4 GPU, ~8 CPUs, ~32GB of RAM
  kubespawner_override:
    mem_guarantee: 29G
    mem_limit: 32G
    cpu_guarantee: 6
    cpu_limit: 8
    node_selector:
      node.kubernetes.io/instance-type: g4dn.2xlarge
```

Hub 2: BICAN hub

The same as staging, just with a different name (bican).
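
In cluster.yaml terms, that presumably boils down to two hub entries sharing the same values files. A sketch, where the values-file names are assumptions:

```yaml
# Sketch only: file names are hypothetical.
hubs:
  - name: staging
    helm_chart: basehub
    helm_chart_values_files:
      - common.values.yaml
      - staging.values.yaml
  - name: bican
    helm_chart: basehub
    helm_chart_values_files:
      - common.values.yaml  # identical config, different name
      - bican.values.yaml
```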

@sgibson91

Completed Phase 1. The new AWS account exists. Quota increase requests were automatically sent from the request template, though I think our templates ask for less quota than we are given by default, judging by this response in Freshdesk: https://2i2c.freshdesk.com/a/tickets/1434

@yuvipanda

Thanks for pointing out the quota setup, @sgibson91. I've handled that in #3780 (comment). I'll amend our documentation now to match.

@yuvipanda

I've also opened #3834 to cross-link the GPU work at cluster creation time, so that's set up easily.

@yuvipanda

Now that #3834 (comment) is merged, I've removed the explicit pointer to the GPU docs from the issue directly.

@yuvipanda

#3836 also clarifies the current situation with quotas.


sgibson91 commented Mar 25, 2024

A process note (more documentation):

If we want each phase to be self-contained and actionable by separate engineers if necessary, then I think the following section of the new cluster docs should be moved into the new hub docs, as I am not creating any hub files at this time.

  1. https://infrastructure.2i2c.org/hub-deployment-guide/new-cluster/aws/#export-the-efs-ip-address-for-home-directories

The same may be said for the two sections that follow:

  1. Add cluster to be automatically deployed: This probably doesn't need to happen until at least the support chart is deployed (but I guess we're bundling cluster and support turn-up together, which is fine)
    • Edited to add: This definitely doesn't need to happen until hubs exist, since the change to the workflow file is to add a failure variable so that the prod deploy doesn't go ahead if the staging deploy fails. The workflow will still work and deploy support without the edit that this section of the docs refers to.
  2. https://infrastructure.2i2c.org/hub-deployment-guide/new-cluster/aws/#a-note-on-the-support-chart-for-aws-clusters : AWS-specific note for the support chart

I think these are all here because they're AWS-specific and it was easier at the time they were written, but now we could use synced panels to show/hide cloud-vendor-specific info at the appropriate times, like we do in the GCP/Azure cluster setup docs.


sgibson91 commented Mar 25, 2024

Phase 2 now complete and ready for review in PR #3840

@sgibson91

I opened #3839 with some docs updates as I went through, and addressed link (3) from #3827 (comment).

As for links (1) and (2), I didn't really know where to move them right now, especially considering the new hub docs are not as "run this command, then do this thing" based as the new cluster docs are. Perhaps the engineer who completes Phase 3 will have a better inclination of where in the docs those sections should live.

@yuvipanda

Thanks @sgibson91! I'll try to incorporate those changes into various places.

sgibson91 removed their assignment Mar 26, 2024
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Apr 2, 2024
Earlier was just defaulting the first profile item to lab.
I'll add this to the spec on the image.

Ref 2i2c-org#3827
Ref 2i2c-org#3824
github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Apr 3, 2024