
[New Hub] BICAN (MIT Brain) #3827

Closed
yuvipanda opened this issue Mar 21, 2024 · 9 comments

yuvipanda commented Mar 21, 2024

Copied over from https://github.com/2i2c-org/meta/issues/913

Process Note

I'm using this as a way to try to rejig our new hub request process. See https://github.com/2i2c-org/meta/issues/897 (particularly https://github.com/2i2c-org/meta/issues/897#issuecomment-2010984904) for more information.

The Miro board at https://miro.com/app/board/uXjVNjUP3iQ=/ describes the various 'phases' of new hub turn-up. Each phase will be marked as "READY" or "NOT READY" depending on whether all the information needed for it is available. Each section should also link to an appropriate runbook.

There will be customizations after this is all set up, but this is a pathway towards a standardized hub turn-up.

Phase 1: Account setup (READY)

This is applicable in cases where a dedicated cluster is being set up. The following table lists the information needed before this phase can start.

| Question | Answer |
| --- | --- |
| Cloud Provider | AWS |
| Will 2i2c pay for cloud costs? | Yes |
| Name of cloud account | bican |

Appropriate runbook: https://infrastructure.2i2c.org/hub-deployment-guide/cloud-accounts/new-aws-account/

Phase 2: Cluster setup (READY)

This assumes all engineers have access to the new account and will be able to set up the cluster + support, without any new hubs being set up yet.

| Question | Answer |
| --- | --- |
| Region / Zone of the cluster | us-east-2 |
| Name of cluster | bican |
| Is GPU required? | yes |

Appropriate runbooks:
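
Not a runbook, but for orientation, here's a minimal sketch of the eksctl cluster definition these answers imply. The node group names and sizes are illustrative assumptions; the actual definition is generated from our templates:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: bican
  region: us-east-2
nodeGroups:
  # Core node group; size here is an assumption for illustration
  - name: core-a
    instanceType: r5.xlarge
    minSize: 1
    maxSize: 6
  # GPU is required, so at least one GPU node group (scales to zero)
  - name: gpu-g4dn-xlarge
    instanceType: g4dn.xlarge
    minSize: 0
    maxSize: 4
```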

Phase 3 : Hub setup (READY)

There are going to be a number of hubs, and this phase starts specifying them.

Hub 1: Staging

Phase 3.1: Initial setup

| Question | Answer | Notes |
| --- | --- | --- |
| Name of the hub | staging | |
| Dask gateway? | no | |
| Splash image | https://static.wixstatic.com/media/c57c96_a9e46e43008349c8b65dcacc3ceaba35~mv2.png/v1/fill/w_402,h_150,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/_BICAN-full-logo-final.png | |
| URL | https://www.portal.brain-bican.org/ | |
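
For reference, a sketch of roughly how these answers could land in the hub's values file, assuming the usual basehub homepage template variables:

```yaml
# Sketch only, assuming the standard homepage templateVars keys.
jupyterhub:
  custom:
    homepage:
      templateVars:
        org:
          name: BICAN
          logo_url: https://static.wixstatic.com/media/c57c96_a9e46e43008349c8b65dcacc3ceaba35~mv2.png/v1/fill/w_402,h_150,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/_BICAN-full-logo-final.png
          url: https://www.portal.brain-bican.org/
```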

Phase 3.2: Authentication

| Question | Answer |
| --- | --- |
| Authentication Mechanism | GitHub (via GitHubOAuthenticator) |
| Org based access? | No |
| Admin Users | @kabilar, @aaronkanzer, @asmacdo, @satra, @djarecka |
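
A sketch of the authenticator config these answers imply (the OAuth app client ID/secret live in the encrypted config and are omitted here):

```yaml
# Sketch only: usernames are the GitHub handles from the table above;
# client id/secret are configured separately in the encrypted config.
jupyterhub:
  hub:
    config:
      JupyterHub:
        authenticator_class: github
      Authenticator:
        admin_users:
          - kabilar
          - aaronkanzer
          - asmacdo
          - satra
          - djarecka
```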

Phase 3.3: Object storage access

| Question | Answer | Notes |
| --- | --- | --- |
| Scratch bucket enabled? | Yes | |
| Persistent bucket enabled? | no | |
| Requestor pays requests to external buckets allowed? | no | |
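
The bucket itself gets provisioned via our Terraform; on the hub side this usually surfaces as an environment variable. A sketch, where the bucket name is an assumption rather than a decision:

```yaml
jupyterhub:
  singleuser:
    extraEnv:
      # Hypothetical bucket name; $(JUPYTERHUB_USER) expands to the
      # user's name so each user gets their own prefix. No persistent
      # bucket, and requester-pays access stays disabled.
      SCRATCH_BUCKET: s3://bican-scratch-staging/$(JUPYTERHUB_USER)
```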

Phase 3.4: Profile List

This was derived from looking at https://github.com/dandi/dandi-hub/blob/dandi/config.yaml.j2#L138-L210 and adapting it to match our standards.

Environments
| Display Name | Description | Overrides | Resource Allocation Choices |
| --- | --- | --- | --- |
| DANDI (CPU) | Default DANDI image with JupyterLab | `image: dandiarchive/dandihub:latest`<br>`image_pull_policy: Always` | CPU (see below) |
| DANDI Matlab (CPU) | DANDI image with MATLAB. Requires you to bring your own license. | `image: dandiarchive/dandihub:latest-matlab`<br>`image_pull_policy: Always` | CPU |
| DANDI (GPU) | DANDI image with JupyterLab and GPU support | `image: dandiarchive/dandihub:latest-gpu`<br>`image_pull_policy: Always`<br>`extra_resource_limits:`<br>`  nvidia.com/gpu: 1` | GPU |
| DANDI Matlab (GPU) | DANDI Matlab image with GPU support. Requires you to bring your own license. | `image: dandiarchive/dandihub:latest-gpu-matlab`<br>`image_pull_policy: Always`<br>`extra_resource_limits:`<br>`  nvidia.com/gpu: 1` | GPU |
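
A sketch of how the first row of this table could map onto a KubeSpawner profileList entry, with the resource allocation choices slotted in from the generated lists below:

```yaml
# Sketch: structure follows KubeSpawner's profileList/profile_options;
# the full choices dict comes from the generated allocations below.
jupyterhub:
  singleuser:
    profileList:
      - display_name: DANDI (CPU)
        description: Default DANDI image with JupyterLab
        default: true
        kubespawner_override:
          image: dandiarchive/dandihub:latest
          image_pull_policy: Always
        profile_options:
          resource_allocation:
            display_name: Resource Allocation
            choices:
              mem_3_7:
                display_name: 3.7 GB RAM, up to 3.7 CPUs
                default: true
                kubespawner_override:
                  mem_guarantee: 3982682624
                  mem_limit: 3982682624
                  cpu_guarantee: 0.46875
                  cpu_limit: 3.75
                  node_selector:
                    node.kubernetes.io/instance-type: r5.xlarge
              # ...remaining mem_* choices from the generated list below
```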
Resource Allocations
CPU

Generated by `deployer generate resource-allocation choices r5.xlarge --num-allocations 4`

```yaml
mem_3_7:
  display_name: 3.7 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 3982682624
    mem_limit: 3982682624
    cpu_guarantee: 0.46875
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
  default: true
mem_7_4:
  display_name: 7.4 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 7965365248
    mem_limit: 7965365248
    cpu_guarantee: 0.9375
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_14_8:
  display_name: 14.8 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 15930730496
    mem_limit: 15930730496
    cpu_guarantee: 1.875
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_29_7:
  display_name: 29.7 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 31861460992
    mem_limit: 31861460992
    cpu_guarantee: 3.75
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_60_6:
  display_name: 60.6 GB RAM, upto 15.7 CPUs
  kubespawner_override:
    mem_guarantee: 65094813696
    mem_limit: 65094813696
    cpu_guarantee: 7.86
    cpu_limit: 15.72
    node_selector:
      node.kubernetes.io/instance-type: r5.4xlarge
mem_121_2:
  display_name: 121.2 GB RAM, upto 15.7 CPUs
  kubespawner_override:
    mem_guarantee: 130189627392
    mem_limit: 130189627392
    cpu_guarantee: 15.72
    cpu_limit: 15.72
    node_selector:
      node.kubernetes.io/instance-type: r5.4xlarge
mem_244_9:
  display_name: 244.9 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 263005526016
    mem_limit: 263005526016
    cpu_guarantee: 31.8
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
mem_489_9:
  display_name: 489.9 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 526011052032
    mem_limit: 526011052032
    cpu_guarantee: 63.6
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
```
GPU

Manually set up, but should be autogenerated

```yaml
gpu_1:
  display_name: 1 T4 GPU, ~4 CPUs, ~16GB of RAM
  kubespawner_override:
    mem_guarantee: 14G
    mem_limit: 16G
    cpu_guarantee: 3
    cpu_limit: 4
    node_selector:
      node.kubernetes.io/instance-type: g4dn.xlarge
  default: true
gpu_2:
  display_name: 1 T4 GPU, ~8 CPUs, ~32GB of RAM
  kubespawner_override:
    mem_guarantee: 29G
    mem_limit: 32G
    cpu_guarantee: 6
    cpu_limit: 8
    node_selector:
      node.kubernetes.io/instance-type: g4dn.2xlarge
```

Hub 2: BICAN hub

The same as staging, just with a different name (bican).
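
In cluster.yaml terms, that presumably boils down to two hub entries sharing the same values files. A sketch, where the values-file names are assumptions:

```yaml
# Sketch only: file names are hypothetical.
hubs:
  - name: staging
    helm_chart: basehub
    helm_chart_values_files:
      - common.values.yaml
      - staging.values.yaml
  - name: bican
    helm_chart: basehub
    helm_chart_values_files:
      - common.values.yaml  # identical config, different name
      - bican.values.yaml
```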

@sgibson91

Completed Phase 1. The new AWS account exists. Quota increase requests were automatically sent from the request template, though I think our templates ask for less quota than we are given by default, judging by this response in Freshdesk: https://2i2c.freshdesk.com/a/tickets/1434

@yuvipanda

Thanks for pointing out the quota setup, @sgibson91. I've handled that in #3780 (comment). I'll amend our documentation now to match.

@yuvipanda

I've also opened #3834 to cross-link the GPU work at cluster creation time, so that's set up easily.

@yuvipanda

Now that #3834 (comment) is merged, I've removed the explicit pointer to the GPU docs from the issue directly.

@yuvipanda

#3836 also clarifies the current situation with quotas.


sgibson91 commented Mar 25, 2024

A process note (more documentation):

If we want each phase to be self-contained and actionable by separate engineers if necessary, then I think the following section of the new cluster docs should be moved into the new hub docs, as I am not creating any hub files at this time.

  1. https://infrastructure.2i2c.org/hub-deployment-guide/new-cluster/aws/#export-the-efs-ip-address-for-home-directories

The same may be said for the two sections that follow:

  1. Add cluster to be automatically deployed: This probably doesn't need to happen until at least the support chart is deployed (but I guess we're bundling cluster and support turn-up together, which is fine)
    • Edited to add: This definitely doesn't need to happen until hubs exist, since the change to the workflow file is to add a failure variable so that the prod deploy doesn't go ahead if the staging deploy fails. The workflow will still work and deploy support without the edit that this section of the docs refers to.
  2. https://infrastructure.2i2c.org/hub-deployment-guide/new-cluster/aws/#a-note-on-the-support-chart-for-aws-clusters : AWS-specific note for the support chart

I think these are all here because they're AWS-specific and it was easier at the time they were written, but now we could use synced panels to show/hide cloud-vendor-specific info at the appropriate times, like we do in the GCP/Azure cluster setup docs.


sgibson91 commented Mar 25, 2024

Phase 2 now complete and ready for review in PR #3840

@sgibson91

I opened #3839 with some docs updates as I went through, and addressed link (3) from #3827 (comment).

As for links (1) and (2), I didn't really know where to move them right now, especially considering the new hub docs are not as "run this command, then do this thing" based as the new cluster docs are. Perhaps the engineer who completes Phase 3 will have a better inclination of where in the docs those sections should live.

@yuvipanda

Thanks @sgibson91! I'll try to incorporate those changes into various places.

sgibson91 removed their assignment Mar 26, 2024
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Apr 2, 2024
Earlier was just defaulting the first profile item to lab.
I'll add this to the spec on the image.

Ref 2i2c-org#3827
Ref 2i2c-org#3824
github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Apr 3, 2024