Update organization and tag to V1 #150

Open: wants to merge 1 commit into base: main

perifaws
Contributor

Addresses #149

Update naming for JAX, update naming for single-digit examples, and add a new directory for best practices (the EFA cheat sheet was in architectures).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mhuguesaws
Contributor

mhuguesaws commented Feb 22, 2024

Please stop numbering things. It is useless and will make adding and removing things much harder.

Why is AMI numbered 1 and container 2? What's the logic?
efa_version.sh is in 5 best practices. So do best practices come last?

@mhuguesaws
Contributor

Here is my proposal:

- docs/
- core_infra/
- orchestrators/
  - aws-parallelcluster/
  - sagemaker-hyperpod/
    - slurm/
      - lifecycle-scripts/
  - amazon-eks/
  - aws-batch/
- observability/
- ml-frameworks/
  - [FRAMEWORK_NAME]/
    - slurm/
    - kubernetes/
    - Dockerfile
    - README
- ml-micro-benchmarks/
- infra-validation/
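
To make the proposal easier to try out, here is a minimal sketch that scaffolds the layout above into an empty directory. It is not part of this PR: the `scaffold` helper and the `LAYOUT`/`FILES` constants are hypothetical names, and `FRAMEWORK_NAME` is kept as a literal placeholder directory rather than a real framework.

```python
import tempfile
from pathlib import Path

# Directories from the proposed layout. "FRAMEWORK_NAME" is a literal
# placeholder (assumption), not an actual framework directory.
LAYOUT = [
    "docs",
    "core_infra",
    "orchestrators/aws-parallelcluster",
    "orchestrators/sagemaker-hyperpod/slurm/lifecycle-scripts",
    "orchestrators/amazon-eks",
    "orchestrators/aws-batch",
    "observability",
    "ml-frameworks/FRAMEWORK_NAME/slurm",
    "ml-frameworks/FRAMEWORK_NAME/kubernetes",
    "ml-micro-benchmarks",
    "infra-validation",
]

# Plain files that sit next to the per-orchestrator directories.
FILES = [
    "ml-frameworks/FRAMEWORK_NAME/Dockerfile",
    "ml-frameworks/FRAMEWORK_NAME/README",
]


def scaffold(root):
    """Create the proposed tree under `root` and return the created paths."""
    root = Path(root)
    created = []
    for rel in LAYOUT:
        directory = root / rel
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    for rel in FILES:
        file_path = root / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch(exist_ok=True)
        created.append(file_path)
    return created


if __name__ == "__main__":
    # Use a throwaway directory so nothing in a real checkout is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(tmp):
            print(path.relative_to(tmp))
```

Running it against a temporary directory makes it cheap to compare alternatives, for example moving observability/ under each orchestrator, before committing to a structure.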

@mhuguesaws
Contributor

@awsankur if you can comment here.

@awsankur
Contributor

I like this structure. A couple of comments:

  1. The observability solution will depend on the orchestrator, so each orchestrator should have its own observability section. Ideally, observability is enabled automatically when we build a cluster, or can be enabled in a few steps on an existing cluster.
  2. I am assuming [FRAMEWORK_NAME] corresponds to a test case in our current structure. I think we can bring more clarity here by defining a fixed set of [FRAMEWORK_NAMES]:
    a. NVIDIA [NeMo, NeMo-Multimodal, BioNeMo, etc.]; we can add DALI and MONAI in the future
    b. MosaicML [MPT, etc.]
    c. PyTorch [DDP, FSDP, etc.]
    d. SM [DataParallel, Model Parallel, FSDP, etc.]
    e. TensorFlow
    f. JAX

Within each FRAMEWORK_NAME we can have Dockerfiles, sbatch scripts, Kubernetes YAML, and any other necessary files for each model name.
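
As a rough illustration of the grouping above, the sketch below expands a vendor-to-example mapping into directory paths. Everything here is an assumption made for illustration: the lowercase directory names, the `FRAMEWORKS` mapping, the `framework_dirs` helper, and the choice to put `slurm/` and `kubernetes/` under each example. Vendors with no examples listed yet (TensorFlow, JAX) get the two subdirectories directly.

```python
from pathlib import PurePosixPath

# Hypothetical directory names derived from the list above (assumption).
FRAMEWORKS = {
    "nvidia": ["nemo", "nemo-multimodal", "bionemo"],
    "mosaicml": ["mpt"],
    "pytorch": ["ddp", "fsdp"],
    "sm": ["dataparallel", "model-parallel", "fsdp"],
    "tensorflow": [],
    "jax": [],
}

ORCHESTRATOR_DIRS = ("slurm", "kubernetes")


def framework_dirs(root="ml-frameworks"):
    """Return the per-orchestrator directories for every vendor/example pair."""
    paths = []
    for vendor, examples in FRAMEWORKS.items():
        # Fall back to the vendor directory itself when no examples are listed.
        bases = [PurePosixPath(root, vendor, example) for example in examples]
        if not bases:
            bases = [PurePosixPath(root, vendor)]
        for base in bases:
            for orchestrator in ORCHESTRATOR_DIRS:
                paths.append(str(base / orchestrator))
    return paths


if __name__ == "__main__":
    for path in framework_dirs():
        print(path)
```

Keeping the mapping in one place would let a new model be added under a vendor without renumbering or moving any other directory.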

Thoughts?

@mhuguesaws
Contributor

Love 2: organizing by "vendor".

For 1, I don't think we'll go beyond Grafana + Prometheus at this point. We can organize the observability section by orchestrator, since the deployment and setup will be different.

@perifaws
Contributor Author

@mhuguesaws how about CloudWatch or profilers like Nsight?

@mhuguesaws
Contributor

> @mhuguesaws how about CloudWatch or profilers like Nsight?

Profiler in profiler ;)

@awsankur
Contributor

We should add Nsight

@mhuguesaws
Contributor

> We should add Nsight

profilers.

@KeitaW
Collaborator

KeitaW commented Mar 11, 2024

I was wondering whether observability should live under orchestrators or have one subdirectory per orchestrator.

KeitaW force-pushed the reorganization branch 2 times, most recently from 11abd11 to a6ffefc (June 4, 2024 02:26)
KeitaW force-pushed the main branch 2 times, most recently from 44e448e to 1209815 (June 4, 2024 02:30)