
Commit 45a4c13

Merge pull request #26 from aws/1.3.1
Sagemaker Hyperpod Recipes Release 1.3.1
2 parents 1a1b4f6 + ffaa5bb commit 45a4c13

91 files changed: +676 / -95 lines


README.md

Lines changed: 2 additions & 2 deletions

@@ -17,7 +17,7 @@ Amazon SageMaker HyperPod recipes include built-in support for:
 - Fine-tuning: Full, QLoRA, LoRA
 - AWS Instances: ml.p5.48xlarge, ml.p4d.24xlarge, and ml.trn1.32xlarge instance families
 - Supported Models: DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, Mixtral models
-- Model Evaluation: Tensorboard
+- Model Evaluation: [Tensorboard](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.tensorboard.html#module-lightning.pytorch.loggers.tensorboard), [MLflow](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html), [Wandb](https://lightning.ai/docs/pytorch/stable/extensions/generated/lightning.pytorch.loggers.WandbLogger.html) - feel free to add any key word arguments to the Logger classes by using their associated kwargs config
 
 ###### ***Note: For DeepSeek R1 671b customers must ensure that their model repository contains weights of type bf16. DeepSeek's [HuggingFace repository](https://huggingface.co/deepseek-ai/DeepSeek-R1) contains the model in dtype fp8 by default. In order to convert a model repository from fp8 to bf16 we recommend using [this script](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/blob/main/src/hyperpod_nemo_adapter/scripts/fp8_cast_bf16.py) and pointing your recipe to the output directory.

@@ -161,7 +161,7 @@ employing the `enroot` command. Please refer to the following documentation on b
 ```bash
 REGION="us-west-2"
 IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:${TAG}"
-aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin 855988369404.dkr.ecr.${REGION}.amazonaws.com
+aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
 enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
 mv $PWD/smdistributed-modelparallel.sqsh "/fsx/smdistributed-modelparallel.sqsh"
 ```
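The updated README bullet points users at the per-logger kwargs configs. A minimal sketch (not part of this commit) of passing an extra keyword argument through, assuming the `exp_manager` keys introduced in the recipes below and treating the `project` entry as a hypothetical extra kwarg forwarded to lightning.pytorch.loggers.WandbLogger:

```yaml
# Minimal sketch, not from the commit: extra keyword arguments for a Lightning
# logger go into its *_kwargs config. "project" is a hypothetical WandbLogger
# kwarg; exp_dir is a placeholder path.
exp_manager:
  exp_dir: /fsx/experiments/my-run
  create_wandb_logger: True
  wandb_logger_kwargs:
    save_dir: ${recipes.exp_manager.exp_dir}
    project: hyperpod-recipes-demo   # any other WandbLogger kwarg can be added the same way
```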

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_lora.yaml

Lines changed: 8 additions & 2 deletions

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode

@@ -136,4 +142,4 @@ model:
 # Profiling configs
 # Viztracer profiling options
 viztracer:
-  enabled: True
+  enabled: False
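All three logger flags now default to False (TensorBoard logging was previously on by default). A minimal sketch, assuming only the keys introduced above, of switching TensorBoard and MLflow back on for a run:

```yaml
# Minimal sketch, not from the commit: re-enable the TensorBoard and MLflow
# loggers that ship disabled in this release; keys mirror the recipe above.
exp_manager:
  create_tensorboard_logger: True
  summary_writer_kwargs: {"save_dir": "${recipes.exp_manager.exp_dir}/tensorboard"}
  create_mlflow_logger: True
  mlflow_logger_kwargs: {"tracking_uri": "${recipes.exp_manager.exp_dir}/mlflow"}
```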

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora.yaml

Lines changed: 9 additions & 3 deletions

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode

@@ -40,7 +46,7 @@ exp_manager:
   auto_checkpoint:
     enabled: False
   export_full_model:
-    # 671B LoRA does not support export_full_model.
+    # 671B qLoRA does not support export_full_model.
     # Instead, use the merge-peft-checkpoint script after training.
     # Set every_n_train_steps = 0 to disable full checkpointing
     every_n_train_steps: 0

@@ -136,4 +142,4 @@ model:
 # Profiling configs
 # Viztracer profiling options
 viztracer:
-  enabled: True
+  enabled: False

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_70b_seq16k_gpu_fine_tuning.yaml

Lines changed: 7 additions & 1 deletion

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_70b_seq16k_gpu_lora.yaml

Lines changed: 7 additions & 1 deletion

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_70b_seq8k_gpu_fine_tuning.yaml

Lines changed: 7 additions & 1 deletion

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_70b_seq8k_gpu_lora.yaml

Lines changed: 7 additions & 1 deletion

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_8b_seq16k_gpu_fine_tuning.yaml

Lines changed: 7 additions & 1 deletion

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_8b_seq16k_gpu_lora.yaml

Lines changed: 7 additions & 1 deletion

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_8b_seq8k_gpu_fine_tuning.yaml

Lines changed: 7 additions & 1 deletion

@@ -23,7 +23,13 @@ trainer:
 exp_manager:
   exp_dir: null
   name: experiment
-  create_tensorboard_logger: True
+  # experiment loggers
+  create_tensorboard_logger: False
+  summary_writer_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}/tensorboard"}
+  create_mlflow_logger: False
+  mlflow_logger_kwargs: {"tracking_uri" : "${recipes.exp_manager.exp_dir}/mlflow"}
+  create_wandb_logger: False
+  wandb_logger_kwargs: {"save_dir" : "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
   create_checkpoint_callback: True
   # Configs to save checkpoint with a fixed interval
   # Note: These config will not work with auto checkpoint mode
