Releases · aws/sagemaker-hyperpod-recipes
v1.4.0
v1.3.3
Release v1.3.2
Release Notes - v1.3.2
What's Changed
New feature
- A random 5-character alphanumeric hash is appended to the end of recipe run names so that Kubernetes does not block consecutive runs of the same recipe (sketched below).
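A minimal sketch of the idea, not the repository's actual implementation: a short random alphanumeric suffix keeps each Kubernetes run name unique, so re-running the same recipe does not collide with the previous run's resources. The helper name below is hypothetical.

```python
# Hypothetical sketch of the run-name suffix described above; the recipes'
# real helper and naming rules may differ.
import random
import string

def append_run_suffix(run_name: str, length: int = 5) -> str:
    """Append a random alphanumeric suffix so repeated runs get unique names."""
    # Lowercase letters and digits keep the suffix valid in Kubernetes resource names.
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=length))
    return f"{run_name}-{suffix}"

print(append_run_suffix("llama3-8b-pretrain"))  # e.g. "llama3-8b-pretrain-x7k2q"
```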
Bug fix
Release v1.3.1
Release Notes - v1.3.1
What's Changed
New recipes
- Added support for fine-tuning DeepSeek R1 671B using PEFT (LoRA and QLoRA).
- Added support for Wandb and MLflow loggers (illustrated below).
All new recipes are listed under the "Model Support" section of the README.
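The loggers are enabled through the recipe configuration files (see the README for the exact keys). As a hedged illustration only: the recipes run on the NeMo / PyTorch Lightning stack, where these options typically map to loggers like the ones below; the project and experiment names are placeholders.

```python
# Illustration of what the Wandb/MLflow logger support corresponds to on the
# underlying PyTorch Lightning stack; not the recipes' own wiring.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import MLFlowLogger, WandbLogger

wandb_logger = WandbLogger(project="hyperpod-recipes-demo")            # placeholder project name
mlflow_logger = MLFlowLogger(experiment_name="hyperpod-recipes-demo")  # placeholder experiment name

# Metrics from the training job are reported to both tracking backends.
trainer = Trainer(logger=[wandb_logger, mlflow_logger])
```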
Release v1.2.1
Release Notes - v1.2.1
What's Changed
- Added the ability to force re-installation of the NeMo adapter so that experimental changes are correctly picked up by the job environment.
- Improved test coverage for the README.md file.
- Improved test coverage for the recipes.
Notes
- Users should update to version 1.1.1 or later of the SageMaker HyperPod training adapter for NeMo when using this version of HyperPod recipes. Earlier versions may experience compatibility issues.
Release v1.2.0
Release Notes - v1.2.0
What's Changed
New recipes
- Added support for DeepSeek's family of distilled R1 models. Users can now fine-tune various sizes of DeepSeek-R1-Distill-Llama and DeepSeek-R1-Distill-Qwen using SFT and PEFT (LoRA/QLoRA).
All new recipes are listed under the "Model Support" section of the README.
Notes
- Users should update to version 1.1.0 or later of the SageMaker HyperPod training adapter for NeMo when using this version of HyperPod recipes. Earlier versions may experience compatibility issues.
Release v1.1.0
Release Notes - v1.1.0
What's Changed
New recipes
- Added support for Llama 3.1 70B and Mixtral 22B 128-node pre-training.
- Added support for Llama 3.3 fine-tuning with SFT and LoRA.
- Added support for Llama 405B QLoRA fine-tuning with a 32k sequence length.
All new recipes are listed under the "Model Support" section of the README.
Release v1.0.1
Release Notes - v1.0.1
What's Changed
Bug fixes
- Upgraded the Transformers library in the enroot Slurm code path to support running Llama 3.2 recipes with an enroot container.
HyperPod Enhancements
- Added support for additional HyperPod instance types, including p5e and g6.
Release v1.0.0
Release Notes - v1.0.0
We're thrilled to announce the initial release of sagemaker-hyperpod-recipes!
🎉 Features
- Unified Job Submission: Submit training and fine-tuning workflows to SageMaker HyperPod or SageMaker training jobs using a single entry point (see the submission sketch after this list)
- Flexible Configuration: Customize your training jobs with three types of configuration files:
  - General Configuration (ex: recipes_collection/config.yaml)
  - Cluster Configuration (ex: recipes_collection/cluster/slurm.yaml)
  - Recipe Configuration (ex: recipes_collection/recipes/training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain.yaml)
- Pre-defined LLM Recipes: Access a collection of ready-to-use recipes for training Large Language Models
- Cluster Agnostic: Compatible with SageMaker HyperPod (with Slurm or Amazon EKS orchestrators) and SageMaker training jobs
- Built on NVIDIA NeMo Framework: Leverages the NVIDIA NeMo Framework Launcher for efficient job management
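As a sketch of the single-entry-point flow: whether the target is a HyperPod cluster or a SageMaker training job, the submission goes through main.py. The override names below are placeholders in the Hydra style used by the NeMo Framework Launcher; the README documents the actual invocation.

```python
# Hedged sketch only: submit a recipe through the single entry point.
# The override keys are illustrative placeholders, not the documented CLI.
import subprocess

subprocess.run(
    [
        "python3",
        "main.py",
        # Recipe configuration: which training recipe to run (example path from above).
        "recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain",
        # Cluster configuration: where the job runs (e.g., Slurm on HyperPod).
        "cluster=slurm",
        # General configuration override: where results are written (placeholder path).
        "base_results_dir=./results",
    ],
    check=True,
)
```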
🗂️ Repository Structure
- main.py: Primary entry point for submitting training jobs
- launcher_scripts/: Collection of commonly used scripts for LLM training
- recipes_collection/: Pre-defined LLM recipes provided by developers
🔧 Key Components
- General Configuration: Common settings like default parameters and environment variables
- Cluster Configuration: Cluster-specific settings (e.g., volumes and labels for Kubernetes; job names for Slurm)
- Recipe Configuration: Training job settings, including model types, sharding degree, and dataset paths (see the sketch below)
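The three layers above are plain YAML files, so their relationship can be pictured as a layered merge, with the cluster and recipe files refining the general defaults. The sketch below uses OmegaConf purely as an illustration; the actual composition is performed by Hydra and the NeMo Framework Launcher inside main.py.

```python
# Conceptual sketch of the configuration layering; not the launcher's actual code.
from omegaconf import OmegaConf

general = OmegaConf.load("recipes_collection/config.yaml")         # defaults, env variables
cluster = OmegaConf.load("recipes_collection/cluster/slurm.yaml")  # cluster-specific settings
recipe = OmegaConf.load(
    "recipes_collection/recipes/training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain.yaml"
)                                                                  # model, sharding, data paths

# Later layers refine earlier ones; the real config-group layout is handled by Hydra.
config = OmegaConf.merge(general, cluster, recipe)
print(OmegaConf.to_yaml(config))
```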
📚 Documentation
- Refer to the README.md for detailed usage instructions and examples
🤝 Contributing
We welcome contributions to enhance the capabilities of sagemaker-hyperpod-recipes. Please refer to our contributing guidelines for more information.
Thank you for choosing sagemaker-hyperpod-recipes for your large-scale language model training needs!