Releases: aws/sagemaker-hyperpod-recipes

v1.4.0

16 Jul 14:25
3745a4c

Release Notes - v1.4.0

What's Changed

New feature

  • Added recipes for Amazon Nova model customization.

v1.3.3

22 Apr 19:56
9a3ad87

Release Notes - v1.3.3

What's Changed

New feature

  • Added Llama 4 Scout single-node and multi-node LoRA fine-tuning scripts.

Release v1.3.2

06 Mar 21:17
4417743

Release Notes - v1.3.2

What's Changed

New feature

  • A random 5-character alphanumeric hash is now appended to recipe run names so that Kubernetes does not reject consecutive runs of the same recipe as name collisions (see the illustrative snippet below).
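
A minimal illustration of the effect on a rendered job manifest, assuming a hypothetical run name; the actual naming scheme used by the launcher may differ:

```yaml
# Illustrative only: two consecutive submissions of the same recipe now render
# distinct Kubernetes object names, so the second run is not rejected as a duplicate.
metadata:
  name: llama3-8b-pretrain-x7k2q   # 5-character alphanumeric suffix appended per run
```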

Bug fix

  • Fix Issue #27: address the incorrect rendering of the pod affinity policy 'requiredDuringSchedulingIgnoredDuringExecution' in training.yaml, which affected pod scheduling behavior in Kubernetes clusters (PR #28). See the reference snippet below.
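
For reference, this is the standard Kubernetes shape that the affinity policy must render to in training.yaml; the label selector and topology key below are illustrative placeholders, not the values emitted by the recipes:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      # PodAffinityTerm: schedule this pod only onto topology domains that already
      # run pods matching the label selector.
      - labelSelector:
          matchExpressions:
            - key: app                        # placeholder label key
              operator: In
              values:
                - my-training-job             # placeholder label value
        topologyKey: kubernetes.io/hostname   # co-locate pods on the same node
```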

Release v1.3.1

26 Feb 02:03
45a4c13

Release Notes - v1.3.1

What's Changed

New recipes

  • Added support for fine-tuning DeepSeek R1 671B using PEFT (LoRA and QLoRA).
  • Added support for wandb (Weights & Biases) and MLflow loggers.

All new recipes are listed under the "Model Support" section of the README.

Release v1.2.1

17 Feb 19:52
f9f37ea

Release Notes - v1.2.1

What's Changed

  • Added the ability to force re-installation of the NeMo Adapter so that experimental changes are correctly picked up by the job environment.
  • Improved test coverage for the README.md file.
  • Improved test coverage for the recipes.

Release v1.2.0

01 Feb 01:02
f95303f

Release Notes - v1.2.0

What's Changed

New recipes

  • Added support for DeepSeek's family of distilled R1 models. Users can now fine-tune various sizes of DeepSeek-R1-Distill-Llama and DeepSeek-R1-Distill-Qwen using SFT and PEFT (LoRA/QLoRA).

All new recipes are listed under the "Model Support" section of the README.

Release v1.1.0

31 Dec 21:16
66e49e0

Release Notes - v1.1.0

What's Changed

New recipes

  • Added support for Llama 3.1 70B and Mixtral 22B 128-node pre-training.
  • Added support for Llama 3.3 fine-tuning with SFT and LoRA.
  • Added support for Llama 405B 32k-sequence-length QLoRA fine-tuning.

All new recipes are listed under the "Model Support" section of the README.

Release v1.0.1

24 Dec 01:45
5f8b472

Release Notes - v1.0.1

What's Changed

Bug fixes

  • Upgraded the Transformers library in the enroot Slurm code path to support running Llama 3.2 recipes with an enroot container.

HyperPod Enhancements

  • Added support for additional HyperPod instance types, including p5e and g6.

Release v1.0.0

07 Dec 00:52
5c66df4

Release Notes - v1.0.0

We're thrilled to announce the initial release of sagemaker-hyperpod-recipes!

🎉 Features

  • Unified Job Submission: Submit training and fine-tuning workflows to SageMaker HyperPod or SageMaker training jobs using a single entry point
  • Flexible Configuration: Customize your training jobs with three types of configuration files:
    • General Configuration (e.g., recipes_collection/config.yaml)
    • Cluster Configuration (e.g., recipes_collection/cluster/slurm.yaml)
    • Recipe Configuration (e.g., recipes_collection/recipes/training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain.yaml)
  • Pre-defined LLM Recipes: Access a collection of ready-to-use recipes for training Large Language Models
  • Cluster Agnostic: Compatible with SageMaker HyperPod (with Slurm or Amazon EKS orchestrators) and SageMaker training jobs
  • Built on NVIDIA NeMo Framework: Leverages the NVIDIA NeMo Framework Launcher for efficient job management

🗂️ Repository Structure

  • main.py: Primary entry point for submitting training jobs
  • launcher_scripts/: Collection of commonly used scripts for LLM training
  • recipes_collection/: Pre-defined LLM recipes provided by developers

🔧 Key Components

  1. General Configuration: Common settings like default parameters and environment variables
  2. Cluster Configuration: Cluster-specific settings (e.g., volumes and labels for Kubernetes; job names for Slurm)
  3. Recipe Configuration: Training job settings including model types, sharding degree, and dataset paths (see the sketch below)
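
As a rough sketch of the recipe layer, a configuration covering the settings listed above might look like the following; every key and value here is a hypothetical placeholder chosen for illustration, not the repository's actual schema:

```yaml
# Hypothetical recipe configuration sketch (keys are illustrative, not the real schema)
run:
  name: hf_llama3_8b_seq16k_gpu_p5x16_pretrain   # recipe / run identifier
model:
  model_type: hf_llama3_8b                       # model type
  shard_degree: 8                                # sharding degree
data:
  train_dir: /fsx/datasets/example/train         # training dataset path
  val_dir: /fsx/datasets/example/val             # validation dataset path
```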

📚 Documentation

  • Refer to the README.md for detailed usage instructions and examples

🤝 Contributing

We welcome contributions to enhance the capabilities of sagemaker-hyperpod-recipes. Please refer to our contributing guidelines for more information.

Thank you for choosing sagemaker-hyperpod-recipes for your large-scale language model training needs!