Releases · aws/sagemaker-hyperpod-recipes
v1.4.0
v1.3.3
Release v1.3.2
Release Notes - v1.3.2
What's Changed
New feature
- A random 5-character alphanumeric hash is appended to the end of recipe run names so that Kubernetes does not block consecutive runs of the same recipe (sketched below).
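A minimal sketch of the idea, not the repository's actual implementation: a short random alphanumeric suffix keeps each Kubernetes run name unique, so re-running the same recipe does not collide with the previous run's resources. The helper name below is hypothetical.

```python
# Hypothetical sketch of the run-name suffix described above; the recipes'
# real helper and naming rules may differ.
import random
import string

def append_run_suffix(run_name: str, length: int = 5) -> str:
    """Append a random alphanumeric suffix so repeated runs get unique names."""
    # Lowercase letters and digits keep the suffix valid in Kubernetes resource names.
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=length))
    return f"{run_name}-{suffix}"

print(append_run_suffix("llama3-8b-pretrain"))  # e.g. "llama3-8b-pretrain-x7k2q"
```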
Bug fix
Release v1.3.1
Release Notes - v1.3.1
What's Changed
New recipes
- Added support for fine-tuning DeepSeek R1 671B using PEFT (LoRA and QLoRA).
- Added support for Wandb and MLflow loggers (illustrated below).
All new recipes are listed under the "Model Support" section of the README.
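The loggers are enabled through the recipe configuration files (see the README for the exact keys). As a hedged illustration only: the recipes run on the NeMo / PyTorch Lightning stack, where these options typically map to loggers like the ones below; the project and experiment names are placeholders.

```python
# Illustration of what the Wandb/MLflow logger support corresponds to on the
# underlying PyTorch Lightning stack; not the recipes' own wiring.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import MLFlowLogger, WandbLogger

wandb_logger = WandbLogger(project="hyperpod-recipes-demo")            # placeholder project name
mlflow_logger = MLFlowLogger(experiment_name="hyperpod-recipes-demo")  # placeholder experiment name

# Metrics from the training job are reported to both tracking backends.
trainer = Trainer(logger=[wandb_logger, mlflow_logger])
```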
Release v1.2.1
Release Notes - v1.2.1
What's Changed
- Added the ability to force re-installation of the NeMo adapter so that experimental changes are correctly picked up by the job environment.
- Improved test coverage for the README.md file.
- Improved test coverage for the recipes.
Notes
- Users should update to version 1.1.1 or later of the SageMaker HyperPod training adapter for NeMo when using this version of HyperPod recipes. Earlier versions may experience compatibility issues.
Release v1.2.0
Release Notes - v1.2.0
What's Changed
New recipes
- Added support for DeepSeek's family of distilled R1 models. Users can now fine-tune various sizes of DeepSeek-R1-Distill-Llama and DeepSeek-R1-Distill-Qwen using SFT and PEFT (LoRA/QLoRA).
All new recipes are listed under the "Model Support" section of the README.
Notes
- Users should update to version 1.1.0 or later of the SageMaker HyperPod training adapter for NeMo when using this version of HyperPod recipes. Earlier versions may experience compatibility issues.
Release v1.1.0
Release Notes - v1.1.0
What's Changed
New recipes
- Added support for Llama 3.1 70B and Mixtral 22B 128-node pre-training.
- Added support for Llama 3.3 fine-tuning with SFT and LoRA.
- Added support for Llama 405B QLoRA fine-tuning with a 32k sequence length.
All new recipes are listed under the "Model Support" section of the README.
Release v1.0.1
Release Notes - v1.0.1
What's Changed
Bug fixes
- Upgraded the Transformers library in the enroot Slurm code path to support running Llama 3.2 recipes with an enroot container.
HyperPod Enhancements
- Added support for additional HyperPod instance types, including p5e and g6.
Release v1.0.0
Release Notes - v1.0.0
We're thrilled to announce the initial release of sagemaker-hyperpod-recipes!
🎉 Features
- Unified Job Submission: Submit training and fine-tuning workflows to SageMaker HyperPod or SageMaker training jobs using a single entry point (see the submission sketch after this list)
- Flexible Configuration: Customize your training jobs with three types of configuration files:
  - General Configuration (ex: recipes_collection/config.yaml)
  - Cluster Configuration (ex: recipes_collection/cluster/slurm.yaml)
  - Recipe Configuration (ex: recipes_collection/recipes/training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain.yaml)
- Pre-defined LLM Recipes: Access a collection of ready-to-use recipes for training Large Language Models
- Cluster Agnostic: Compatible with SageMaker HyperPod (with Slurm or Amazon EKS orchestrators) and SageMaker training jobs
- Built on NVIDIA NeMo Framework: Leverages the NVIDIA NeMo Framework Launcher for efficient job management
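As a sketch of the single-entry-point flow: whether the target is a HyperPod cluster or a SageMaker training job, the submission goes through main.py. The override names below are placeholders in the Hydra style used by the NeMo Framework Launcher; the README documents the actual invocation.

```python
# Hedged sketch only: submit a recipe through the single entry point.
# The override keys are illustrative placeholders, not the documented CLI.
import subprocess

subprocess.run(
    [
        "python3",
        "main.py",
        # Recipe configuration: which training recipe to run (example path from above).
        "recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain",
        # Cluster configuration: where the job runs (e.g., Slurm on HyperPod).
        "cluster=slurm",
        # General configuration override: where results are written (placeholder path).
        "base_results_dir=./results",
    ],
    check=True,
)
```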
🗂️ Repository Structure
- main.py: Primary entry point for submitting training jobs
- launcher_scripts/: Collection of commonly used scripts for LLM training
- recipes_collection/: Pre-defined LLM recipes provided by developers
🔧 Key Components
- General Configuration: Common settings like default parameters and environment variables
- Cluster Configuration: Cluster-specific settings (e.g., volumes and labels for Kubernetes; job names for Slurm)
- Recipe Configuration: Training job settings, including model types, sharding degree, and dataset paths (see the sketch below)
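The three layers above are plain YAML files, so their relationship can be pictured as a layered merge, with the cluster and recipe files refining the general defaults. The sketch below uses OmegaConf purely as an illustration; the actual composition is performed by Hydra and the NeMo Framework Launcher inside main.py.

```python
# Conceptual sketch of the configuration layering; not the launcher's actual code.
from omegaconf import OmegaConf

general = OmegaConf.load("recipes_collection/config.yaml")         # defaults, env variables
cluster = OmegaConf.load("recipes_collection/cluster/slurm.yaml")  # cluster-specific settings
recipe = OmegaConf.load(
    "recipes_collection/recipes/training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain.yaml"
)                                                                  # model, sharding, data paths

# Later layers refine earlier ones; the real config-group layout is handled by Hydra.
config = OmegaConf.merge(general, cluster, recipe)
print(OmegaConf.to_yaml(config))
```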
📚 Documentation
- Refer to the README.md for detailed usage instructions and examples
🤝 Contributing
We welcome contributions to enhance the capabilities of sagemaker-hyperpod-recipes. Please refer to our contributing guidelines for more information.
Thank you for choosing sagemaker-hyperpod-recipes for your large-scale language model training needs!