
Checkpointing support for transformer type models #247

Open
wants to merge 36 commits into base: main
Conversation

@zhenghh04 (Member) commented Feb 5, 2025

This PR addresses #248: previously, users had to manually input the layer parameters and optimization groups for checkpointing.

  • Users can now input just "vocab_size, hidden_size, ffn_hidden_size, num_layers"; the layer parameters and optimization groups are calculated internally (see the sketch after this list).
  • We also added an option for specifying zero_stage. The default value is -1, meaning ZeRO is not used and data-parallel instance 0 checkpoints the whole model data.
  • We added several YAML configuration files for Llama transformer models.
  • Users can now specify the datatype used when writing data.
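
As a rough illustration of the kind of derivation described in the first bullet, the sketch below shows how per-layer and total parameter counts could be computed from the four user-supplied dimensions for a Llama-style model (gated FFN, untied input/output embeddings). This is not the code in this PR, and the function names are hypothetical, not DLIO-Benchmark API.

```python
# Illustrative sketch only: derive parameter counts from the four model
# dimensions that the checkpointing config now takes as input.

def llama_layer_params(hidden_size: int, ffn_hidden_size: int) -> int:
    """Parameters in one transformer block (bias-free, RMSNorm)."""
    attention = 4 * hidden_size * hidden_size    # Q, K, V and output projections
    ffn = 3 * hidden_size * ffn_hidden_size      # gate, up and down projections
    norms = 2 * hidden_size                      # two RMSNorm weight vectors
    return attention + ffn + norms

def llama_model_params(vocab_size: int, hidden_size: int,
                       ffn_hidden_size: int, num_layers: int) -> int:
    """Total parameters: embeddings, transformer blocks, final norm, output head."""
    embeddings = 2 * vocab_size * hidden_size    # untied input and output embeddings
    blocks = num_layers * llama_layer_params(hidden_size, ffn_hidden_size)
    final_norm = hidden_size
    return embeddings + blocks + final_norm

if __name__ == "__main__":
    # Llama-2-7B dimensions; prints 6,738,415,616 (roughly 6.74 B parameters).
    n = llama_model_params(vocab_size=32000, hidden_size=4096,
                           ffn_hidden_size=11008, num_layers=32)
    print(f"approx. total parameters: {n:,}")
```

In a real run, counts like these would then be split into per-rank checkpoint sizes according to the chosen zero_stage and data-parallel layout.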

@zhenghh04 added the enhancement (New feature or request) label on Feb 7, 2025
@zhenghh04 marked this pull request as ready for review on Feb 7, 2025, 04:38
@zhenghh04 (Member, Author) commented:

@hariharan-devarajan This PR is ready for review.
