
Checkpointing support for transformer type models #247

Open
wants to merge 36 commits into base: main
Conversation

@zhenghh04 (Member) commented Feb 5, 2025

This PR addresses #248: previously, users had to manually input the layer parameters and optimization groups for checkpointing.

  • Users can now input just "vocab_size, hidden_size, ffn_hidden_size, num_layers"; the layer parameters and optimization groups are calculated internally (see the sketch after this list).
  • We also added an option for specifying zero_stage. The default value is -1, meaning ZeRO is not used and data-parallel instance 0 checkpoints the whole model data.
  • We added several YAML configuration files for Llama transformer models.
  • Users can now specify the datatype used when writing data.
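
As a rough illustration of the kind of derivation described in the first bullet, the sketch below shows how per-layer and total parameter counts could be computed from the four user-supplied dimensions for a Llama-style model (gated FFN, untied input/output embeddings). This is not the code in this PR, and the function names are hypothetical, not DLIO-Benchmark API.

```python
# Illustrative sketch only: derive parameter counts from the four model
# dimensions that the checkpointing config now takes as input.

def llama_layer_params(hidden_size: int, ffn_hidden_size: int) -> int:
    """Parameters in one transformer block (bias-free, RMSNorm)."""
    attention = 4 * hidden_size * hidden_size    # Q, K, V and output projections
    ffn = 3 * hidden_size * ffn_hidden_size      # gate, up and down projections
    norms = 2 * hidden_size                      # two RMSNorm weight vectors
    return attention + ffn + norms

def llama_model_params(vocab_size: int, hidden_size: int,
                       ffn_hidden_size: int, num_layers: int) -> int:
    """Total parameters: embeddings, transformer blocks, final norm, output head."""
    embeddings = 2 * vocab_size * hidden_size    # untied input and output embeddings
    blocks = num_layers * llama_layer_params(hidden_size, ffn_hidden_size)
    final_norm = hidden_size
    return embeddings + blocks + final_norm

if __name__ == "__main__":
    # Llama-2-7B dimensions; prints 6,738,415,616 (roughly 6.74 B parameters).
    n = llama_model_params(vocab_size=32000, hidden_size=4096,
                           ffn_hidden_size=11008, num_layers=32)
    print(f"approx. total parameters: {n:,}")
```

In a real run, counts like these would then be split into per-rank checkpoint sizes according to the chosen zero_stage and data-parallel layout.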

@zhenghh04 added the enhancement (New feature or request) label on Feb 7, 2025
@zhenghh04 marked this pull request as ready for review on Feb 7, 2025, 04:38
@zhenghh04 (Member, Author) commented:

@hariharan-devarajan This PR is ready for review.
