
Conversation


Zhikaiiii (Collaborator) commented Apr 11, 2024

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Model or Dataset Support

PR information

Previous PR: #647

  1. Integrate more model patch functions for TorchAcc (a sketch of the patching pattern follows this list).
  2. Support computing speed metrics after a number of warmup steps, since TorchAcc spends time on graph compilation at the beginning of training.
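For readers less familiar with the pattern, "model patch functions" generally means a registry mapping a model type to a function that rewrites that model's layers (e.g. attention) into accelerator-friendly versions. A minimal sketch of the pattern; all names are hypothetical, not TorchAcc's or ms-swift's actual API:

```python
# Hypothetical sketch of a per-model patch registry; not the PR's real API.
MODEL_PATCHERS = {}

def register_patcher(model_type):
    """Register a function that rewrites a model for the accelerator."""
    def decorator(fn):
        MODEL_PATCHERS[model_type] = fn
        return fn
    return decorator

@register_patcher('llama')
def patch_llama(model):
    # e.g. replace attention with an accelerator-friendly implementation
    return model

def patch_model(model, model_type):
    """Apply the registered patcher if one exists, otherwise no-op."""
    patcher = MODEL_PATCHERS.get(model_type)
    return patcher(model) if patcher else model
```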

Experiment results


We tested the following models with TorchAcc and SWIFT:

  1. llama2-13b

| method | train_sample/s | train_sample/s after warmup |
| --- | --- | --- |
| torchacc + 2fsdp | 3.775 | 4.426 (1.13x) |
| torchacc + 2ddp | 4.997 (1.28x) | 5.416 (1.38x) |
| swift + 2ddp | 3.899 | 3.912 |

  2. baichuan2-13b

| method | train_sample/s | train_sample/s after warmup |
| --- | --- | --- |
| torchacc + 2fsdp | 5.014 (1.32x) | 6.039 (1.60x) |
| torchacc + 2ddp | 6.218 (1.63x) | 6.861 (1.80x) |
| swift + 2ddp | 3.812 | 3.815 |

  3. chatglm3-6b

| method | train_sample/s | train_sample/s after warmup |
| --- | --- | --- |
| torchacc + 2fsdp | 9.859 (1.82x) | 11.896 (2.19x) |
| swift + 2ddp | 5.431 | - |

  4. yi-34b

| method | train_sample/s | train_sample/s after warmup |
| --- | --- | --- |
| torchacc + 4fsdp | 2.349 | 2.978 (1.24x) |
| swift + 2ddp + 2mp | 2.411 | 2.411 |

  5. llama3-8b

| method | train_sample/s | train_sample/s after warmup |
| --- | --- | --- |
| torchacc + 2ddp | 9.569 (1.17x) | 10.593 (1.30x) |
| swift + 2ddp | 8.126 | - |

  6. qwen1.5-14b

| method | train_sample/s | train_sample/s after warmup |
| --- | --- | --- |
| torchacc + 2ddp | 5.293 (1.07x) | 5.765 (1.17x) |
| swift + 2ddp | 4.944 | - |

baoleai and others added 30 commits January 25, 2024 14:09
* [TorchAcc] Fix batch split when padding_to is not None.

* fix lint
```python
logging_steps: int = 5
dataloader_num_workers: int = 1
dataloader_pin_memory: bool = True
dataloader_drop_last: bool = True
```
Collaborator:
Why not use the training_args default value of False here?

swift/llm/sft.py Outdated
```python
logger.info(f'val_dataset_sample: {val_dataset_sample}')
val_idxs = random_state.permutation(val_dataset_sample)
val_dataset = val_dataset.select(val_idxs)
training_args.train_dataset_sample = train_dataset.shape[
```
Collaborator:
Why is this done here?

Collaborator:
train_dataset_sample will be removed in the next version; use '{dataset_name}#{train_sample}|{val_sample}' to control the sample count of an individual dataset.

Collaborator Author:
> train_dataset_sample will be removed in the next version; use '{dataset_name}#{train_sample}|{val_sample}' to control the sample count of an individual dataset.

  1. We need a total train_dataset_sample here to compute warmup_step later (see the sketch after this list).
  2. Also, the train_dataset_sample in this PR is taken from the train_dataset produced by Refactor dataset #802 and passed to the corresponding field of SwiftArgumentMixin, so it should be independent of the earlier dataset processing; let me know if I've misunderstood.
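To illustrate point 1, a minimal sketch (the helper and its parameters are hypothetical) of why the total sample count is needed: a fractional metric_warmup_step has to be resolved against the total number of training steps, which is derived from train_dataset_sample.

```python
# Hypothetical helper -- not the PR's code. Resolves metric_warmup_step
# (int = absolute step, float < 1 = ratio) to an absolute step count.
def resolve_metric_warmup_step(metric_warmup_step, train_dataset_sample,
                               total_batch_size, num_train_epochs=1):
    total_steps = (train_dataset_sample // total_batch_size) * num_train_epochs
    if isinstance(metric_warmup_step, float) and metric_warmup_step < 1:
        return int(metric_warmup_step * total_steps)
    return int(metric_warmup_step)
```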

```python
neftune_alpha: Optional[float] = None
deepspeed_config_path: Optional[str] = None
model_cache_dir: Optional[str] = None
metric_warmup_step: Optional[float] = 0
```
Collaborator:
Could you explain what these three hyperparameters mean? Do they only take effect with torchacc?

Could you add a comment noting that they are only effective under torchacc?

Collaborator:
Should this be a float or an int?

Collaborator Author:
> Could you explain what these three hyperparameters mean? Do they only take effect with torchacc?
>
> Could you add a comment noting that they are only effective under torchacc?

The first two should be general-purpose; fsdp_num is only effective under torchacc. I'll add a comment.
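A sketch of what the requested annotation could look like (not the PR's final code; fsdp_num's default value is assumed):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SwiftArgumentMixin:
    # Compute speed metrics starting from this step. An int is an absolute
    # step count; a float in (0, 1) is a fraction of the total steps.
    # General-purpose, not TorchAcc-specific.
    metric_warmup_step: Optional[float] = 0
    # Number of FSDP workers; only effective when training with TorchAcc.
    fsdp_num: int = 1  # default value assumed for illustration
```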

```diff
-from typing import List
+from typing import List, Optional, Tuple
+
+import einops
```
Collaborator:
ms-swift does not have this dependency; verify whether it will cause an import error.
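One common way to avoid a hard import failure, shown as a hedged sketch (the guard helper is illustrative):

```python
# Sketch: make einops optional so importing the module succeeds on
# installs without it, and fail only when the feature is actually used.
try:
    import einops
except ImportError:
    einops = None

def _require_einops():
    if einops is None:
        raise ImportError(
            'this code path requires `einops`; install it with `pip install einops`')
```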

```python
default='token', metadata={'choices': ['token', 'sentence']})
additional_saved_files: Optional[List[str]] = None
metric_warmup_step: Optional[float] = 0
train_dataset_sample: Optional[int] = -1
```
Collaborator:
Same as above: train_dataset_sample will be removed in the next version, which may affect this.


```python
logs=None,
**kwargs):
logs['global_step'] = state.global_step
if state.global_step >= self.metric_warmup_step and self.warmup_start_time == 0:
```
Collaborator:
Could you explain the logic here?

Collaborator Author:
> Could you explain the logic here?

Same as above.

```python
self.training_bar = tqdm(
    desc='Train', total=state.max_steps, dynamic_ncols=True)
self.current_step = 0
self.warmup_start_time = 0
```
Collaborator:
Could you explain this part?

Collaborator Author:
> Could you explain this part?

  1. TorchAcc compiles the computation graph at the start of training, so the early training speed is much slower than the later stable speed. We therefore added this warmup_step: the average training speed is computed starting from the warmup step.
  2. The logic: once the current step reaches the warmup step, we call transformers' speed_metrics function to compute and update the metric (a sketch follows this list).
  3. args.metric_warmup_step can be an int or a float, representing an absolute step count or a ratio.
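Putting the three points together, here is a minimal sketch of the described callback logic (class and attribute names are illustrative rather than the PR's exact code; speed_metrics is the existing helper in transformers.trainer_utils):

```python
import time

from transformers import TrainerCallback
from transformers.trainer_utils import speed_metrics

class WarmupSpeedCallback(TrainerCallback):
    """Report training speed measured only after a warmup phase."""

    def __init__(self, metric_warmup_step=0):
        # int -> absolute step; float in (0, 1) -> fraction of max_steps
        self.metric_warmup_step = metric_warmup_step
        self.warmup_start_time = 0
        self.warmup_start_step = 0

    def on_train_begin(self, args, state, control, **kwargs):
        if isinstance(self.metric_warmup_step, float) and 0 < self.metric_warmup_step < 1:
            self.metric_warmup_step = int(self.metric_warmup_step * state.max_steps)

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Start timing once the warmup step is reached (compilation is done).
        if state.global_step >= self.metric_warmup_step and self.warmup_start_time == 0:
            self.warmup_start_time = time.time()
            self.warmup_start_step = state.global_step
        # From then on, log the post-warmup average speed.
        num_steps = state.global_step - self.warmup_start_step
        if self.warmup_start_time > 0 and num_steps > 0 and logs is not None:
            logs.update(speed_metrics('warmup', self.warmup_start_time, num_steps=num_steps))
```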

swift/llm/sft.py Outdated
```python
logger.info(f'The logging file will be saved in: {logging_path}')
trainer.train(training_args.resume_from_checkpoint)

if args.use_profiler:
```
Collaborator:
Could you describe the logic here, or isolate it behind an environment variable?

Collaborator Author:
> Could you describe the logic here, or isolate it behind an environment variable?

This adds a profiler for the training run; it is independent of whether torchacc is used.
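For reference, a hedged sketch of flag-gated profiling around trainer.train with the standard torch.profiler API (args.use_profiler comes from the PR; the output path is illustrative, and args/trainer are the objects from the surrounding sft.py context):

```python
import torch

if args.use_profiler:
    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA],
            # traces are viewable in TensorBoard
            on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_output')):
        trainer.train(training_args.resume_from_checkpoint)
else:
    trainer.train(training_args.resume_from_checkpoint)
```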

@Jintao-Huang Jintao-Huang merged commit fdb7a4d into modelscope:main May 22, 2024
tastelikefeet added a commit to tastelikefeet/swift that referenced this pull request May 24, 2024
…3_paligemma

* commit '20bc628746772836fe3838e16e87fb27c39b5ec8':
  fix val_dataset (modelscope#992)
  update custom_val_dataset (modelscope#991)
  [TorchAcc][Experimental] Integrate more model in torchacc (modelscope#683)
  fix cpu 'torch._C' has no attribute '_cuda_resetPeakMemoryStats' (modelscope#914)
  refactor readme web-ui (modelscope#983)
  support  transformers==4.41 (modelscope#979)
  support more models (modelscope#971)
tastelikefeet added a commit to tastelikefeet/swift that referenced this pull request May 28, 2024
* main: (23 commits)
  fix gr limit (modelscope#1016)
  fix minicpm-v (modelscope#1010)
  fix cogvlm2 history (modelscope#1005)
  Updated a link in Command-line-parameters.md (modelscope#1001)
  fix template example copy (modelscope#1003)
  Feat/phi3 paligemma (modelscope#998)
  fix pt deploy lora (modelscope#999)
  fix args (modelscope#996)
  fix val_dataset (modelscope#992)
  update custom_val_dataset (modelscope#991)
  [TorchAcc][Experimental] Integrate more model in torchacc (modelscope#683)
  fix cpu 'torch._C' has no attribute '_cuda_resetPeakMemoryStats' (modelscope#914)
  refactor readme web-ui (modelscope#983)
  support  transformers==4.41 (modelscope#979)
  support more models (modelscope#971)
  Fix minicpm device map (modelscope#978)
  fix typing (modelscope#974)
  fix vllm eos_token (modelscope#973)
  Support minicpm-v-v2_5-chat (modelscope#970)
  support cogvlm2-en-chat-19b (modelscope#967)
  ...
hjh0119 pushed a commit to hjh0119/swift that referenced this pull request Jul 22, 2024