[TorchAcc][Experimental] Integrate more models in TorchAcc (#683)
Jintao-Huang merged 49 commits into modelscope:main
Conversation
* [TorchAcc] Fix batch split when padding_to is not None.
* Fix lint.
examples/pytorch/llm/scripts/torchacc/baichuan2_13b_chat/acc_lora_dp_sft.sh
examples/pytorch/llm/scripts/torchacc/baichuan2_13b_chat/acc_lora_fsdp_sft.sh
swift/llm/utils/argument.py (outdated)

    logging_steps: int = 5
    dataloader_num_workers: int = 1
    dataloader_pin_memory: bool = True
    dataloader_drop_last: bool = True

Comment: A question: why not use the default value False from training_args here?
swift/llm/sft.py (outdated)

    logger.info(f'val_dataset_sample: {val_dataset_sample}')
    val_idxs = random_state.permutation(val_dataset_sample)
    val_dataset = val_dataset.select(val_idxs)
    training_args.train_dataset_sample = train_dataset.shape[

Comment: train_dataset_sample will be removed in the next version; use '{dataset_name}#{train_sample}|{val_sample}' to control the size of an individual dataset.

Reply:
- Here we need an overall train_dataset_sample for the later warmup_step computation.
- Also, the train_dataset_sample in this PR comes from the train_dataset produced by Refactor dataset #802 and is passed to the corresponding field of SwiftArgumentMixin, so it should be independent of the earlier dataset processing; please correct me if I have misunderstood.
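As a rough illustration of the '{dataset_name}#{train_sample}|{val_sample}' syntax mentioned above, such a spec string could be parsed like this. This is a hypothetical sketch only; the function name and the handling of omitted parts are assumptions, and the actual ms-swift parser may differ.

```python
def parse_dataset_spec(spec: str):
    """Parse a spec like 'alpaca#2000|200' into (name, train_sample, val_sample).

    Hypothetical sketch of the '{dataset_name}#{train_sample}|{val_sample}'
    syntax discussed in this thread; not the real ms-swift implementation.
    """
    name, _, counts = spec.partition('#')
    if not counts:
        # No '#': take the dataset as-is, no sampling requested.
        return name, None, None
    train_s, _, val_s = counts.partition('|')
    return name, int(train_s), (int(val_s) if val_s else None)
```

For example, `parse_dataset_spec('alpaca#2000|200')` would yield the dataset name plus a train and validation sample count, while `'alpaca'` alone requests no sampling.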
swift/llm/utils/argument.py (outdated)

    neftune_alpha: Optional[float] = None
    deepspeed_config_path: Optional[str] = None
    model_cache_dir: Optional[str] = None
    metric_warmup_step: Optional[float] = 0

Comment: Could you explain what these three hyperparameters mean? Are they only effective under torch_acc? Please add a comment noting that they only take effect with torchacc.

Reply: The first two are general; fsdp_num only takes effect with torchacc. I will add a comment for it.
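One way to document this in the argument dataclass itself is to put the scope into the field metadata. The class name, default values, and help strings below are illustrative assumptions, not the actual ms-swift definitions:

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical sketch of how the fields discussed above could carry their
# own documentation; names mirror the diff, defaults are illustrative.
@dataclass
class AccArguments:
    # Step (int) or fraction of total steps (float in (0, 1)) after which
    # training-speed metrics start being averaged (graph-compilation warmup).
    metric_warmup_step: Optional[float] = 0
    # Number of FSDP shards; only effective when training with TorchAcc.
    fsdp_num: int = field(
        default=1, metadata={'help': 'Only effective in TorchAcc.'})
```

Keeping the scope note in `metadata={'help': ...}` means it also surfaces in generated `--help` output when the dataclass is consumed by an argument parser.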
swift/torchacc_utils.py (outdated)

    from typing import List
    from typing import List, Optional, Tuple

    import einops

Comment: ms-swift does not have this dependency; please verify whether it causes an error.
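A common pattern for a dependency that is not in the base requirements is to resolve it lazily and fail only when the feature is actually used. This sketch is one possible way to address the comment, not necessarily how the PR resolves it:

```python
import importlib


def optional_import(name):
    """Return the named module if installed, else None.

    Pattern for optional dependencies: the import error is deferred to
    the code path that actually needs the module.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# einops stays None when the optional dependency is absent; callers can
# then raise a clear "pip install einops" error at point of use.
einops = optional_import('einops')
```

With this, a TorchAcc-only code path can check `if einops is None` and raise an actionable message instead of failing at import time for every ms-swift user.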
    default='token', metadata={'choices': ['token', 'sentence']})
    additional_saved_files: Optional[List[str]] = None
    metric_warmup_step: Optional[float] = 0
    train_dataset_sample: Optional[int] = -1

Comment: Same as above: train_dataset_sample will be removed in the next version, which may have an impact.
swift/trainers/callback.py (outdated)

    logs=None,
    **kwargs):
    logs['global_step'] = state.global_step
    if state.global_step >= self.metric_warmup_step and self.warmup_start_time == 0:
swift/trainers/callback.py (outdated)

    self.training_bar = tqdm(
        desc='Train', total=state.max_steps, dynamic_ncols=True)
    self.current_step = 0
    self.warmup_start_time = 0

Comment: Could you explain this part?

Reply:
- TorchAcc has to compile the computation graph at the start of training, which makes the early steps much slower than the steady state. This warmup_step was added so that the average training speed is measured only from step warmup_step onward.
- The logic: once the current step reaches warmup_step, transformers' speed_metrics function is called to compute and update the metric.
- args.metric_warmup_step can be an int or a float, i.e. either a concrete step count or a ratio of the total steps.
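The logic in the reply can be sketched as follows. Note this is a simplified, self-contained illustration: the class and method names are invented, and it computes steps/second by hand where the actual callback delegates to transformers' speed_metrics.

```python
import time


def resolve_warmup_step(metric_warmup_step, max_steps):
    """metric_warmup_step may be an int (absolute step) or a float in
    (0, 1) (fraction of max_steps), as described in the reply above."""
    if isinstance(metric_warmup_step, float) and 0 < metric_warmup_step < 1:
        return int(metric_warmup_step * max_steps)
    return int(metric_warmup_step)


class WarmupSpeed:
    """Illustrative sketch of a warmup-aware speed metric (hypothetical
    names; the real callback wraps transformers' speed_metrics)."""

    def __init__(self, metric_warmup_step):
        self.metric_warmup_step = metric_warmup_step
        self.warmup_start_time = 0.0  # 0 means "not started yet"

    def on_step(self, global_step, max_steps, now=None):
        now = time.time() if now is None else now
        warmup = resolve_warmup_step(self.metric_warmup_step, max_steps)
        if global_step >= warmup and self.warmup_start_time == 0:
            # Graph compilation is done; steady-state timing begins here.
            self.warmup_start_time = now
        if self.warmup_start_time and global_step > warmup:
            elapsed = now - self.warmup_start_time
            return {'train_steps_per_second': (global_step - warmup) / elapsed}
        return {}
```

The key point is that steps before `warmup` never enter the average, so the reported speed reflects post-compilation throughput.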
swift/llm/sft.py (outdated)

    logger.info(f'The logging file will be saved in: {logging_path}')
    trainer.train(training_args.resume_from_checkpoint)

    if args.use_profiler:

Comment: Could you explain the logic here, or isolate it behind an environment variable?

Reply: This adds a profiler for training; it is independent of whether TorchAcc is used.
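The reviewer's suggestion of environment-variable isolation could look like the sketch below. The variable name `SWIFT_USE_PROFILER` and the trace output path are assumptions for illustration, not anything ms-swift defines:

```python
import os
from contextlib import nullcontext


def maybe_profiler(env_var='SWIFT_USE_PROFILER'):
    """Return a torch.profiler context when the (hypothetical) env var is
    set to '1', else a no-op context, so the training loop is unchanged."""
    if os.getenv(env_var, '0') != '1':
        return nullcontext()
    # Imported lazily so torch is only required when profiling is enabled.
    import torch.profiler
    return torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA],
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profile'))


# Usage sketch:
#     with maybe_profiler():
#         trainer.train(training_args.resume_from_checkpoint)
```

Gating it this way keeps the profiler out of the default path while still being one flag flip away for performance debugging.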
…3_paligemma
* commit '20bc628746772836fe3838e16e87fb27c39b5ec8':
  - fix val_dataset (modelscope#992)
  - update custom_val_dataset (modelscope#991)
  - [TorchAcc][Experimental] Integrate more model in torchacc (modelscope#683)
  - fix cpu 'torch._C' has no attribute '_cuda_resetPeakMemoryStats' (modelscope#914)
  - refactor readme web-ui (modelscope#983)
  - support transformers==4.41 (modelscope#979)
  - support more models (modelscope#971)
* main: (23 commits)
  - fix gr limit (modelscope#1016)
  - fix minicpm-v (modelscope#1010)
  - fix cogvlm2 history (modelscope#1005)
  - Updated a link in Command-line-parameters.md (modelscope#1001)
  - fix template example copy (modelscope#1003)
  - Feat/phi3 paligemma (modelscope#998)
  - fix pt deploy lora (modelscope#999)
  - fix args (modelscope#996)
  - fix val_dataset (modelscope#992)
  - update custom_val_dataset (modelscope#991)
  - [TorchAcc][Experimental] Integrate more model in torchacc (modelscope#683)
  - fix cpu 'torch._C' has no attribute '_cuda_resetPeakMemoryStats' (modelscope#914)
  - refactor readme web-ui (modelscope#983)
  - support transformers==4.41 (modelscope#979)
  - support more models (modelscope#971)
  - Fix minicpm device map (modelscope#978)
  - fix typing (modelscope#974)
  - fix vllm eos_token (modelscope#973)
  - Support minicpm-v-v2_5-chat (modelscope#970)
  - support cogvlm2-en-chat-19b (modelscope#967)
  - ...
PR type
PR information
Previous PR: #647
Experiment results
We have tested some models with torchacc and swift.