support megatron #2885

Merged: 105 commits, Mar 17, 2025
Commits
d875242
support megatron
Jintao-Huang Jan 8, 2025
13e4a65
update
Jintao-Huang Jan 9, 2025
f230e01
update
Jintao-Huang Jan 9, 2025
9b92ae0
update
Jintao-Huang Jan 9, 2025
e02c519
update
Jintao-Huang Jan 9, 2025
b9b85e5
update
Jintao-Huang Jan 9, 2025
65fcd63
update
Jintao-Huang Jan 9, 2025
bd7547c
update
Jintao-Huang Jan 9, 2025
83dc334
update
Jintao-Huang Jan 9, 2025
9a8c458
update
Jintao-Huang Jan 9, 2025
836fbcf
update
Jintao-Huang Jan 9, 2025
a8e25a2
Merge branch 'main' into support_megatron_0108
Jintao-Huang Jan 9, 2025
2f71dbc
Merge branch 'main' into support_megatron_0108
Jintao-Huang Feb 26, 2025
f19ecee
lint pass
Jintao-Huang Feb 26, 2025
17f4f43
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 3, 2025
f8460d5
update
Jintao-Huang Mar 3, 2025
d5788bb
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 3, 2025
0bbdb12
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 4, 2025
2682285
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 5, 2025
04910f3
update
Jintao-Huang Mar 5, 2025
8514b8a
update
Jintao-Huang Mar 5, 2025
bdd1692
fix
Jintao-Huang Mar 5, 2025
6a0aa00
update
Jintao-Huang Mar 5, 2025
57aa7b3
update
Jintao-Huang Mar 5, 2025
83503d6
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 5, 2025
0260e35
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 5, 2025
745bead
fix
Jintao-Huang Mar 5, 2025
ff51fff
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 6, 2025
9c341f8
update
Jintao-Huang Mar 6, 2025
83d3bbc
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 6, 2025
2c52610
update
Jintao-Huang Mar 6, 2025
34cfa0c
update
Jintao-Huang Mar 6, 2025
2b18fa2
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 7, 2025
c5d110c
update
Jintao-Huang Mar 7, 2025
68ada6b
update
Jintao-Huang Mar 7, 2025
2388371
update
Jintao-Huang Mar 8, 2025
698d856
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 8, 2025
a3c6e59
update
Jintao-Huang Mar 9, 2025
ecc6367
lint pass
Jintao-Huang Mar 9, 2025
d30a001
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 9, 2025
12818a9
update
Jintao-Huang Mar 9, 2025
924dd97
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
4e7071f
update
Jintao-Huang Mar 9, 2025
9a8a2b5
update
Jintao-Huang Mar 9, 2025
5b6e2c1
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
e7a5f19
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
e8e548e
update
Jintao-Huang Mar 9, 2025
5a24b28
update
Jintao-Huang Mar 9, 2025
85093d9
update
Jintao-Huang Mar 9, 2025
63637ca
update
Jintao-Huang Mar 9, 2025
abcd6c6
update
Jintao-Huang Mar 9, 2025
e6a2e78
update
Jintao-Huang Mar 9, 2025
5c66711
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
277a7b1
fix
Jintao-Huang Mar 9, 2025
9f089c3
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
8b82a5f
update
Jintao-Huang Mar 9, 2025
170c6ad
update
Jintao-Huang Mar 9, 2025
43c9e15
update
Jintao-Huang Mar 9, 2025
707b123
update
Jintao-Huang Mar 10, 2025
d0aba11
update
Jintao-Huang Mar 10, 2025
1b86699
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 11, 2025
bbf59a4
update
Jintao-Huang Mar 11, 2025
57110e2
update
Jintao-Huang Mar 11, 2025
75b87e3
update
Jintao-Huang Mar 11, 2025
8d5c9a8
update
Jintao-Huang Mar 11, 2025
79549b3
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 11, 2025
210d3e1
update
Jintao-Huang Mar 11, 2025
20fc25f
update
Jintao-Huang Mar 11, 2025
55a6c28
update
Jintao-Huang Mar 11, 2025
f110818
update
Jintao-Huang Mar 11, 2025
9e235f8
update
Jintao-Huang Mar 11, 2025
da076ef
update
Jintao-Huang Mar 11, 2025
6d5574a
update
Jintao-Huang Mar 12, 2025
860e8c1
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 14, 2025
a9202dc
update
Jintao-Huang Mar 14, 2025
5534d7f
update
Jintao-Huang Mar 15, 2025
301982e
update
Jintao-Huang Mar 15, 2025
920d868
update
Jintao-Huang Mar 15, 2025
017ac34
update
Jintao-Huang Mar 15, 2025
b5a118f
fix
Jintao-Huang Mar 15, 2025
8c7ae0d
update
Jintao-Huang Mar 16, 2025
457238f
update
Jintao-Huang Mar 16, 2025
3d0b1f5
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 16, 2025
9342633
update
Jintao-Huang Mar 16, 2025
8a843ea
update
Jintao-Huang Mar 16, 2025
14e1c29
update
Jintao-Huang Mar 16, 2025
fcdc178
update
Jintao-Huang Mar 16, 2025
2807335
update
Jintao-Huang Mar 16, 2025
cd8ed92
update
Jintao-Huang Mar 16, 2025
992249f
update
Jintao-Huang Mar 16, 2025
ee18776
update
Jintao-Huang Mar 16, 2025
a34b1f4
update
Jintao-Huang Mar 16, 2025
9d2bf7f
update
Jintao-Huang Mar 16, 2025
8bdbf66
update
Jintao-Huang Mar 16, 2025
7728d9d
update
Jintao-Huang Mar 16, 2025
2f90c98
update
Jintao-Huang Mar 16, 2025
7a9104f
update
Jintao-Huang Mar 16, 2025
6cd49eb
update
Jintao-Huang Mar 16, 2025
04fe85a
update
Jintao-Huang Mar 16, 2025
0be8cf9
update
Jintao-Huang Mar 16, 2025
bbac19e
update
Jintao-Huang Mar 16, 2025
e7cdec9
fix
Jintao-Huang Mar 16, 2025
a05abff
update
Jintao-Huang Mar 16, 2025
4a57e79
update
Jintao-Huang Mar 17, 2025
3325347
update
Jintao-Huang Mar 17, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -141,6 +141,7 @@ my_model/
result/
images
/custom/
megatron_output/

# Pytorch
*.pth
1 change: 1 addition & 0 deletions README.md
@@ -78,6 +78,7 @@ You can contact us and communicate with us by adding our group:


## 🎉 News
- 🎁 2025.03.16: SWIFT supports training with Megatron parallelism techniques. Please refer to the [Megatron-SWIFT Training Documentation](https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html).
- 🎁 2025.03.15: SWIFT supports fine-tuning of gme (multi-modal) embedding models; please check the [training script](examples/train/embedding/train_gme.sh).
- 🎁 2025.03.13: We provide a GRPO script for training a 72B model with only 4 GPUs (4*80G); please check [here](examples/train/grpo/train_72b_4gpu.sh).
- 🎁 2025.03.05: We support the GRPO hybrid mode (rollout and actor on the same GPU, with rollout sleeping while the actor trains), as well as tensor parallelism for GRPO; check the [training script here](examples/train/grpo/multi_gpu_mp_colocate.sh).
1 change: 1 addition & 0 deletions README_CN.md
@@ -74,6 +74,7 @@
- **Model quantization**: Supports quantized export with AWQ, GPTQ, and BNB; the exported models support accelerated inference with vLLM/LmDeploy and can be trained further.

## 🎉 News
- 🎁 2025.03.16: SWIFT supports training with Megatron parallelism techniques; see the [Megatron-SWIFT Training Documentation](https://swift.readthedocs.io/zh-cn/latest/Instruction/Megatron-SWIFT训练文档.html).
- 🎁 2025.03.15: SWIFT supports fine-tuning of gme (multi-modal) embedding models; see the [training script](examples/train/embedding/train_gme.sh).
- 🎁 2025.03.13: We provide a script for training a 72B model with only 4 GPUs (4*80G); see [here](examples/train/grpo/train_72b_4gpu.sh).
- 🎁 2025.03.05: Support for the GRPO hybrid mode (rollout and actor on the same GPU, with rollout able to be offloaded), as well as vLLM tensor parallelism; see the [training script](examples/train/grpo/multi_gpu_mp_colocate.sh).
222 changes: 222 additions & 0 deletions docs/source/Instruction/Megatron-SWIFT训练.md
@@ -0,0 +1,222 @@

# Megatron-SWIFT Training

## Environment Setup
To use Megatron-SWIFT, in addition to installing the swift dependencies, you also need to install the following:

```shell
pip install pybind11
# transformer_engine
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

# apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```

The dependency Megatron-LM will be git-cloned and installed by swift automatically; users do not need to install it manually. Alternatively, you can set the environment variable `MEGATRON_LM_PATH` to point to an already-downloaded repo (for offline environments).
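For example, in an offline or air-gapped environment you can pre-download Megatron-LM and point swift at it. This is only an illustrative sketch; the clone location is a placeholder, and the branch or commit may need to match what swift would otherwise clone:

```shell
# Assumption: Megatron-LM was cloned beforehand on a machine with network access;
# check that the checkout matches the version swift expects.
git clone https://github.com/NVIDIA/Megatron-LM.git /path/to/Megatron-LM
export MEGATRON_LM_PATH=/path/to/Megatron-LM
```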


## Quick Start Example

This section presents a quick-start example of self-cognition fine-tuning of Qwen2.5-7B-Instruct on two 80GiB A100 GPUs; the following best practice can be completed within 10 minutes.

First, convert the HF-format weights to Megatron format:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
--model Qwen/Qwen2.5-7B-Instruct \
--to_mcore true \
--torch_dtype bfloat16 \
--test_convert_precision true \
--output_dir Qwen2.5-7B-Instruct-mcore
```

Then, train with the following script; training requires 2*80GiB of GPU memory:
```shell
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--load Qwen2.5-7B-Instruct-mcore \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
'AI-ModelScope/alpaca-gpt4-data-en#500' \
'swift/self-cognition#500' \
--tensor_model_parallel_size 2 \
--micro_batch_size 4 \
--global_batch_size 16 \
--recompute_granularity selective \
--train_iters 100 \
--eval_iters 5 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_iters 10 \
--min_lr 1e-6 \
--save megatron_output/Qwen2.5-7B-Instruct \
--save_interval 100 \
--max_length 2048 \
--system 'You are a helpful assistant.' \
--num_workers 4 \
--no_save_optim true \
--no_save_rng true \
--dataset_num_proc 4 \
--model_author swift \
--model_name swift-robot
```

Finally, convert the Megatron-format weights back to HF format:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
--mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
--to_hf true \
--torch_dtype bfloat16 \
--test_convert_precision true \
--output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf
```

Run inference on the resulting HF-format weights:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
--stream true \
--temperature 0 \
--max_new_tokens 2048
```

The inference result is as follows:
```
<<< who are you?
I am a language model developed by swift, you can call me swift-robot. How can I assist you?
```

- More examples can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron).


## Command Line Arguments

### Megatron Arguments


**Training arguments**:
- 🔥micro_batch_size: Batch size per device. Default is 1.
- 🔥global_batch_size: Total batch size, equal to `micro_batch_size * data_parallel_size * gradient_accumulation_steps`. Default is 16. (A worked example follows this list.)
- 🔥recompute_granularity: Granularity of activation recomputation; options are 'full' and 'selective'. 'full' recomputes the entire transformer layer, while 'selective' recomputes only the core attention part of the transformer layer. 'selective' is usually recommended. Default is 'selective'.
- recompute_method: Takes effect only when recompute_granularity is set to 'full'; options are 'uniform' and 'block'. Default is None.
- recompute_num_layers: Takes effect only when recompute_granularity is set to 'full'. Default is None. If `recompute_method` is set to 'uniform', this is the number of transformer layers in each uniformly divided recomputation unit, e.g. `--recompute_granularity full --recompute_method uniform --recompute_num_layers 4`. The larger recompute_num_layers is, the lower the memory usage and the higher the compute cost.
- deterministic_mode: Deterministic mode; this slows down training. Default is False.
- 🔥train_iters: Total number of training iterations. Default is None.
- 🔥log_interval: Logging interval (unit: iters). Default is 5.
- tensorboard_dir: Directory where TensorBoard logs are written. Default is None, i.e. logs are stored under `f'{save}/runs'`.
- no_masked_softmax_fusion: Default is False. Disables the fusion of query_key_value scaling, masking, and softmax.
- no_bias_dropout_fusion: Default is False. Disables bias and dropout fusion.
- no_bias_swiglu_fusion: Default is False. Specify `--no_bias_swiglu_fusion true` to disable bias and swiglu fusion.
- no_rope_fusion: Default is False. Specify `--no_rope_fusion true` to disable rope fusion.
- no_gradient_accumulation_fusion: Default is False. Specify `--no_gradient_accumulation_fusion true` to disable gradient accumulation fusion.
- 🔥cross_entropy_loss_fusion: Enable cross-entropy loss fusion. Default is False.
- 🔥use_flash_attn: Use the FlashAttention implementation. Default is False.
- 🔥optimizer: Optimizer type; options are 'adam' and 'sgd'. Default is adam.
- dataloader_type: Default is 'cyclic'; options are 'single', 'cyclic', 'external'.
- manual_gc: Disable the default garbage collector and trigger garbage collection manually. Default is False.
- manual_gc_interval: Interval at which garbage collection is triggered. Default is 0.
- seed: Random seed for python, numpy, pytorch, and cuda. Default is 42.
- 🔥num_workers: Number of dataloader workers. Default is 4.
- seq_length: Maximum sequence length to process. Default is None, i.e. set to `max_position_embeddings`. Megatron-SWIFT pads batches dynamically during training, so this argument usually does not need to be changed. To limit the dataset length, use `--max_length` from the basic arguments instead.
- use_cpu_initialization: Initialize weights on the CPU. Default is False. Used when converting weights between HF and MCore.
- no_create_attention_mask_in_dataloader: Do not create the attention mask in the dataloader. Default is True.
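As a quick sanity check of the batch-size relationship above, here is a small worked example using the quick-start settings (the numbers are purely illustrative):

```shell
# 2 GPUs with tensor_model_parallel_size=2 -> data_parallel_size = 2 / 2 = 1
micro_batch_size=4
data_parallel_size=1
global_batch_size=16
# gradient accumulation steps per iteration:
echo $(( global_batch_size / (micro_batch_size * data_parallel_size) ))   # prints 4
```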


**Learning rate arguments**:
- 🔥lr: Initial learning rate; the learning rate at each iteration is ultimately determined by the warmup and decay strategies. Default is 1e-5.
- lr_decay_style: Learning rate decay strategy. Default is 'cosine'. Typically set to 'cosine', 'linear', or 'constant'.
- 🔥lr_decay_iters: Number of iterations over which the learning rate decays. Default is None, i.e. set to `--train_iters`.
- 🔥lr_warmup_iters: Number of iterations of linear learning-rate warmup. Default is 0.
- 🔥min_lr: Minimum learning rate; any learning rate below this threshold is clipped to this value. Default is 0.

**Regularization arguments**:
- 🔥weight_decay: Default is 0.1.
- 🔥clip_grad: L2 gradient clipping. Default is 1.0.
- adam_beta1: Default is 0.9.
- adam_beta2: Default is 0.95.
- adam_eps: Default is 1e-8.
- sgd_momentum: Default is 0.9.

**Checkpoint arguments**:
- 🔥save: Output directory for checkpoints. Default is None. During training, if this is not set, it defaults to `f'megatron_output/{model_suffix}'`, e.g. `'megatron_output/Qwen2.5-7B-Instruct'`.
- 🔥save_interval: Checkpoint saving interval (steps). Default is 500.
  - Note: weights are always saved at the end of training.
- 🔥no_save_optim: Do not save the optimizer state. Default is False.
- 🔥no_save_rng: Do not save the RNG state. Default is False.
- 🔥load: Checkpoint directory to load. Default is None.
- 🔥no_load_optim: Do not load the optimizer state. Default is False.
- 🔥no_load_rng: Do not load the RNG state. Default is False.
- 🔥finetune: Load the model and fine-tune it. The optimizer and random-seed state of the checkpoint are not loaded, and the iteration count is reset to 0. Default is False. (A resume sketch follows this list.)
- ckpt_format: Checkpoint format. Options are 'torch', 'torch_dist', 'zarr'. Default is 'torch_dist'.
- no_initialization: Do not initialize the weights. Default is True.
- auto_detect_ckpt_format: Automatically detect whether the checkpoint format is legacy or distributed. Default is True.
- exit_on_missing_checkpoint: If `--load` is set but no checkpoint is found, exit directly instead of initializing from scratch. Default is True.
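To illustrate how `--load` and `--finetune` interact, below is a hedged sketch of resuming a previous run rather than starting a fresh fine-tune. The `vx-xxx` directory is a placeholder for the actual versioned checkpoint path, and the dataset and parallelism flags simply mirror the quick-start example:

```shell
# With --finetune false, the optimizer/RNG state and the iteration count are restored from the checkpoint.
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
    --load megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
    --dataset 'swift/self-cognition#500' \
    --tensor_model_parallel_size 2 \
    --micro_batch_size 4 \
    --global_batch_size 16 \
    --train_iters 200 \
    --finetune false \
    --save megatron_output/Qwen2.5-7B-Instruct
```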

**Distributed arguments**:
- distributed_backend: Distributed backend; options are 'nccl', 'gloo'. Default is nccl.
- 🔥use_distributed_optimizer: Use the distributed optimizer. Default is True.
- 🔥tensor_model_parallel_size: TP size. Default is 1.
- 🔥pipeline_model_parallel_size: PP size. Default is 1. (A layout sketch follows this list.)
- 🔥sequence_parallel: Enable sequence-parallel optimization. Default is False.
- 🔥context_parallel_size: CP size. Default is 1.
- tp_comm_overlap: Overlap tensor-parallel communication with GEMM (general matrix multiplication) kernels (reduces communication time). Default is False.
- overlap_grad_reduce: Overlap grad reduce operations in DDP (reduces DP communication time). Default is False.
- overlap_param_gather: Overlap the parameter all-gather in the distributed optimizer (reduces DP communication time). Default is False.
- distributed_timeout_minutes: Timeout for torch.distributed (in minutes). Default is 60.
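To make the interplay of these sizes concrete, here is a hedged sketch of one possible 8-GPU layout (the numbers are assumptions, not a tested configuration): TP=2 and PP=2 form a 4-way model-parallel group, leaving a data-parallel size of 8 / (2*2) = 2.

```shell
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --train_iters 100 \
    --save megatron_output/Qwen2.5-7B-Instruct
```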

**Logging arguments**:
- log_params_norm: Log the norm of the parameters. Default is True.
- log_throughput: Log the throughput per GPU. Default is True.
- tensorboard_log_interval: Interval (steps) for logging to TensorBoard. Default is 1.
- tensorboard_queue_size: Queue size (related to disk I/O), similar to the write interval. Default is 50.
- log_timers_to_tensorboard: Log timers to TensorBoard. Default is True.
- no_log_learning_rate_to_tensorboard: Do not log the learning rate to TensorBoard. Default is False.
- log_validation_ppl_to_tensorboard: Write the validation perplexity to TensorBoard. Default is True.
- log_memory_to_tensorboard: Write memory logs to TensorBoard. Default is True.
- logging_level: Logging level. Default is None.

**Evaluation arguments**:
- 🔥eval_iters: Number of evaluation iterations. Default is 100.
- 🔥eval_interval: Evaluation interval (steps). Default is None, i.e. set to save_interval.

**Mixed precision arguments**:
- fp16: fp16 mode. Default is False. Set according to the model's torch_dtype.
- bf16: bf16 mode. Default is False. Set according to the model's torch_dtype.
- apply_query_key_layer_scaling: Scale `Q * K^T` by `1 / layer_number` (e.g. divide by layer_num for the layer_num-th layer). This is helpful for fp16 training. Default is None, i.e. set to True if `--fp16` is used.
- attention_softmax_in_fp32: Use fp32 when computing the attention mask and softmax. Default is True.

**Model arguments**: (The following arguments usually do not need to be set; they are configured from the HF model's config.json, so users generally need not worry about them.)
- num_layers: Number of transformer layers. Default is None.
- hidden_size: Transformer hidden size. Default is None.
- ffn_hidden_size: Hidden size of the transformer FFN layer. Default is None, i.e. set to `4*hidden_size`.
- num_attention_heads: Number of transformer attention heads. Default is None.
- group_query_attention: Default is None. Set to True if `num_query_groups > 1`, otherwise False.
- num_query_groups: Default is 1.
- max_position_embeddings: Maximum length of positional embeddings. Default is None.
- position_embedding_type: Type of positional embedding; options are 'learned_absolute', 'rope', 'relative', and 'none'. Default is 'rope'.
- rotary_base: Default is 10000.
- rotary_percent: Default is 1.0.
- rotary_seq_len_interpolation_factor: Sequence-length interpolation factor for rotary embeddings. Default is None.
- normalization: Options are 'LayerNorm', 'RMSNorm'. Default is RMSNorm.
- norm_epsilon: Default is 1e-5.
- swiglu: Use swiglu instead of the default gelu. Default is True.
- untie_embeddings_and_output_weights: Untie the embedding and output weights. Default is True.
- disable_bias_linear: Disable bias in linear layers. Default is True.
- add_qkv_bias: Add bias only to the QKV linear layers. Default is True.
- attention_dropout: Default is 0.
- hidden_dropout: Default is 0.
- transformer_impl: Which transformer implementation to use; options are 'local' and 'transformer_engine'. Default is transformer_engine.
- padded_vocab_size: Full vocabulary size. Default is None.

### Megatron Training Arguments

Megatron training arguments inherit from the Megatron arguments and the basic arguments. For the basic arguments, see [here](./命令行参数.md#基本参数). The following arguments are also included:

- add_version: Append a `'<version>-<timestamp>'` subdirectory to `save` to prevent weights from being overwritten. Default is True.
- 🔥lazy_tokenize: Default is False. If set to False, all dataset samples are tokenized before training (which avoids errors appearing mid-training); if set to True, the dataset is tokenized on the fly during training (which saves memory).
10 changes: 9 additions & 1 deletion docs/source/Instruction/命令行参数.md
@@ -1,6 +1,6 @@
# Command Line Arguments

The introduction to command line arguments is divided into basic arguments, atomic arguments, integrated arguments, and specific model arguments. The final list of arguments used on the command line is the integrated arguments, which inherit from the basic arguments and some atomic arguments. Specific model arguments are designed for specific models and can be set via `--model_kwargs` or environment variables. The Megatron-SWIFT command-line arguments are described in the [Megatron-SWIFT Training Documentation](./Megatron-SWIFT训练.md).

Hints:
- To pass a list on the command line, separate items with spaces, e.g. `--dataset <dataset_path1> <dataset_path2>`.
@@ -142,6 +142,9 @@
- 🔥ddp_backend: Default is None; options are "nccl", "gloo", "mpi", "ccl", "hccl", "cncl", "mccl".
- 🔥ddp_find_unused_parameters: Default is None.
- 🔥dataloader_num_workers: Default is 0.
- dataloader_pin_memory: Default is True.
- dataloader_persistent_workers: Default is False.
- dataloader_prefetch_factor: Default is 2.
- 🔥neftune_noise_alpha: Noise coefficient added by NEFTune. Default is 0; typically set to 5, 10, or 15.
- average_tokens_across_devices: Whether to average the number of tokens across devices. If set to True, `num_tokens_in_batch` is synchronized via all_reduce for accurate loss computation. Default is False.
- max_grad_norm: Gradient clipping. Default is 1.
@@ -485,6 +488,11 @@ App arguments inherit from [deployment arguments](#部署参数) and [Web-UI arguments](#Web-UI参数)
- max_length: Max length of the calibration set. Default is 2048.
- quant_batch_size: Quantization batch size. Default is 1.
- group_size: Group size for quantization. Default is 128.
- to_ollama: Generate the Modelfile required by Ollama. Default is False.
- 🔥to_mcore: Convert HF-format weights to Megatron format. Default is False.
- to_hf: Convert Megatron-format weights to HF format. Default is False.
- mcore_model: Path to the mcore-format model. Default is None.
- test_convert_precision: Test the precision error of converting weights between HF and Megatron formats. Default is False.
- 🔥push_to_hub: Whether to push to the hub. Default is False. See the example [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).
- hub_model_id: Model ID for pushing. Default is None.
- hub_private_repo: Whether the repo is private. Default is False.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -21,6 +21,7 @@ Swift DOCUMENTATION
Instruction/预训练与微调.md
Instruction/人类对齐.md
Instruction/推理和部署.md
Instruction/Megatron-SWIFT训练.md
Instruction/采样.md
Instruction/评测.md
Instruction/导出与推送.md
10 changes: 9 additions & 1 deletion docs/source_en/Instruction/Command-line-parameters.md
@@ -1,6 +1,6 @@
# Command Line Parameters

The introduction to command line parameters covers base arguments, atomic arguments, integrated arguments, and specific model arguments. The final list of arguments used on the command line is the integrated arguments, which inherit from the base arguments and some atomic arguments. Specific model arguments are designed for specific models and can be set using `--model_kwargs` or environment variables. The Megatron-SWIFT command-line arguments are introduced in the [Megatron-SWIFT Training Documentation](./Megatron-SWIFT-Training.md).

Hints:

@@ -145,6 +145,9 @@ Other important parameters:
- 🔥ddp_backend: Default is None, options include "nccl", "gloo", "mpi", "ccl", "hccl", "cncl", "mccl".
- 🔥ddp_find_unused_parameters: Default is None.
- 🔥dataloader_num_workers: Default is 0.
- dataloader_pin_memory: Default is True.
- dataloader_persistent_workers: Default is False.
- dataloader_prefetch_factor: Default is 2.
- 🔥neftune_noise_alpha: Coefficient of noise added by neftune, default is 0. Usually can be set to 5, 10, 15.
- average_tokens_across_devices: Whether to average the number of tokens across devices. If set to True, `num_tokens_in_batch` will be synchronized using all_reduce for accurate loss calculation. Default is False.
- max_grad_norm: Gradient clipping. Default is 1.
@@ -497,6 +500,11 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge argum
- max_length: Max length for the calibration set, default value is 2048.
- quant_batch_size: Quantization batch size, default is 1.
- group_size: Group size for quantization, default is 128.
- to_ollama: Generate the Modelfile required by Ollama. Default is False.
- 🔥to_mcore: Convert weights from HF format to Megatron format. Default is False.
- to_hf: Convert weights from Megatron format to HF format. Default is False.
- mcore_model: Path to the mcore format model. Default is None.
- test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Default is False.
- 🔥push_to_hub: Whether to push to the hub, with the default being False. Examples can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).
- hub_model_id: Model ID for pushing, default is None.
- hub_private_repo: Whether it is a private repo, default is False.