support megatron #2885

Merged: 105 commits, Mar 17, 2025
Commits
d875242
support megatron
Jintao-Huang Jan 8, 2025
13e4a65
update
Jintao-Huang Jan 9, 2025
f230e01
update
Jintao-Huang Jan 9, 2025
9b92ae0
update
Jintao-Huang Jan 9, 2025
e02c519
update
Jintao-Huang Jan 9, 2025
b9b85e5
update
Jintao-Huang Jan 9, 2025
65fcd63
update
Jintao-Huang Jan 9, 2025
bd7547c
update
Jintao-Huang Jan 9, 2025
83dc334
update
Jintao-Huang Jan 9, 2025
9a8c458
update
Jintao-Huang Jan 9, 2025
836fbcf
update
Jintao-Huang Jan 9, 2025
a8e25a2
Merge branch 'main' into support_megatron_0108
Jintao-Huang Jan 9, 2025
2f71dbc
Merge branch 'main' into support_megatron_0108
Jintao-Huang Feb 26, 2025
f19ecee
lint pass
Jintao-Huang Feb 26, 2025
17f4f43
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 3, 2025
f8460d5
update
Jintao-Huang Mar 3, 2025
d5788bb
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 3, 2025
0bbdb12
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 4, 2025
2682285
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 5, 2025
04910f3
update
Jintao-Huang Mar 5, 2025
8514b8a
update
Jintao-Huang Mar 5, 2025
bdd1692
fix
Jintao-Huang Mar 5, 2025
6a0aa00
update
Jintao-Huang Mar 5, 2025
57aa7b3
update
Jintao-Huang Mar 5, 2025
83503d6
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 5, 2025
0260e35
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 5, 2025
745bead
fix
Jintao-Huang Mar 5, 2025
ff51fff
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 6, 2025
9c341f8
update
Jintao-Huang Mar 6, 2025
83d3bbc
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 6, 2025
2c52610
update
Jintao-Huang Mar 6, 2025
34cfa0c
update
Jintao-Huang Mar 6, 2025
2b18fa2
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 7, 2025
c5d110c
update
Jintao-Huang Mar 7, 2025
68ada6b
update
Jintao-Huang Mar 7, 2025
2388371
update
Jintao-Huang Mar 8, 2025
698d856
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 8, 2025
a3c6e59
update
Jintao-Huang Mar 9, 2025
ecc6367
lint pass
Jintao-Huang Mar 9, 2025
d30a001
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 9, 2025
12818a9
update
Jintao-Huang Mar 9, 2025
924dd97
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
4e7071f
update
Jintao-Huang Mar 9, 2025
9a8a2b5
update
Jintao-Huang Mar 9, 2025
5b6e2c1
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
e7a5f19
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
e8e548e
update
Jintao-Huang Mar 9, 2025
5a24b28
update
Jintao-Huang Mar 9, 2025
85093d9
update
Jintao-Huang Mar 9, 2025
63637ca
update
Jintao-Huang Mar 9, 2025
abcd6c6
update
Jintao-Huang Mar 9, 2025
e6a2e78
update
Jintao-Huang Mar 9, 2025
5c66711
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
277a7b1
fix
Jintao-Huang Mar 9, 2025
9f089c3
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 9, 2025
8b82a5f
update
Jintao-Huang Mar 9, 2025
170c6ad
update
Jintao-Huang Mar 9, 2025
43c9e15
update
Jintao-Huang Mar 9, 2025
707b123
update
Jintao-Huang Mar 10, 2025
d0aba11
update
Jintao-Huang Mar 10, 2025
1b86699
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 11, 2025
bbf59a4
update
Jintao-Huang Mar 11, 2025
57110e2
update
Jintao-Huang Mar 11, 2025
75b87e3
update
Jintao-Huang Mar 11, 2025
8d5c9a8
update
Jintao-Huang Mar 11, 2025
79549b3
Merge remote-tracking branch 'refs/remotes/origin/support_megatron_01…
Jintao-Huang Mar 11, 2025
210d3e1
update
Jintao-Huang Mar 11, 2025
20fc25f
update
Jintao-Huang Mar 11, 2025
55a6c28
update
Jintao-Huang Mar 11, 2025
f110818
update
Jintao-Huang Mar 11, 2025
9e235f8
update
Jintao-Huang Mar 11, 2025
da076ef
update
Jintao-Huang Mar 11, 2025
6d5574a
update
Jintao-Huang Mar 12, 2025
860e8c1
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 14, 2025
a9202dc
update
Jintao-Huang Mar 14, 2025
5534d7f
update
Jintao-Huang Mar 15, 2025
301982e
update
Jintao-Huang Mar 15, 2025
920d868
update
Jintao-Huang Mar 15, 2025
017ac34
update
Jintao-Huang Mar 15, 2025
b5a118f
fix
Jintao-Huang Mar 15, 2025
8c7ae0d
update
Jintao-Huang Mar 16, 2025
457238f
update
Jintao-Huang Mar 16, 2025
3d0b1f5
Merge branch 'main' into support_megatron_0108
Jintao-Huang Mar 16, 2025
9342633
update
Jintao-Huang Mar 16, 2025
8a843ea
update
Jintao-Huang Mar 16, 2025
14e1c29
update
Jintao-Huang Mar 16, 2025
fcdc178
update
Jintao-Huang Mar 16, 2025
2807335
update
Jintao-Huang Mar 16, 2025
cd8ed92
update
Jintao-Huang Mar 16, 2025
992249f
update
Jintao-Huang Mar 16, 2025
ee18776
update
Jintao-Huang Mar 16, 2025
a34b1f4
update
Jintao-Huang Mar 16, 2025
9d2bf7f
update
Jintao-Huang Mar 16, 2025
8bdbf66
update
Jintao-Huang Mar 16, 2025
7728d9d
update
Jintao-Huang Mar 16, 2025
2f90c98
update
Jintao-Huang Mar 16, 2025
7a9104f
update
Jintao-Huang Mar 16, 2025
6cd49eb
update
Jintao-Huang Mar 16, 2025
04fe85a
update
Jintao-Huang Mar 16, 2025
0be8cf9
update
Jintao-Huang Mar 16, 2025
bbac19e
update
Jintao-Huang Mar 16, 2025
e7cdec9
fix
Jintao-Huang Mar 16, 2025
a05abff
update
Jintao-Huang Mar 16, 2025
4a57e79
update
Jintao-Huang Mar 17, 2025
3325347
update
Jintao-Huang Mar 17, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -141,6 +141,7 @@ my_model/
result/
images
/custom/
megatron_output/

# Pytorch
*.pth
1 change: 1 addition & 0 deletions README.md
@@ -78,6 +78,7 @@ You can contact us and communicate with us by adding our group:


## 🎉 News
- 🎁 2025.03.16: SWIFT supports training with Megatron parallelism techniques. Please refer to the [Megatron-SWIFT Training Documentation](https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html).
- 🎁 2025.03.15: SWIFT supports fine-tuning of gme (multi-modal) embedding models; please check the [training script](examples/train/embedding/train_gme.sh).
- 🎁 2025.03.13: We provide a GRPO script for training a 72B model with only 4 GPUs (4*80G); please check [here](examples/train/grpo/train_72b_4gpu.sh).
- 🎁 2025.03.05: We support the GRPO hybrid mode (rollout and actor on the same GPU, with rollout sleeping while the actor trains), as well as tensor parallelism for GRPO; check the [training script here](examples/train/grpo/multi_gpu_mp_colocate.sh).
1 change: 1 addition & 0 deletions README_CN.md
@@ -74,6 +74,7 @@
- **Model quantization**: Supports quantized export with AWQ, GPTQ, and BNB; the exported models support accelerated inference with vLLM/LmDeploy and can be trained further.

## 🎉 News
- 🎁 2025.03.16: SWIFT supports training with Megatron parallelism techniques; see the [Megatron-SWIFT Training Documentation](https://swift.readthedocs.io/zh-cn/latest/Instruction/Megatron-SWIFT训练文档.html).
- 🎁 2025.03.15: SWIFT supports fine-tuning of gme (multi-modal) embedding models; see the [training script](examples/train/embedding/train_gme.sh).
- 🎁 2025.03.13: We provide a script for training a 72B model with only 4 GPUs (4*80G); see [here](examples/train/grpo/train_72b_4gpu.sh).
- 🎁 2025.03.05: Support for the GRPO hybrid mode (rollout and actor on the same GPU, with rollout able to be offloaded), as well as vLLM tensor parallelism; see the [training script](examples/train/grpo/multi_gpu_mp_colocate.sh).
222 changes: 222 additions & 0 deletions docs/source/Instruction/Megatron-SWIFT训练.md
@@ -0,0 +1,222 @@

# Megatron-SWIFT Training

## Environment Setup
To use Megatron-SWIFT, in addition to installing the swift dependencies, you also need to install the following:

```shell
pip install pybind11
# transformer_engine
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

# apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```

The dependency Megatron-LM will be git-cloned and installed by swift automatically; users do not need to install it manually. Alternatively, you can set the environment variable `MEGATRON_LM_PATH` to point to an already-downloaded repo (for offline environments).
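For example, in an offline or air-gapped environment you can pre-download Megatron-LM and point swift at it. This is only an illustrative sketch; the clone location is a placeholder, and the branch or commit may need to match what swift would otherwise clone:

```shell
# Assumption: Megatron-LM was cloned beforehand on a machine with network access;
# check that the checkout matches the version swift expects.
git clone https://github.com/NVIDIA/Megatron-LM.git /path/to/Megatron-LM
export MEGATRON_LM_PATH=/path/to/Megatron-LM
```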


## Quick Start Example

This section presents a quick-start example of self-cognition fine-tuning of Qwen2.5-7B-Instruct on two 80GiB A100 GPUs; the following best practice can be completed within 10 minutes.

First, convert the HF-format weights to Megatron format:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
--model Qwen/Qwen2.5-7B-Instruct \
--to_mcore true \
--torch_dtype bfloat16 \
--test_convert_precision true \
--output_dir Qwen2.5-7B-Instruct-mcore
```

Then, train with the following script; training requires 2*80GiB of GPU memory:
```shell
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--load Qwen2.5-7B-Instruct-mcore \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
'AI-ModelScope/alpaca-gpt4-data-en#500' \
'swift/self-cognition#500' \
--tensor_model_parallel_size 2 \
--micro_batch_size 4 \
--global_batch_size 16 \
--recompute_granularity selective \
--train_iters 100 \
--eval_iters 5 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_iters 10 \
--min_lr 1e-6 \
--save megatron_output/Qwen2.5-7B-Instruct \
--save_interval 100 \
--max_length 2048 \
--system 'You are a helpful assistant.' \
--num_workers 4 \
--no_save_optim true \
--no_save_rng true \
--dataset_num_proc 4 \
--model_author swift \
--model_name swift-robot
```

Finally, convert the Megatron-format weights back to HF format:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
--mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
--to_hf true \
--torch_dtype bfloat16 \
--test_convert_precision true \
--output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf
```

Run inference on the resulting HF-format weights:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
--stream true \
--temperature 0 \
--max_new_tokens 2048
```

The inference result is as follows:
```
<<< who are you?
I am a language model developed by swift, you can call me swift-robot. How can I assist you?
```

- More examples can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron).


## Command Line Arguments

### Megatron Arguments


**Training arguments**:
- 🔥micro_batch_size: Batch size per device. Default is 1.
- 🔥global_batch_size: Total batch size, equal to `micro_batch_size * data_parallel_size * gradient_accumulation_steps`. Default is 16. (A worked example follows this list.)
- 🔥recompute_granularity: Granularity of activation recomputation; options are 'full' and 'selective'. 'full' recomputes the entire transformer layer, while 'selective' recomputes only the core attention part of the transformer layer. 'selective' is usually recommended. Default is 'selective'.
- recompute_method: Takes effect only when recompute_granularity is set to 'full'; options are 'uniform' and 'block'. Default is None.
- recompute_num_layers: Takes effect only when recompute_granularity is set to 'full'. Default is None. If `recompute_method` is set to 'uniform', this is the number of transformer layers in each uniformly divided recomputation unit, e.g. `--recompute_granularity full --recompute_method uniform --recompute_num_layers 4`. The larger recompute_num_layers is, the lower the memory usage and the higher the compute cost.
- deterministic_mode: Deterministic mode; this slows down training. Default is False.
- 🔥train_iters: Total number of training iterations. Default is None.
- 🔥log_interval: Logging interval (unit: iters). Default is 5.
- tensorboard_dir: Directory where TensorBoard logs are written. Default is None, i.e. logs are stored under `f'{save}/runs'`.
- no_masked_softmax_fusion: Default is False. Disables the fusion of query_key_value scaling, masking, and softmax.
- no_bias_dropout_fusion: Default is False. Disables bias and dropout fusion.
- no_bias_swiglu_fusion: Default is False. Specify `--no_bias_swiglu_fusion true` to disable bias and swiglu fusion.
- no_rope_fusion: Default is False. Specify `--no_rope_fusion true` to disable rope fusion.
- no_gradient_accumulation_fusion: Default is False. Specify `--no_gradient_accumulation_fusion true` to disable gradient accumulation fusion.
- 🔥cross_entropy_loss_fusion: Enable cross-entropy loss fusion. Default is False.
- 🔥use_flash_attn: Use the FlashAttention implementation. Default is False.
- 🔥optimizer: Optimizer type; options are 'adam' and 'sgd'. Default is adam.
- dataloader_type: Default is 'cyclic'; options are 'single', 'cyclic', 'external'.
- manual_gc: Disable the default garbage collector and trigger garbage collection manually. Default is False.
- manual_gc_interval: Interval at which garbage collection is triggered. Default is 0.
- seed: Random seed for python, numpy, pytorch, and cuda. Default is 42.
- 🔥num_workers: Number of dataloader workers. Default is 4.
- seq_length: Maximum sequence length to process. Default is None, i.e. set to `max_position_embeddings`. Megatron-SWIFT pads batches dynamically during training, so this argument usually does not need to be changed. To limit the dataset length, use `--max_length` from the basic arguments instead.
- use_cpu_initialization: Initialize weights on the CPU. Default is False. Used when converting weights between HF and MCore.
- no_create_attention_mask_in_dataloader: Do not create the attention mask in the dataloader. Default is True.
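As a quick sanity check of the batch-size relationship above, here is a small worked example using the quick-start settings (the numbers are purely illustrative):

```shell
# 2 GPUs with tensor_model_parallel_size=2 -> data_parallel_size = 2 / 2 = 1
micro_batch_size=4
data_parallel_size=1
global_batch_size=16
# gradient accumulation steps per iteration:
echo $(( global_batch_size / (micro_batch_size * data_parallel_size) ))   # prints 4
```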


**Learning rate arguments**:
- 🔥lr: Initial learning rate; the learning rate at each iteration is ultimately determined by the warmup and decay strategies. Default is 1e-5.
- lr_decay_style: Learning rate decay strategy. Default is 'cosine'. Typically set to 'cosine', 'linear', or 'constant'.
- 🔥lr_decay_iters: Number of iterations over which the learning rate decays. Default is None, i.e. set to `--train_iters`.
- 🔥lr_warmup_iters: Number of iterations of linear learning-rate warmup. Default is 0.
- 🔥min_lr: Minimum learning rate; any learning rate below this threshold is clipped to this value. Default is 0.

**Regularization arguments**:
- 🔥weight_decay: Default is 0.1.
- 🔥clip_grad: L2 gradient clipping. Default is 1.0.
- adam_beta1: Default is 0.9.
- adam_beta2: Default is 0.95.
- adam_eps: Default is 1e-8.
- sgd_momentum: Default is 0.9.

**Checkpoint arguments**:
- 🔥save: Output directory for checkpoints. Default is None. During training, if this is not set, it defaults to `f'megatron_output/{model_suffix}'`, e.g. `'megatron_output/Qwen2.5-7B-Instruct'`.
- 🔥save_interval: Checkpoint saving interval (steps). Default is 500.
  - Note: weights are always saved at the end of training.
- 🔥no_save_optim: Do not save the optimizer state. Default is False.
- 🔥no_save_rng: Do not save the RNG state. Default is False.
- 🔥load: Checkpoint directory to load. Default is None.
- 🔥no_load_optim: Do not load the optimizer state. Default is False.
- 🔥no_load_rng: Do not load the RNG state. Default is False.
- 🔥finetune: Load the model and fine-tune it. The optimizer and random-seed state of the checkpoint are not loaded, and the iteration count is reset to 0. Default is False. (A resume sketch follows this list.)
- ckpt_format: Checkpoint format. Options are 'torch', 'torch_dist', 'zarr'. Default is 'torch_dist'.
- no_initialization: Do not initialize the weights. Default is True.
- auto_detect_ckpt_format: Automatically detect whether the checkpoint format is legacy or distributed. Default is True.
- exit_on_missing_checkpoint: If `--load` is set but no checkpoint is found, exit directly instead of initializing from scratch. Default is True.
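To illustrate how `--load` and `--finetune` interact, below is a hedged sketch of resuming a previous run rather than starting a fresh fine-tune. The `vx-xxx` directory is a placeholder for the actual versioned checkpoint path, and the dataset and parallelism flags simply mirror the quick-start example:

```shell
# With --finetune false, the optimizer/RNG state and the iteration count are restored from the checkpoint.
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
    --load megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
    --dataset 'swift/self-cognition#500' \
    --tensor_model_parallel_size 2 \
    --micro_batch_size 4 \
    --global_batch_size 16 \
    --train_iters 200 \
    --finetune false \
    --save megatron_output/Qwen2.5-7B-Instruct
```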

**Distributed arguments**:
- distributed_backend: Distributed backend; options are 'nccl', 'gloo'. Default is nccl.
- 🔥use_distributed_optimizer: Use the distributed optimizer. Default is True.
- 🔥tensor_model_parallel_size: TP size. Default is 1.
- 🔥pipeline_model_parallel_size: PP size. Default is 1. (A layout sketch follows this list.)
- 🔥sequence_parallel: Enable sequence-parallel optimization. Default is False.
- 🔥context_parallel_size: CP size. Default is 1.
- tp_comm_overlap: Overlap tensor-parallel communication with GEMM (general matrix multiplication) kernels (reduces communication time). Default is False.
- overlap_grad_reduce: Overlap grad reduce operations in DDP (reduces DP communication time). Default is False.
- overlap_param_gather: Overlap the parameter all-gather in the distributed optimizer (reduces DP communication time). Default is False.
- distributed_timeout_minutes: Timeout for torch.distributed (in minutes). Default is 60.
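To make the interplay of these sizes concrete, here is a hedged sketch of one possible 8-GPU layout (the numbers are assumptions, not a tested configuration): TP=2 and PP=2 form a 4-way model-parallel group, leaving a data-parallel size of 8 / (2*2) = 2.

```shell
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --train_iters 100 \
    --save megatron_output/Qwen2.5-7B-Instruct
```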

**Logging arguments**:
- log_params_norm: Log the norm of the parameters. Default is True.
- log_throughput: Log the throughput per GPU. Default is True.
- tensorboard_log_interval: Interval (steps) for logging to TensorBoard. Default is 1.
- tensorboard_queue_size: Queue size (related to disk I/O), similar to the write interval. Default is 50.
- log_timers_to_tensorboard: Log timers to TensorBoard. Default is True.
- no_log_learning_rate_to_tensorboard: Do not log the learning rate to TensorBoard. Default is False.
- log_validation_ppl_to_tensorboard: Write the validation perplexity to TensorBoard. Default is True.
- log_memory_to_tensorboard: Write memory logs to TensorBoard. Default is True.
- logging_level: Logging level. Default is None.

**Evaluation arguments**:
- 🔥eval_iters: Number of evaluation iterations. Default is 100.
- 🔥eval_interval: Evaluation interval (steps). Default is None, i.e. set to save_interval.

**Mixed precision arguments**:
- fp16: fp16 mode. Default is False. Set according to the model's torch_dtype.
- bf16: bf16 mode. Default is False. Set according to the model's torch_dtype.
- apply_query_key_layer_scaling: Scale `Q * K^T` by `1 / layer_number` (e.g. divide by layer_num for the layer_num-th layer). This is helpful for fp16 training. Default is None, i.e. set to True if `--fp16` is used.
- attention_softmax_in_fp32: Use fp32 when computing the attention mask and softmax. Default is True.

**Model arguments**: (The following arguments usually do not need to be set; they are configured from the HF model's config.json, so users generally need not worry about them.)
- num_layers: Number of transformer layers. Default is None.
- hidden_size: Transformer hidden size. Default is None.
- ffn_hidden_size: Hidden size of the transformer FFN layer. Default is None, i.e. set to `4*hidden_size`.
- num_attention_heads: Number of transformer attention heads. Default is None.
- group_query_attention: Default is None. Set to True if `num_query_groups > 1`, otherwise False.
- num_query_groups: Default is 1.
- max_position_embeddings: Maximum length of positional embeddings. Default is None.
- position_embedding_type: Type of positional embedding; options are 'learned_absolute', 'rope', 'relative', and 'none'. Default is 'rope'.
- rotary_base: Default is 10000.
- rotary_percent: Default is 1.0.
- rotary_seq_len_interpolation_factor: Sequence-length interpolation factor for rotary embeddings. Default is None.
- normalization: Options are 'LayerNorm', 'RMSNorm'. Default is RMSNorm.
- norm_epsilon: Default is 1e-5.
- swiglu: Use swiglu instead of the default gelu. Default is True.
- untie_embeddings_and_output_weights: Untie the embedding and output weights. Default is True.
- disable_bias_linear: Disable bias in linear layers. Default is True.
- add_qkv_bias: Add bias only to the QKV linear layers. Default is True.
- attention_dropout: Default is 0.
- hidden_dropout: Default is 0.
- transformer_impl: Which transformer implementation to use; options are 'local' and 'transformer_engine'. Default is transformer_engine.
- padded_vocab_size: Full vocabulary size. Default is None.

### Megatron Training Arguments

Megatron training arguments inherit from the Megatron arguments and the basic arguments. For the basic arguments, see [here](./命令行参数.md#基本参数). The following arguments are also included:

- add_version: Append a `'<version>-<timestamp>'` subdirectory to `save` to prevent weights from being overwritten. Default is True.
- 🔥lazy_tokenize: Default is False. If set to False, all dataset samples are tokenized before training (which avoids errors appearing mid-training); if set to True, the dataset is tokenized on the fly during training (which saves memory).
10 changes: 9 additions & 1 deletion docs/source/Instruction/命令行参数.md
@@ -1,6 +1,6 @@
# Command Line Arguments

The introduction to command line arguments is divided into basic arguments, atomic arguments, integrated arguments, and specific model arguments. The final list of arguments used on the command line is the integrated arguments, which inherit from the basic arguments and some atomic arguments. Specific model arguments are designed for specific models and can be set via `--model_kwargs` or environment variables. The Megatron-SWIFT command-line arguments are described in the [Megatron-SWIFT Training Documentation](./Megatron-SWIFT训练.md).

Hints:
- To pass a list on the command line, separate items with spaces, e.g. `--dataset <dataset_path1> <dataset_path2>`.
@@ -142,6 +142,9 @@
- 🔥ddp_backend: Default is None; options are "nccl", "gloo", "mpi", "ccl", "hccl", "cncl", "mccl".
- 🔥ddp_find_unused_parameters: Default is None.
- 🔥dataloader_num_workers: Default is 0.
- dataloader_pin_memory: Default is True.
- dataloader_persistent_workers: Default is False.
- dataloader_prefetch_factor: Default is 2.
- 🔥neftune_noise_alpha: Noise coefficient added by NEFTune. Default is 0; typically set to 5, 10, or 15.
- average_tokens_across_devices: Whether to average the number of tokens across devices. If set to True, `num_tokens_in_batch` is synchronized via all_reduce for accurate loss computation. Default is False.
- max_grad_norm: Gradient clipping. Default is 1.
@@ -485,6 +488,11 @@ App arguments inherit from [deployment arguments](#部署参数) and [Web-UI arguments](#Web-UI参数)
- max_length: Max length of the calibration set. Default is 2048.
- quant_batch_size: Quantization batch size. Default is 1.
- group_size: Group size for quantization. Default is 128.
- to_ollama: Generate the Modelfile required by Ollama. Default is False.
- 🔥to_mcore: Convert HF-format weights to Megatron format. Default is False.
- to_hf: Convert Megatron-format weights to HF format. Default is False.
- mcore_model: Path to the mcore-format model. Default is None.
- test_convert_precision: Test the precision error of converting weights between HF and Megatron formats. Default is False.
- 🔥push_to_hub: Whether to push to the hub. Default is False. See the example [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).
- hub_model_id: Model ID for pushing. Default is None.
- hub_private_repo: Whether the repo is private. Default is False.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -21,6 +21,7 @@ Swift DOCUMENTATION
Instruction/预训练与微调.md
Instruction/人类对齐.md
Instruction/推理和部署.md
Instruction/Megatron-SWIFT训练.md
Instruction/采样.md
Instruction/评测.md
Instruction/导出与推送.md
10 changes: 9 additions & 1 deletion docs/source_en/Instruction/Command-line-parameters.md
@@ -1,6 +1,6 @@
# Command Line Parameters

The introduction to command line parameters covers base arguments, atomic arguments, integrated arguments, and specific model arguments. The final list of arguments used on the command line is the integrated arguments, which inherit from the base arguments and some atomic arguments. Specific model arguments are designed for specific models and can be set using `--model_kwargs` or environment variables. The Megatron-SWIFT command-line arguments are introduced in the [Megatron-SWIFT Training Documentation](./Megatron-SWIFT-Training.md).

Hints:

@@ -145,6 +145,9 @@ Other important parameters:
- 🔥ddp_backend: Default is None, options include "nccl", "gloo", "mpi", "ccl", "hccl", "cncl", "mccl".
- 🔥ddp_find_unused_parameters: Default is None.
- 🔥dataloader_num_workers: Default is 0.
- dataloader_pin_memory: Default is True.
- dataloader_persistent_workers: Default is False.
- dataloader_prefetch_factor: Default is 2.
- 🔥neftune_noise_alpha: Coefficient of noise added by neftune, default is 0. Usually can be set to 5, 10, 15.
- average_tokens_across_devices: Whether to average the number of tokens across devices. If set to True, `num_tokens_in_batch` will be synchronized using all_reduce for accurate loss calculation. Default is False.
- max_grad_norm: Gradient clipping. Default is 1.
@@ -497,6 +500,11 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge argum
- max_length: Max length for the calibration set, default value is 2048.
- quant_batch_size: Quantization batch size, default is 1.
- group_size: Group size for quantization, default is 128.
- to_ollama: Generate the Modelfile required by Ollama. Default is False.
- 🔥to_mcore: Convert weights from HF format to Megatron format. Default is False.
- to_hf: Convert weights from Megatron format to HF format. Default is False.
- mcore_model: Path to the mcore format model. Default is None.
- test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Default is False.
- 🔥push_to_hub: Whether to push to the hub, with the default being False. Examples can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).
- hub_model_id: Model ID for pushing, default is None.
- hub_private_repo: Whether it is a private repo, default is False.