中文版
新特性
- Megatron-SWIFT
a. 支持更多模型架构:Qwen3-VL, Qwen3-Omni, Qwen3-Next, Kimi-VL, InternVL3.5-HF等。完整的模型支持情况,参考支持的模型文档:https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html
b. 支持KTO训练,包括全参数/LoRA/MoE/多模态/Packing等训练技术等支持。感谢招商银行技术团队@kevssim 的贡献。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto
c. 支持RM训练,包括全参数/LoRA/MoE/多模态/Packing等训练技术等支持。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm
d. 支持序列分类模型架构,包括三种任务:regression、single_label_classification、multi_label_classification。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/megatron/seq_cls
e. 支持VPP并行技术,减少PP并行的计算空泡,提高GPU利用率,但会略微提高通信量。支持异构PP并行 pipeline_model_parallel_layout,自定义流水线并行(PP/VPP)布局。
f. DPO等RLHF技术中的ref_model不初始化 main_grad,降低显存占用。
- 训练
a. 序列并行优化,ulysses 和 ring-attention 支持混合使用,实现更长的序列处理能力。支持纯文本和多模态模型的SFT/DPO/GRPO训练。训练脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/train/sequence_parallel/sequence_parallel.sh
b. 纯文本及多模态模型Embedding/Reranker/序列分类任务训练支持使用 padding_free 以节约显存资源并加速训练。
c. Embedding和Reranker训练数据集格式重构,具体参考文档:https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html, https://swift.readthedocs.io/en/latest/BestPractices/Reranker.html
d. Agent template支持更多模型:deepseek_v3_1, qwen3_coder。(感谢@gakkiri ,@ray075hl 的贡献)
e. load_from_cache_file 默认值从True改成False,避免因缓存原因导致的未知问题。
- RLHF
a. GRPO支持CHORD算法,在GRPO训练中混合SFT训练,参考文档:https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CHORD.html
b. KTO支持padding free和packing以节约显存资源并加速训练。
c. GRPO训练 padding_free重构,更好支持多模态模型。
d. GRPO vLLM 支持 PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" 环境变量,减小显存碎片。
- 推理
a. 支持Reranker任务的推理/部署 (pt/vllm),以及序列分类任务的推理部署(pt/vllm)。脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reranker, https://github.com/modelscope/ms-swift/tree/main/examples/deploy/seq_cls
新模型
- 纯文本模型
a. Qwen/Qwen3-Next-80B-A3B-Instruct系列,训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_next
b. ZhipuAI/GLM-4.6
c. inclusionAI/Ling-mini-2.0; inclusionAI/Ring-mini-2.0系列
d. iic/Tongyi-DeepResearch-30B-A3B
e. ByteDance-Seed/Seed-OSS-36B-Instruct系列(感谢@hpsun1109 的贡献)
f. deepseek-ai/DeepSeek-V3.1-Terminus
g. PaddlePaddle/ERNIE-4.5-21B-A3B-Thinking
h. google/embeddinggemma-300m(embedding模型)
- 多模态模型
a. Qwen/Qwen3-VL-30B-A3B-Instruct系列,训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
b. Qwen/Qwen3-Omni-30B-A3B-Instruct系列,训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_omni
c. Kwai-Keye/Keye-VL-1_5-8B(感谢@hellopahe 的贡献)
d. OpenGVLab/InternVL3_5-1B-HF系列
e. BytedanceDouyinContent/SAIL-VL2-2B系列
f. stepfun-ai/Step-Audio-2-mini(感谢@CJack812 的贡献)
English Version
New Features
- Megatron-SWIFT
a. More model architecture support: Qwen3-VL, Qwen3-Omni, Qwen3-Next, Kimi-VL, InternVL3.5-HF, etc. For a complete list of supported models, please refer to the Supported Models documentation: https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html
b. KTO training support, including full-parameter, LoRA, MoE, multimodal, and Packing training techniques. Special thanks to @kevssim from China Merchants Bank’s technical team for their contribution. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto
c. Reward Model training support, including full-parameter, LoRA, MoE, multimodal, and Packing training techniques. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm
d. Sequence classification model architecture support, covering three task types: regression, single_label_classification, and multi_label_classification. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/seq_cls
e. Support for VPP (Virtual Pipeline Parallelism): reduces pipeline bubbles in PP (Pipeline Parallelism), improving GPU utilization at the cost of slightly increased communication overhead. Supports heterogeneous PP via pipeline_model_parallel_layout for custom PP/VPP pipeline layouts.
f. In RLHF techniques such as DPO, the ref_model no longer initializes main_grad, reducing GPU memory consumption.
- Training
a. Sequence parallelism optimization: Ulysses and Ring Attention can now be used together, enabling processing of even longer sequences. Supports SFT/DPO/GRPO training for both text-only and multimodal models. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/sequence_parallel/sequence_parallel.sh
b. Padding-free training is now supported for embedding, reranker, and sequence classification tasks on both text-only and multimodal models, saving GPU memory and accelerating training.
c. Restructured dataset formats for embedding and reranker training. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html, https://swift.readthedocs.io/en/latest/BestPractices/Reranker.html
d. Agent templates support more models: deepseek_v3_1, qwen3_coder. (Thanks to contributions from @gakkiri and @ray075hl)
e. Default value of load_from_cache_file changed from True to False to avoid unexpected issues caused by caching.
- RLHF
a. GRPO now supports the CHORD algorithm, enabling mixed SFT training during GRPO. Documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CHORD.html
b. KTO supports padding-free and packing, reducing memory usage and accelerating training.
c. Padding-free implementation in GRPO has been refactored for better multimodal model support.
d. GRPO with vLLM now supports the environment variable PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" to reduce GPU memory fragmentation.
- Inference
a. Inference and deployment support for Reranker tasks (PyTorch/vLLM) and sequence classification tasks (PyTorch/vLLM). Example scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reranker, https://github.com/modelscope/ms-swift/tree/main/examples/deploy/seq_cls
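Several items above (KTO, RM, GRPO, and the embedding/reranker/sequence classification tasks) mention padding-free training and packing. As a rough illustration of the idea only, and not the ms-swift implementation, the sketch below concatenates variable-length sequences into a single row and records cumulative sequence offsets, the convention used by variable-length attention kernels. The function names pack_padding_free and padded_token_count are hypothetical.

```python
from itertools import accumulate


def pack_padding_free(sequences):
    """Concatenate variable-length token sequences into one row without padding.

    Hypothetical helper for illustration; not part of ms-swift.
    """
    # All tokens laid end to end; no pad tokens are inserted.
    input_ids = [tok for seq in sequences for tok in seq]
    # Position ids restart at 0 at the start of every original sequence.
    position_ids = [i for seq in sequences for i in range(len(seq))]
    # cu_seqlens[i] is the offset where sequence i starts;
    # the final entry is the total number of tokens.
    cu_seqlens = [0, *accumulate(len(seq) for seq in sequences)]
    return input_ids, position_ids, cu_seqlens


def padded_token_count(sequences):
    """Tokens processed if every sequence were padded to the longest one."""
    return len(sequences) * max(len(seq) for seq in sequences)


# Three toy sequences of lengths 4, 3, and 6 (token ids are arbitrary).
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]]
input_ids, position_ids, cu_seqlens = pack_padding_free(batch)

print(len(input_ids), "packed tokens vs", padded_token_count(batch), "padded tokens")
print("cu_seqlens:", cu_seqlens)
```

The attention kernel uses the offsets to keep each sequence from attending to its neighbors, so the savings in memory and compute come from skipping pad tokens rather than from changing the loss.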
New Models
- Text-only Models
a. Qwen/Qwen3-Next-80B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_next
b. ZhipuAI/GLM-4.6
c. inclusionAI/Ling-mini-2.0; inclusionAI/Ring-mini-2.0 series
d. iic/Tongyi-DeepResearch-30B-A3B
e. ByteDance-Seed/Seed-OSS-36B-Instruct series (Thanks to @hpsun1109 for the contribution)
f. deepseek-ai/DeepSeek-V3.1-Terminus
g. PaddlePaddle/ERNIE-4.5-21B-A3B-Thinking
h. google/embeddinggemma-300m (embedding model)
- Multimodal Models
a. Qwen/Qwen3-VL-30B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
b. Qwen/Qwen3-Omni-30B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_omni
c. Kwai-Keye/Keye-VL-1_5-8B (Thanks to @hellopahe for the contribution)
d. OpenGVLab/InternVL3_5-1B-HF series
e. BytedanceDouyinContent/SAIL-VL2-2B series
f. stepfun-ai/Step-Audio-2-mini (Thanks to @CJack812 for the contribution)
What's Changed
- Merge ulysses and ring-attention by @tastelikefeet in #5522
- [bugfix] fix text_position_ids by @Jintao-Huang in #5692
- [grpo] support CHORD algorithm by @hjh0119 in #5680
- [doc] update chord doc by @hjh0119 in #5701
- [bugfix]: use GCD to robustly configure sp and rp dimensions for any world_size by @0russwest0 in #5698
- [megatron] Fix SP & LoRA by @Jintao-Huang in #5704
- [megatron] Support ovis2.5 by @Jintao-Huang in #5719
- [template] update get_env_args & load_from_cache_file by @Jintao-Huang in #5730
- [bugfix] fix qwen3 swift pt by @Jintao-Huang in #5741
- fix sp grpo by @tastelikefeet in #5744
- Fix multiple input issue and more_params for web-ui by @slin000111 in #5739
- [bugfix] set default padding side to left for generative reranker by @0russwest0 in #5751
- [bugfix] correct multi-GPU reranker evaluation metric calculation by @0russwest0 in #5755
- wrap base_model into get_llm_model by @tastelikefeet in #5749
- [bugfix] fix forward_context by @Jintao-Huang in #5757
- [bugfix] update use_barrier -> True by @Jintao-Huang in #5763
- support Seed-OSS-36B-Instruct by @hpsun1109 in #5761
- [bugfix] fix megatron model_type by @Jintao-Huang in #5767
- Refactor grpo padding free by @tastelikefeet in #5769
- Update seed.py by @hpsun1109 in #5725
- [model] Support qwen3_next (transformers) by @Jintao-Huang in #5782
- [megatron] fix text_position_ids by @Jintao-Huang in #5783
- [model] support Step Audio2 mini by @CJack812 in #5731
- [bugfix] update query placeholder in TextCapsEmbPreprocessor by @0russwest0 in #5774
- [BREAKING] refactor embedding template by @0russwest0 in #5787
- [BREAKING] refactor reranker template by @0russwest0 in #5768
- [model] update step_audio_2_mini by @Jintao-Huang in #5790
- [model] Support qwen3Next (megatron) by @Jintao-Huang in #5764
- [bugfix] fix qwen2_5_vl device_map8 by @Jintao-Huang in #5800
- add qwen3 coder agent template by @ray075hl in #5734
- [agent_template] Update qwen3 coder agent template by @Jintao-Huang in #5802
- [bugfix] fix ovis2_5 by @Jintao-Huang in #5803
- Support ernie-thinking and gemma-emb by @tastelikefeet in #5792
- feat: Add DeepSeek V3.1 Agent Template Support by @gakkiri in #5777
- [agent-template] update deepseek v3.1 agent_template by @Jintao-Huang in #5816
- [bugfix] fix margin by @Jintao-Huang in #5817
- [bugfix] fix template extra_kwargs by @Jintao-Huang in #5821
- [grpo] fix log std_zero by @hjh0119 in #5813
- [bugfix] Fix aux loss & (gradient_accumulation_steps & loss_scale) by @Jintao-Huang in #5823
- Add support for Keye-VL-1_5-8B by @hellopahe in #5815
- update requirements by @Jintao-Huang in #5826
- [bugfix] fix SglangEngine by @Jintao-Huang in #5828
- [model] support ring2 ling2 by @Jintao-Huang in #5830
- [template] update mllm template & InternVL-HF by @hjh0119 in #5829
- [bugfix] fix Qwen3ForSequenceClassification zero3 by @hjh0119 in #5820
- fix megatron flash_attn (flash_attention_3) by @Jintao-Huang in #5837
- [bugfix] fix grpo mllm multi turn by @hjh0119 in #5840
- [image] update swift image by @Jintao-Huang in #5847
- [bugfix] fix keye_vl by @Jintao-Huang in #5848
- [bugfix] fix internvl3_hf by @Jintao-Huang in #5852
- [megatron] Support megatron internvl3-hf/internvl3.5-hf by @Jintao-Huang in #5853
- [model] support tongyi deepresearch by @Jintao-Huang in #5854
- [megatron] fix multimodal pp by @Jintao-Huang in #5857
- register qwen3_coder by @mgilmore-relace in #5855
- [bugfix] fix qwen3_next packing(OOM); fix cp by @Jintao-Huang in #5859
- [megatron] fix megatron multimodal pp by @Jintao-Huang in #5862
- [megatron] compat mcore 0.12 by @Jintao-Huang in #5867
- docs(Instruction): add mcore_adapters parameter to export arguments by @zzc0430 in #5870
- [bugfix] fix grpo pt_engine & padding_free by @Jintao-Huang in #5874
- [megatron] Support kimi vl megatron by @Jintao-Huang in #5872
- [bugfix] fix megatron multimodal modules_to_save by @Jintao-Huang in #5876
- Fix: Use DDP for PPO traning will cause AttributeError: 'DistributedDataParallel' object has no attribute 'config' error by @kiritoxkiriko in #5822
- Support reranker inference by @tastelikefeet in #5883
- Fix exception when converting JSON dict strings on Windows by @liulei08 in #5804
- fix embedding encode by @tastelikefeet in #5885
- [megatron] support megatron seq_cls task_type by @Jintao-Huang in #5759
- [bugfix] fix megatron seq_cls by @Jintao-Huang in #5888
- chord support loss_scale & update template loss_scale is_binary by @hjh0119 in #5886
- [docs] update rejected_tools by @Jintao-Huang in #5878
- [bugfix] Fix circular references by @Jintao-Huang in #5892
- [grpo] Support PYTORCH_CUDA_ALLOC_CONF environment variable by @hjh0119 in #5897
- fix evalscope config dump error by @Yunnglin in #5899
- update wechat by @tastelikefeet in #5903
- compat trl 0.15 by @Jintao-Huang in #5905
- Fix sp non-padding-free by @tastelikefeet in #5906
- [model] support qwen3_next fp8 by @Jintao-Huang in #5909
- fix embedding encode by @tastelikefeet in #5912
- [model] Support Qwen3-Omni (transformers & megatron) by @Jintao-Huang in #5900
- [bugfix] fix qwen3_omni audio packing by @Jintao-Huang in #5918
- [model] support Qwen3-VL (transformers/megatron) by @Jintao-Huang in #5805
- [bugfix] fix mcore to hf by @Jintao-Huang in #5929
- [bugfix] fix omni norm_bbox by @Jintao-Huang in #5930
- [bugfix] fix infer_backend lmdeploy by @Jintao-Huang in #5931
- support Sail-VL2 models by @hjh0119 in #5921
- [bugfix] fix qwen3_vl video test by @Jintao-Huang in #5932
- [chord] support dataset list by @hjh0119 in #5933
- [template] Support image list by @Jintao-Huang in #5954
- [megatron] Support qwen3-vl/qwen3-omni cp by @Jintao-Huang in #5952
- fix galore by @tastelikefeet in #5957
- [dataset] update load_from_cache_file by @Jintao-Huang in #5961
- [shell] update qwen3_omni shell by @Jintao-Huang in #5976
- [tests] add qwen2_5_vl batch_infer test by @Jintao-Huang in #5975
- [bugfix] fix grpo padding_free by @hjh0119 in #5965
- [bugfix] fix json_parse_to_dict by @Jintao-Huang in #5996
- Support emb/reranker/seq_cls padding_free by @tastelikefeet in #6007
- [grpo] fix gspo & rollout template register by @hjh0119 in #6014
- megatron swift support KTO by @kevssim in #5971
- [megatron] optimize dpo main_grad (GPU memory) by @Jintao-Huang in #6027
- [megatron] support vpp by @Jintao-Huang in #5997
- [model] support GLM4.6 by @Jintao-Huang in #6028
- update swift image by @Jintao-Huang in #6030
- [fix] swift eval parameter dataset_args is replaced by eval_dataset_args by @liulei08 in #5969
- [model] support DeepSeek-V3.1-Terminus by @Jintao-Huang in #6031
- [rlhf] kto support padding_free/packing by @Jintao-Huang in #6032
- [model] support Qwen/Qwen3-VL-30B-A3B-Instruct/Thinking by @Jintao-Huang in #6037
- compat vllm 0.11 by @Jintao-Huang in #6043
- [megatron] update megatron kto by @Jintao-Huang in #6036
- [bugfix] fix megatron rope_scaling by @Jintao-Huang in #6056
- compat sglang 0.5.3 by @Jintao-Huang in #6057
- [bugfix] fix reward_model by @Jintao-Huang in #6060
- [bugfix] fix qwen3_omni config by @Jintao-Huang in #6071
- Update FAQ by @slin000111 in #6077
- update docs by @Jintao-Huang in #6073
- Update link for sequence parallel example by @slin000111 in #6078
- compat qwen3_vl zero3 by @Jintao-Huang in #6080
- update z3_leaf_modules by @Jintao-Huang in #6082
- fix multi-modal padding_free for seq_cls by @tastelikefeet in #6087
- fix padding free for reranker by @0russwest0 in #6088
- fix the compute of accuracy for reranker by @0russwest0 in #6089
- [docs] update docs by @Jintao-Huang in #6090
- fix aux loss with ulysses by @tastelikefeet in #6098
- [megatron] support reward model by @Jintao-Huang in #6093
- Fix embedding padding_free by @tastelikefeet in #6100
- [bugfix] fix streaming by @Jintao-Huang in #6104
- [megatron] fix megatron-swift seq_cls by @Jintao-Huang in #6115
New Contributors
- @hpsun1109 made their first contribution in #5761
- @CJack812 made their first contribution in #5731
- @ray075hl made their first contribution in #5734
- @gakkiri made their first contribution in #5777
- @mgilmore-relace made their first contribution in #5855
- @zzc0430 made their first contribution in #5870
- @liulei08 made their first contribution in #5804
Full Changelog: v3.8.0...v3.9.0