Release v3.9.0 · modelscope/ms-swift

English Version

New Features

Megatron-SWIFT
a. More model architecture support: Qwen3-VL, Qwen3-Omni, Qwen3-Next, Kimi-VL, InternVL3.5-HF, etc. For a complete list of supported models, please refer to the Supported Models documentation: https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html
b. KTO training support, including full-parameter, LoRA, MoE, multimodal, and Packing training techniques. Special thanks to @kevssim from China Merchants Bank’s technical team for their contribution. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto
c. Reward Model training support, including full-parameter, LoRA, MoE, multimodal, and Packing training techniques. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm
d. Sequence classification model architecture support, covering three task types: regression, single_label_classification, and multi_label_classification. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/seq_cls
e. Support for VPP (Virtual Pipeline Parallelism): reduces pipeline bubbles in PP (Pipeline Parallelism), improving GPU utilization at the cost of slightly increased communication overhead. Supports heterogeneous PP via pipeline_model_parallel_layout for custom PP/VPP pipeline layouts.
f. In RLHF techniques such as DPO, the ref_model no longer initializes main_grad, reducing GPU memory consumption.
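To illustrate item e, the idealized bubble estimate commonly quoted for pipeline schedules can be sketched in a few lines. With p pipeline stages, m micro-batches, and v virtual stages per device, the ratio of bubble time to compute time is roughly (p - 1) / (v * m), ignoring communication cost. The function and parameter names below are illustrative only and are not ms-swift or Megatron arguments.

```python
def pipeline_bubble_fraction(pp_size: int, num_microbatches: int, vpp_size: int = 1) -> float:
    """Idealized ratio of pipeline bubble time to useful compute time.

    Uses the textbook estimate (p - 1) / (v * m), where p is the number of
    pipeline stages, m the number of micro-batches, and v the number of
    virtual stages per device (v = 1 means no VPP). Communication overhead
    is NOT modeled, so real gains from VPP will be smaller.
    """
    if min(pp_size, num_microbatches, vpp_size) < 1:
        raise ValueError("all sizes must be positive integers")
    return (pp_size - 1) / (vpp_size * num_microbatches)


if __name__ == "__main__":
    # Compare plain PP (v=1) against interleaved schedules (v=2, v=4).
    for v in (1, 2, 4):
        fraction = pipeline_bubble_fraction(pp_size=4, num_microbatches=8, vpp_size=v)
        print(f"vpp_size={v}: bubble fraction = {fraction:.4f}")
```

Under these assumptions, 4 stages with 8 micro-batches give a bubble fraction of 0.375 without VPP and 0.1875 with 2 virtual stages per device. The additional communication mentioned in item e is the price of that reduction and is not captured here.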
Training
a. Sequence parallelism optimization: Ulysses and Ring Attention can now be used together, enabling processing of even longer sequences. Supports SFT/DPO/GRPO training for both text-only and multimodal models. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/sequence_parallel/sequence_parallel.sh
b. Padding-free training is now supported for embedding, reranker, and sequence classification tasks on both text-only and multimodal models, saving GPU memory and accelerating training.
c. Restructured dataset formats for embedding and reranker training. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html, https://swift.readthedocs.io/en/latest/BestPractices/Reranker.html
d. Agent templates support more models: deepseek_v3_1, qwen3_coder. (Thanks to contributions from @gakkiri and @ray075hl)
e. Default value of load_from_cache_file changed from True to False to avoid unexpected issues caused by caching.
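For item a, one plausible way to combine the two schemes is sketched below. Ulysses redistributes attention heads across ranks, so its degree has to divide the number of heads, while Ring Attention shards along the sequence dimension and has no such constraint. The sketch assumes the Ulysses degree is chosen as the greatest common divisor of the sequence-parallel size and the head count, with Ring Attention covering the remainder. This is an assumption for illustration, not ms-swift's actual logic; the function name is hypothetical, and grouped-query attention (fewer key/value heads) is not handled.

```python
from math import gcd
from typing import Tuple


def split_sequence_parallel(sp_size: int, num_heads: int) -> Tuple[int, int]:
    """Split a sequence-parallel group into (ulysses_size, ring_size).

    Hypothetical helper: picks the largest Ulysses degree that divides both
    the sequence-parallel size and the number of attention heads, and assigns
    the remaining factor to ring attention, so that
    ulysses_size * ring_size == sp_size always holds.
    """
    if sp_size < 1 or num_heads < 1:
        raise ValueError("sp_size and num_heads must be positive integers")
    ulysses_size = gcd(sp_size, num_heads)
    ring_size = sp_size // ulysses_size
    return ulysses_size, ring_size


if __name__ == "__main__":
    for sp, heads in [(8, 32), (8, 4), (6, 4)]:
        u, r = split_sequence_parallel(sp, heads)
        print(f"sp_size={sp}, num_heads={heads} -> ulysses={u}, ring={r}")
```

With 32 heads, a sequence-parallel size of 8 can be served by Ulysses alone; with only 4 heads, the same size needs a ring dimension of 2 on top of a Ulysses dimension of 4.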
RLHF
a. GRPO now supports the CHORD algorithm, enabling mixed SFT training during GRPO. Documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CHORD.html
b. KTO supports padding-free and packing, reducing memory usage and accelerating training.
c. Padding-free implementation in GRPO has been refactored for better multimodal model support.
d. GRPO with vLLM now supports the environment variable PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" to reduce GPU memory fragmentation.
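The padding-free and packing support mentioned in items b and c rests on a simple idea: instead of padding every sequence in a batch to the longest one, the sequences are concatenated into a single row, and per-sequence boundaries are tracked separately so that attention does not cross them. The following is a minimal sketch of that collation step in plain Python. The function name and the returned keys are illustrative assumptions, not ms-swift's collator or its actual field names.

```python
from typing import Dict, List, Sequence


def pack_padding_free(sequences: Sequence[Sequence[int]]) -> Dict[str, List[int]]:
    """Concatenate token sequences without padding (illustrative sketch).

    Returns:
      input_ids:    all tokens laid end to end
      position_ids: positions restarting at 0 for each sequence
      cu_seqlens:   cumulative sequence lengths, i.e. the boundaries that a
                    variable-length attention kernel would use to keep
                    sequences from attending to each other
    """
    input_ids: List[int] = []
    position_ids: List[int] = []
    cu_seqlens: List[int] = [0]
    for seq in sequences:
        if not seq:
            continue  # skip empty sequences; they contribute no tokens
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return {"input_ids": input_ids, "position_ids": position_ids, "cu_seqlens": cu_seqlens}


if __name__ == "__main__":
    packed = pack_padding_free([[5, 6, 7], [8, 9]])
    for key, value in packed.items():
        print(key, value)
```

For the two sequences above, a padded batch would occupy 2 x 3 = 6 token slots, while the packed form uses 5. The saving grows with the spread of sequence lengths in a batch.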
Inference
a. Inference and deployment support for Reranker tasks (PyTorch/vLLM) and sequence classification tasks (PyTorch/vLLM). Example scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reranker, https://github.com/modelscope/ms-swift/tree/main/examples/deploy/seq_cls

New Models

Text-only Models
a. Qwen/Qwen3-Next-80B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_next
b. ZhipuAI/GLM-4.6
c. inclusionAI/Ling-mini-2.0; inclusionAI/Ring-mini-2.0 series
d. iic/Tongyi-DeepResearch-30B-A3B
e. ByteDance-Seed/Seed-OSS-36B-Instruct series (Thanks to @hpsun1109 for the contribution)
f. deepseek-ai/DeepSeek-V3.1-Terminus
g. PaddlePaddle/ERNIE-4.5-21B-A3B-Thinking
h. google/embeddinggemma-300m (embedding model)
Multimodal Models
a. Qwen/Qwen3-VL-30B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
b. Qwen/Qwen3-Omni-30B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_omni
c. Kwai-Keye/Keye-VL-1_5-8B (Thanks to @hellopahe for the contribution)
d. OpenGVLab/InternVL3_5-1B-HF series
e. BytedanceDouyinContent/SAIL-VL2-2B series
f. stepfun-ai/Step-Audio-2-mini (Thanks to @CJack812 for the contribution)

What's Changed

Merge ulysses and ring-attention by @tastelikefeet in #5522
[bugfix] fix text_position_ids by @Jintao-Huang in #5692
[grpo] support CHORD algorithm by @hjh0119 in #5680
[doc] update chord doc by @hjh0119 in #5701
[bugfix]: use GCD to robustly configure sp and rp dimensions for any world_size by @0russwest0 in #5698
[megatron] Fix SP & LoRA by @Jintao-Huang in #5704
[megatron] Support ovis2.5 by @Jintao-Huang in #5719
[template] update get_env_args & load_from_cache_file by @Jintao-Huang in #5730
[bugfix] fix qwen3 swift pt by @Jintao-Huang in #5741
fix sp grpo by @tastelikefeet in #5744
Fix multiple input issue and more_params for web-ui by @slin000111 in #5739
[bugfix] set default padding side to left for generative reranker by @0russwest0 in #5751
[bugfix] correct multi-GPU reranker evaluation metric calculation by @0russwest0 in #5755
wrap base_model into get_llm_model by @tastelikefeet in #5749
[bugfix] fix forward_context by @Jintao-Huang in #5757
[bugfix] update use_barrier -> True by @Jintao-Huang in #5763
support Seed-OSS-36B-Instruct by @hpsun1109 in #5761
[bugfix] fix megatron model_type by @Jintao-Huang in #5767
Refactor grpo padding free by @tastelikefeet in #5769
Update seed.py by @hpsun1109 in #5725
[model] Support qwen3_next (transformers) by @Jintao-Huang in #5782
[megatron] fix text_position_ids by @Jintao-Huang in #5783
[model] support Step Audio2 mini by @CJack812 in #5731
[bugfix] update query placeholder in TextCapsEmbPreprocessor by @0russwest0 in #5774
[BREAKING] refactor embedding template by @0russwest0 in #5787
[BREAKING] refactor reranker template by @0russwest0 in #5768
[model] update step_audio_2_mini by @Jintao-Huang in #5790
[model] Support qwen3Next (megatron) by @Jintao-Huang in #5764
[bugfix] fix qwen2_5_vl device_map8 by @Jintao-Huang in #5800
add qwen3 coder agent template by @ray075hl in #5734
[agent_template] Update qwen3 coder agent template by @Jintao-Huang in #5802
[bugfix] fix ovis2_5 by @Jintao-Huang in #5803
Support ernie-thinking and gemma-emb by @tastelikefeet in #5792
feat: Add DeepSeek V3.1 Agent Template Support by @gakkiri in #5777
[agent-template] update deepseek v3.1 agent_template by @Jintao-Huang in #5816
[bugfix] fix margin by @Jintao-Huang in #5817
[bugfix] fix template extra_kwargs by @Jintao-Huang in #5821
[grpo] fix log std_zero by @hjh0119 in #5813
[bugfix] Fix aux loss & (gradient_accumulation_steps & loss_scale) by @Jintao-Huang in #5823
Add support for Keye-VL-1_5-8B by @hellopahe in #5815
update requirements by @Jintao-Huang in #5826
[bugfix] fix SglangEngine by @Jintao-Huang in #5828
[model] support ring2 ling2 by @Jintao-Huang in #5830
[template] update mllm template & InternVL-HF by @hjh0119 in #5829
[bugfix] fix Qwen3ForSequenceClassification zero3 by @hjh0119 in #5820
fix megatron flash_attn (flash_attention_3) by @Jintao-Huang in #5837
[bugfix] fix grpo mllm multi turn by @hjh0119 in #5840
[image] update swift image by @Jintao-Huang in #5847
[bugfix] fix keye_vl by @Jintao-Huang in #5848
[bugfix] fix internvl3_hf by @Jintao-Huang in #5852
[megatron] Support megatron internvl3-hf/internvl3.5-hf by @Jintao-Huang in #5853
[model] support tongyi deepresearch by @Jintao-Huang in #5854
[megatron] fix multimodal pp by @Jintao-Huang in #5857
register qwen3_coder by @mgilmore-relace in #5855
[bugfix] fix qwen3_next packing(OOM); fix cp by @Jintao-Huang in #5859
[megatron] fix megatron multimodal pp by @Jintao-Huang in #5862
[megatron] compat mcore 0.12 by @Jintao-Huang in #5867
docs(Instruction): add mcore_adapters parameter to export arguments by @zzc0430 in #5870
[bugfix] fix grpo pt_engine & padding_free by @Jintao-Huang in #5874
[megatron] Support kimi vl megatron by @Jintao-Huang in #5872
[bugfix] fix megatron multimodal modules_to_save by @Jintao-Huang in #5876
Fix: using DDP for PPO training causes AttributeError: 'DistributedDataParallel' object has no attribute 'config' by @kiritoxkiriko in #5822
Support reranker inference by @tastelikefeet in #5883
Fix exception when converting JSON dict strings on Windows by @liulei08 in #5804
fix embedding encode by @tastelikefeet in #5885
[megatron] support megatron seq_cls task_type by @Jintao-Huang in #5759
[bugfix] fix megatron seq_cls by @Jintao-Huang in #5888
chord support loss_scale & update template loss_scale is_binary by @hjh0119 in #5886
[docs] update rejected_tools by @Jintao-Huang in #5878
[bugfix] Fix circular references by @Jintao-Huang in #5892
[grpo] Support PYTORCH_CUDA_ALLOC_CONF environment variable by @hjh0119 in #5897
fix evalscope config dump error by @Yunnglin in #5899
update wechat by @tastelikefeet in #5903
compat trl 0.15 by @Jintao-Huang in #5905
Fix sp non-padding-free by @tastelikefeet in #5906
[model] support qwen3_next fp8 by @Jintao-Huang in #5909
fix embedding encode by @tastelikefeet in #5912
[model] Support Qwen3-Omni (transformers & megatron) by @Jintao-Huang in #5900
[bugfix] fix qwen3_omni audio packing by @Jintao-Huang in #5918
[model] support Qwen3-VL (transformers/megatron) by @Jintao-Huang in #5805
[bugfix] fix mcore to hf by @Jintao-Huang in #5929
[bugfix] fix omni norm_bbox by @Jintao-Huang in #5930
[bugfix] fix infer_backend lmdeploy by @Jintao-Huang in #5931
support Sail-VL2 models by @hjh0119 in #5921
[bugfix] fix qwen3_vl video test by @Jintao-Huang in #5932
[chord] support dataset list by @hjh0119 in #5933
[template] Support image list by @Jintao-Huang in #5954
[megatron] Support qwen3-vl/qwen3-omni cp by @Jintao-Huang in #5952
fix galore by @tastelikefeet in #5957
[dataset] update load_from_cache_file by @Jintao-Huang in #5961
[shell] update qwen3_omni shell by @Jintao-Huang in #5976
[tests] add qwen2_5_vl batch_infer test by @Jintao-Huang in #5975
[bugfix] fix grpo padding_free by @hjh0119 in #5965
[bugfix] fix json_parse_to_dict by @Jintao-Huang in #5996
Support emb/reranker/seq_cls padding_free by @tastelikefeet in #6007
[grpo] fix gspo & rollout template register by @hjh0119 in #6014
megatron swift support KTO by @kevssim in #5971
[megatron] optimize dpo main_grad (GPU memory) by @Jintao-Huang in #6027
[megatron] support vpp by @Jintao-Huang in #5997
[model] support GLM4.6 by @Jintao-Huang in #6028
update swift image by @Jintao-Huang in #6030
[fix] swift eval parameter dataset_args is replaced by eval_dataset_args by @liulei08 in #5969
[model] support DeepSeek-V3.1-Terminus by @Jintao-Huang in #6031
[rlhf] kto support padding_free/packing by @Jintao-Huang in #6032
[model] support Qwen/Qwen3-VL-30B-A3B-Instruct/Thinking by @Jintao-Huang in #6037
compat vllm 0.11 by @Jintao-Huang in #6043
[megatron] update megatron kto by @Jintao-Huang in #6036
[bugfix] fix megatron rope_scaling by @Jintao-Huang in #6056
compat sglang 0.5.3 by @Jintao-Huang in #6057
[bugfix] fix reward_model by @Jintao-Huang in #6060
[bugfix] fix qwen3_omni config by @Jintao-Huang in #6071
Update FAQ by @slin000111 in #6077
update docs by @Jintao-Huang in #6073
Update link for sequence parallel example by @slin000111 in #6078
compat qwen3_vl zero3 by @Jintao-Huang in #6080
update z3_leaf_modules by @Jintao-Huang in #6082
fix multi-modal padding_free for seq_cls by @tastelikefeet in #6087
fix padding free for reranker by @0russwest0 in #6088
fix the compute of accuracy for reranker by @0russwest0 in #6089
[docs] update docs by @Jintao-Huang in #6090
fix aux loss with ulysses by @tastelikefeet in #6098
[megatron] support reward model by @Jintao-Huang in #6093
Fix embedding padding_free by @tastelikefeet in #6100
[bugfix] fix streaming by @Jintao-Huang in #6104
[megatron] fix megatron-swift seq_cls by @Jintao-Huang in #6115

New Contributors

@hpsun1109 made their first contribution in #5761
@CJack812 made their first contribution in #5731
@ray075hl made their first contribution in #5734
@gakkiri made their first contribution in #5777
@mgilmore-relace made their first contribution in #5855
@zzc0430 made their first contribution in #5870
@liulei08 made their first contribution in #5804

Full Changelog: v3.8.0...v3.9.0
