
Commit 344cd17

Merge branch 'main' into release/3.0
2 parents b0fde84 + c98e538 commit 344cd17

35 files changed: +447 additions, -161 deletions

README.md

Lines changed: 2 additions & 2 deletions
@@ -67,7 +67,7 @@ You can contact us and communicate with us by adding our group:
 - 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
 - **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM for both pure text and multi-modal large models.
+- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both pure text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
 - **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.

@@ -83,7 +83,7 @@ You can contact us and communicate with us by adding our group:
 - 🎉 2024.08.12: The SWIFT paper has been published on arXiv, and you can read it [here](https://arxiv.org/abs/2408.05517).
 - 🔥 2024.08.05: Support for using [evalscope](https://github.com/modelscope/evalscope/) as a backend for evaluating large models and multimodal models.
 - 🔥 2024.07.29: Support for using [vllm](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) to accelerate inference for large models and multimodal models. When performing infer/deploy/eval, you can specify `--infer_backend vllm/lmdeploy`.
-- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM.
+- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM/PPO.
 - 🔥 2024.02.01: Support for Agent training! The training algorithm is derived from [this paper](https://arxiv.org/pdf/2309.00986.pdf).

README_CN.md

Lines changed: 2 additions & 2 deletions
@@ -64,7 +64,7 @@
 - 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
 - **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models such as BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM for both pure text and multi-modal large models.
+- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both pure text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training models of different modalities (image, video, audio), covering tasks such as VQA, captioning, OCR, and grounding.
 - **Interface Training**: Provides training, inference, evaluation, and quantization through a web interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.

@@ -78,7 +78,7 @@
 - 🎉 2024.08.12: The SWIFT paper has been published on arXiv; you can read it [here](https://arxiv.org/abs/2408.05517).
 - 🔥 2024.08.05: Support for using [evalscope](https://github.com/modelscope/evalscope/) as a backend for evaluating large models and multimodal models.
 - 🔥 2024.07.29: Support for using [vllm](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) to accelerate inference for large models and multimodal models; simply specify `--infer_backend vllm/lmdeploy` during infer/deploy/eval.
-- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM.
+- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM/PPO.
 - 🔥 2024.02.01: Support for Agent training! The training algorithm is derived from this [paper](https://arxiv.org/pdf/2309.00986.pdf).

 ## 🛠️ Installation

docs/source/Customization/自定义数据集.md

Lines changed: 8 additions & 0 deletions
@@ -67,6 +67,14 @@ query-response format:
 {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
 ```

+#### PPO
+
+```jsonl
+{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
+{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
+{"messages": [{"role": "user", "content": "What is your name?"}]}
+```
+
 ### Sequence Classification
 ```jsonl
 {"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}

docs/source/GetStarted/快速开始.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ ms-swift is the large model and multi-modal large model training and deployment framework provided by the ModelScope community.
 - 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
 - Distributed Training: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
 - Quantization Training: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM for both pure text and multi-modal large models.
+- RLHF Training: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both pure text and multi-modal large models.
 - 🍓 Multi-Modal Training: Supports training models of different modalities (image, video, audio), covering tasks such as VQA, captioning, OCR, and grounding.
 - Interface Training: Provides training, inference, evaluation, and quantization through a web interface, completing the whole large model pipeline.
 - Plugins and Extensions: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.

docs/source/Instruction/ReleaseNote3.0.md

Lines changed: 2 additions & 3 deletions
@@ -81,7 +81,6 @@

 ## To-do

-1. RM/PPO capabilities are not yet supported in version 3.0; please use version 2.6.1
-2. Custom dataset evaluation is not yet supported in version 3.0; please use version 2.6.1
-3. Megatron pre-training is not yet supported in version 3.0; please use version 2.6.1
+1. Custom dataset evaluation is not yet supported in version 3.0; please use version 2.6.1
+2. Megatron pre-training is not yet supported in version 3.0; please use version 2.6.1
 3. Documentation and README have not yet been fully updated

docs/source/Instruction/命令行参数.md

Lines changed: 23 additions & 2 deletions
@@ -50,7 +50,7 @@
 - 🔥max_pixels: Maximum pixel count (H\*W) for image pre-processing in multimodal models; images are not scaled by default.
 - tools_prompt: The format used to convert the tool list into the system prompt for agent training; refer to [Agent Training](./智能体的支持.md). Defaults to 'react_en'.
 - padding_side: The padding_side used when training with `batch_size>=2`; optional values are 'left' and 'right', defaulting to 'right'. (When the batch_size in `generate` is >= 2, only left padding is applied.)
-- loss_scale: How to weight the loss of tokens during training. Defaults to `'default'`, meaning all responses (including history) are weighted 1 in the cross-entropy loss. See [Pluginization](../Customization/插件化.md) and [Agent Training](./智能体的支持.md) for details.
+- loss_scale: How to weight the loss of tokens during training. Defaults to `'default'`, meaning all responses (including history) are weighted 1 in the cross-entropy loss. Optional values are 'default', 'last_round', 'all', plus the loss scales required by agents: 'react', 'agentflan', 'alpha_umi', 'qwen'. See [Pluginization](../Customization/插件化.md) and [Agent Training](./智能体的支持.md) for details.
 - sequence_parallel_size: Number of sequence parallelism. Refer to the [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh).
 - use_chat_template: Use the chat template or the generation template; defaults to `True`. `swift pt` automatically switches to the generation template.
 - template_backend: Use swift or jinja for inference. If jinja is used, transformers' `apply_chat_template` is applied. Defaults to swift.
@@ -307,7 +307,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, `modules_to_save`.
 ### RLHF Arguments
 RLHF arguments inherit from the [training arguments](#训练参数).

-- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`.
+- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`, `ppo`.
 - ref_model: Original comparison model in algorithms like DPO.
 - ref_model_type: Same as model_type.
 - ref_model_revision: Same as model_revision.
@@ -324,6 +324,27 @@
 - desirable_weight: Loss weight $\lambda_D$ for desirable responses in the KTO algorithm, defaults to `1.`.
 - undesirable_weight: Loss weight $\lambda_U$ for undesirable responses in the KTO paper, defaults to `1.`.

+#### PPO Arguments
+- reward_model: Defaults to None
+- reward_adapters: Defaults to `[]`
+- reward_model_type: Defaults to None
+- reward_model_revision: Defaults to None
+
+The meanings of the following parameters can be referenced [here](https://huggingface.co/docs/trl/main/ppo_trainer):
+- num_ppo_epochs: Defaults to 4
+- whiten_rewards: Defaults to False
+- kl_coef: Defaults to 0.05
+- cliprange: Defaults to 0.2
+- vf_coef: Defaults to 0.1
+- cliprange_value: Defaults to 0.2
+- gamma: Defaults to 1.0
+- lam: Defaults to 0.95
+- num_mini_batches: Defaults to 1
+- local_rollout_forward_batch_size: Defaults to 64
+- num_sample_generations: Defaults to 10
+- response_length: Defaults to 512
+- temperature: Defaults to 0.7
+- missing_eos_penalty: Defaults to None

 ### Inference Arguments

docs/source/Instruction/推理和部署.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ SWIFT supports inference and deployment via the command line, Python code, and a web UI:
 - Use `engine.infer` or `engine.infer_async` for inference in Python. See [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py).
 - Use `swift infer` for inference from the command line. See [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/cli_demo.sh).
 - Use `swift deploy` to deploy a service and run inference via the OpenAI API or `client.infer`. See [here](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/server) for the server side and [here](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/client) for the client side.
-- Use `swift app` to deploy the model for web-UI inference. See [here](../GetStarted/界面使用.md).
+- Use `swift app` to deploy the model for web-UI inference. See [here](../GetStarted/Web-UI.md).

 ## Command-Line Inference

docs/source_en/Customization/Custom-dataset.md

Lines changed: 8 additions & 0 deletions
@@ -66,6 +66,14 @@ The following provides the recommended dataset format for ms-swift, where the sy
 {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
 ```

+#### PPO
+
+```jsonl
+{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
+{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
+{"messages": [{"role": "user", "content": "What is your name?"}]}
+```
+
 ### Sequence Classification
 ```jsonl
 {"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}

docs/source_en/GetStarted/Quick-start.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ ms-swift is a comprehensive training and deployment framework for large language
 - 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more.
 - Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, and other distributed training technologies.
 - Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM for both text-based and multimodal large models.
+- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both text-based and multimodal large models.
 - 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding.
 - Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models.
 - Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc.

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 30 additions & 6 deletions
@@ -50,7 +50,7 @@ The introduction to command line parameters will cover base arguments, atomic ar
 - 🔥max_pixels: Maximum pixel count for pre-processing images in multimodal models (H*W), default is no scaling.
 - tools_prompt: The list of tools for agent training converted to system format, refer to [Agent Training](./Agent-support.md), default is 'react_en'.
 - padding_side: The padding_side used when training with `batch_size >= 2`, with optional values of 'left' and 'right', defaulting to 'right'. (When the batch_size in `generate` is >= 2, only left padding is applied.)
-- loss_scale: How to add token loss weight during training. Default is `'default'`, meaning all responses (including history) are treated as 1 for cross-entropy loss. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md).
+- loss_scale: How to add token loss weight during training. Default is `'default'`, meaning all responses (including history) are treated as 1 for cross-entropy loss. The optional values are 'default', 'last_round', 'all', and the loss scale required by the agent: 'react', 'agentflan', 'alpha_umi', 'qwen'. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md).
 - sequence_parallel_size: Number of sequence parallelism. Refer to [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh).
 - use_chat_template: Use chat template or generation template, default is `True`. `swift pt` is automatically set to the generation template.
 - template_backend: Use swift or jinja for inference. If using jinja, it will utilize transformers' `apply_chat_template`. Default is swift.
@@ -311,23 +311,47 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine

 RLHF arguments inherit from the [training arguments](#training-arguments).

-- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`.
+- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`, `ppo`.
 - ref_model: Original comparison model in algorithms like DPO.
 - ref_model_type: Same as model_type.
 - ref_model_revision: Same as model_revision.

 - 🔥beta: KL regularization term coefficient, default is `None`, i.e., for `simpo` algorithm default is `2.`, for other algorithms default is `0.1`. Refer to the [documentation](./Human-alignment.md) for specifics.
 - label_smoothing: Whether to use DPO smoothing, default value is `0`, generally set between 0~0.5.
--
+
 - 🔥rpo_alpha: Weight for adding sft_loss in DPO, default is `1`. The final loss is `KL_loss + rpo_alpha * sft_loss`.
--
+
 - cpo_alpha: The coefficient of nll loss in CPO/SimPO loss, default is `1.`.
--
+
 - simpo_gamma: Reward margin term in SimPO algorithm, recommended to set between 0.5-1.5 in the paper, default is `1.`.
--
+
 - desirable_weight: Loss weight for desirable response in KTO algorithm $\lambda_D$, default is `1.`.
 - undesirable_weight: Loss weight for undesirable response in KTO paper $\lambda_U$, default is `1.`.

+#### PPO Arguments
+
+- reward_model: Defaults to None
+- reward_adapters: Defaults to `[]`
+- reward_model_type: Defaults to None
+- reward_model_revision: Defaults to None
+
+The meanings of the following parameters can be referenced [here](https://huggingface.co/docs/trl/main/ppo_trainer):
+
+- num_ppo_epochs: Defaults to 4
+- whiten_rewards: Defaults to False
+- kl_coef: Defaults to 0.05
+- cliprange: Defaults to 0.2
+- vf_coef: Defaults to 0.1
+- cliprange_value: Defaults to 0.2
+- gamma: Defaults to 1.0
+- lam: Defaults to 0.95
+- num_mini_batches: Defaults to 1
+- local_rollout_forward_batch_size: Defaults to 64
+- num_sample_generations: Defaults to 10
+- response_length: Defaults to 512
+- temperature: Defaults to 0.7
+- missing_eos_penalty: Defaults to None
+
 ### Inference Arguments

 Inference arguments include the [base arguments](#base-arguments), [merge arguments](#merge-arguments), [vLLM arguments](#vllm-arguments), [LMDeploy arguments](#LMDeploy-arguments), and also contain the following:
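For orientation only, here is a hedged sketch of how the PPO arguments documented above might be combined into a launch command. It assumes ms-swift's usual convention that each documented argument is exposed as a same-named `--flag`, plus the standard `--model`/`--dataset` base arguments; the model IDs and dataset path are placeholders, so consult the repository's examples for an authoritative script.

```bash
# Hypothetical PPO launch sketch: --rlhf_type and --reward_model come from the
# RLHF/PPO arguments above; the numeric values repeat the documented defaults.
# <policy-model>, <reward-model>, and ppo_prompts.jsonl are placeholders.
swift rlhf \
    --rlhf_type ppo \
    --model <policy-model> \
    --reward_model <reward-model> \
    --dataset ppo_prompts.jsonl \
    --num_ppo_epochs 4 \
    --kl_coef 0.05 \
    --cliprange 0.2 \
    --vf_coef 0.1 \
    --gamma 1.0 \
    --lam 0.95 \
    --response_length 512 \
    --temperature 0.7
```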
