
Commit 344cd17

Merge branch 'main' into release/3.0
2 parents b0fde84 + c98e538 commit 344cd17

35 files changed: +447 additions, -161 deletions

README.md

Lines changed: 2 additions & 2 deletions
@@ -67,7 +67,7 @@ You can contact us and communicate with us by adding our group:
 - 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
 - **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM for both pure text and multi-modal large models.
+- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both pure text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
 - **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.

@@ -83,7 +83,7 @@ You can contact us and communicate with us by adding our group:
 - 🎉 2024.08.12: The SWIFT paper has been published on arXiv, and you can read it [here](https://arxiv.org/abs/2408.05517).
 - 🔥 2024.08.05: Support for using [evalscope](https://github.com/modelscope/evalscope/) as a backend for evaluating large models and multimodal models.
 - 🔥 2024.07.29: Support for using [vllm](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) to accelerate inference for large models and multimodal models. When performing infer/deploy/eval, you can specify `--infer_backend vllm/lmdeploy`.
-- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM.
+- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM/PPO.
 - 🔥 2024.02.01: Support for Agent training! The training algorithm is derived from [this paper](https://arxiv.org/pdf/2309.00986.pdf).

README_CN.md

Lines changed: 2 additions & 2 deletions
@@ -64,7 +64,7 @@
 - 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
 - **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models such as BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM for both pure text and multi-modal large models.
+- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both pure text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training models of different modalities (image, video, audio), covering tasks such as VQA, captioning, OCR, and grounding.
 - **Interface Training**: Provides training, inference, evaluation, and quantization through a web interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.

@@ -78,7 +78,7 @@
 - 🎉 2024.08.12: The SWIFT paper has been published on arXiv; you can read it [here](https://arxiv.org/abs/2408.05517).
 - 🔥 2024.08.05: Support for using [evalscope](https://github.com/modelscope/evalscope/) as a backend for evaluating large models and multimodal models.
 - 🔥 2024.07.29: Support for using [vllm](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) to accelerate inference for large models and multimodal models; simply specify `--infer_backend vllm/lmdeploy` during infer/deploy/eval.
-- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM.
+- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM/PPO.
 - 🔥 2024.02.01: Support for Agent training! The training algorithm is derived from this [paper](https://arxiv.org/pdf/2309.00986.pdf).

 ## 🛠️ Installation

docs/source/Customization/自定义数据集.md

Lines changed: 8 additions & 0 deletions
@@ -67,6 +67,14 @@ query-response format:
 {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
 ```

+#### PPO
+
+```jsonl
+{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
+{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
+{"messages": [{"role": "user", "content": "What is your name?"}]}
+```
+
 ### Sequence Classification
 ```jsonl
 {"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}

docs/source/GetStarted/快速开始.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ ms-swift is the large model and multi-modal large model training and deployment framework provided by the ModelScope community.
 - 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
 - Distributed Training: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
 - Quantization Training: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM for both pure text and multi-modal large models.
+- RLHF Training: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both pure text and multi-modal large models.
 - 🍓 Multi-Modal Training: Supports training models of different modalities (image, video, audio), covering tasks such as VQA, captioning, OCR, and grounding.
 - Interface Training: Provides training, inference, evaluation, and quantization through a web interface, completing the whole large model pipeline.
 - Plugins and Extensions: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.

docs/source/Instruction/ReleaseNote3.0.md

Lines changed: 2 additions & 3 deletions
@@ -81,7 +81,6 @@

 ## To-do

-1. RM/PPO capabilities are not yet supported in version 3.0; please use version 2.6.1
-2. Custom dataset evaluation is not yet supported in version 3.0; please use version 2.6.1
-3. Megatron pre-training is not yet supported in version 3.0; please use version 2.6.1
+1. Custom dataset evaluation is not yet supported in version 3.0; please use version 2.6.1
+2. Megatron pre-training is not yet supported in version 3.0; please use version 2.6.1
 3. Documentation and README have not yet been fully updated

docs/source/Instruction/命令行参数.md

Lines changed: 23 additions & 2 deletions
@@ -50,7 +50,7 @@
 - 🔥max_pixels: Maximum pixel count (H\*W) for image pre-processing in multimodal models; images are not scaled by default.
 - tools_prompt: The format used to convert the tool list into the system prompt for agent training; refer to [Agent Training](./智能体的支持.md). Defaults to 'react_en'.
 - padding_side: The padding_side used when training with `batch_size>=2`; optional values are 'left' and 'right', defaulting to 'right'. (When the batch_size in `generate` is >= 2, only left padding is applied.)
-- loss_scale: How to weight the loss of tokens during training. Defaults to `'default'`, meaning all responses (including history) are weighted 1 in the cross-entropy loss. See [Pluginization](../Customization/插件化.md) and [Agent Training](./智能体的支持.md) for details.
+- loss_scale: How to weight the loss of tokens during training. Defaults to `'default'`, meaning all responses (including history) are weighted 1 in the cross-entropy loss. Optional values are 'default', 'last_round', 'all', plus the loss scales required by agents: 'react', 'agentflan', 'alpha_umi', 'qwen'. See [Pluginization](../Customization/插件化.md) and [Agent Training](./智能体的支持.md) for details.
 - sequence_parallel_size: Number of sequence parallelism. Refer to the [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh).
 - use_chat_template: Use the chat template or the generation template; defaults to `True`. `swift pt` automatically switches to the generation template.
 - template_backend: Use swift or jinja for inference. If jinja is used, transformers' `apply_chat_template` is applied. Defaults to swift.
@@ -307,7 +307,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, `modules_to_save`.
 ### RLHF Arguments
 RLHF arguments inherit from the [training arguments](#训练参数).

-- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`.
+- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`, `ppo`.
 - ref_model: Original comparison model in algorithms like DPO.
 - ref_model_type: Same as model_type.
 - ref_model_revision: Same as model_revision.
@@ -324,6 +324,27 @@
 - desirable_weight: Loss weight $\lambda_D$ for desirable responses in the KTO algorithm, defaults to `1.`.
 - undesirable_weight: Loss weight $\lambda_U$ for undesirable responses in the KTO paper, defaults to `1.`.

+#### PPO Arguments
+- reward_model: Defaults to None
+- reward_adapters: Defaults to `[]`
+- reward_model_type: Defaults to None
+- reward_model_revision: Defaults to None
+
+The meanings of the following parameters can be referenced [here](https://huggingface.co/docs/trl/main/ppo_trainer):
+- num_ppo_epochs: Defaults to 4
+- whiten_rewards: Defaults to False
+- kl_coef: Defaults to 0.05
+- cliprange: Defaults to 0.2
+- vf_coef: Defaults to 0.1
+- cliprange_value: Defaults to 0.2
+- gamma: Defaults to 1.0
+- lam: Defaults to 0.95
+- num_mini_batches: Defaults to 1
+- local_rollout_forward_batch_size: Defaults to 64
+- num_sample_generations: Defaults to 10
+- response_length: Defaults to 512
+- temperature: Defaults to 0.7
+- missing_eos_penalty: Defaults to None

 ### Inference Arguments

docs/source/Instruction/推理和部署.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ SWIFT supports inference and deployment via the command line, Python code, and a web UI:
 - Use `engine.infer` or `engine.infer_async` for inference in Python. See [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py).
 - Use `swift infer` for inference from the command line. See [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/cli_demo.sh).
 - Use `swift deploy` to deploy a service and run inference via the OpenAI API or `client.infer`. See [here](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/server) for the server side and [here](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/client) for the client side.
-- Use `swift app` to deploy the model for web-UI inference. See [here](../GetStarted/界面使用.md).
+- Use `swift app` to deploy the model for web-UI inference. See [here](../GetStarted/Web-UI.md).

 ## Command-Line Inference

docs/source_en/Customization/Custom-dataset.md

Lines changed: 8 additions & 0 deletions
@@ -66,6 +66,14 @@ The following provides the recommended dataset format for ms-swift, where the sy
 {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
 ```

+#### PPO
+
+```jsonl
+{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
+{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
+{"messages": [{"role": "user", "content": "What is your name?"}]}
+```
+
 ### Sequence Classification
 ```jsonl
 {"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}

docs/source_en/GetStarted/Quick-start.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ ms-swift is a comprehensive training and deployment framework for large language
 - 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more.
 - Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, and other distributed training technologies.
 - Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM for both text-based and multimodal large models.
+- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both text-based and multimodal large models.
 - 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding.
 - Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models.
 - Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc.

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 30 additions & 6 deletions
@@ -50,7 +50,7 @@ The introduction to command line parameters will cover base arguments, atomic ar
 - 🔥max_pixels: Maximum pixel count for pre-processing images in multimodal models (H*W), default is no scaling.
 - tools_prompt: The list of tools for agent training converted to system format, refer to [Agent Training](./Agent-support.md), default is 'react_en'.
 - padding_side: The padding_side used when training with `batch_size >= 2`, with optional values of 'left' and 'right', defaulting to 'right'. (When the batch_size in `generate` is >= 2, only left padding is applied.)
-- loss_scale: How to add token loss weight during training. Default is `'default'`, meaning all responses (including history) are treated as 1 for cross-entropy loss. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md).
+- loss_scale: How to add token loss weight during training. Default is `'default'`, meaning all responses (including history) are treated as 1 for cross-entropy loss. The optional values are 'default', 'last_round', 'all', and the loss scale required by the agent: 'react', 'agentflan', 'alpha_umi', 'qwen'. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md).
 - sequence_parallel_size: Number of sequence parallelism. Refer to [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh).
 - use_chat_template: Use chat template or generation template, default is `True`. `swift pt` is automatically set to the generation template.
 - template_backend: Use swift or jinja for inference. If using jinja, it will utilize transformers' `apply_chat_template`. Default is swift.
@@ -311,23 +311,47 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine

 RLHF arguments inherit from the [training arguments](#training-arguments).

-- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`.
+- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`, `ppo`.
 - ref_model: Original comparison model in algorithms like DPO.
 - ref_model_type: Same as model_type.
 - ref_model_revision: Same as model_revision.

 - 🔥beta: KL regularization term coefficient, default is `None`, i.e., for `simpo` algorithm default is `2.`, for other algorithms default is `0.1`. Refer to the [documentation](./Human-alignment.md) for specifics.
 - label_smoothing: Whether to use DPO smoothing, default value is `0`, generally set between 0~0.5.
--
+
 - 🔥rpo_alpha: Weight for adding sft_loss in DPO, default is `1`. The final loss is `KL_loss + rpo_alpha * sft_loss`.
--
+
 - cpo_alpha: The coefficient of nll loss in CPO/SimPO loss, default is `1.`.
--
+
 - simpo_gamma: Reward margin term in SimPO algorithm, recommended to set between 0.5-1.5 in the paper, default is `1.`.
--
+
 - desirable_weight: Loss weight for desirable response in KTO algorithm $\lambda_D$, default is `1.`.
 - undesirable_weight: Loss weight for undesirable response in KTO paper $\lambda_U$, default is `1.`.

+#### PPO Arguments
+
+- reward_model: Defaults to None
+- reward_adapters: Defaults to `[]`
+- reward_model_type: Defaults to None
+- reward_model_revision: Defaults to None
+
+The meanings of the following parameters can be referenced [here](https://huggingface.co/docs/trl/main/ppo_trainer):
+
+- num_ppo_epochs: Defaults to 4
+- whiten_rewards: Defaults to False
+- kl_coef: Defaults to 0.05
+- cliprange: Defaults to 0.2
+- vf_coef: Defaults to 0.1
+- cliprange_value: Defaults to 0.2
+- gamma: Defaults to 1.0
+- lam: Defaults to 0.95
+- num_mini_batches: Defaults to 1
+- local_rollout_forward_batch_size: Defaults to 64
+- num_sample_generations: Defaults to 10
+- response_length: Defaults to 512
+- temperature: Defaults to 0.7
+- missing_eos_penalty: Defaults to None
+
 ### Inference Arguments

 Inference arguments include the [base arguments](#base-arguments), [merge arguments](#merge-arguments), [vLLM arguments](#vllm-arguments), [LMDeploy arguments](#LMDeploy-arguments), and also contain the following:
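For orientation only, here is a hedged sketch of how the PPO arguments documented above might be combined into a launch command. It assumes ms-swift's usual convention that each documented argument is exposed as a same-named `--flag`, plus the standard `--model`/`--dataset` base arguments; the model IDs and dataset path are placeholders, so consult the repository's examples for an authoritative script.

```bash
# Hypothetical PPO launch sketch: --rlhf_type and --reward_model come from the
# RLHF/PPO arguments above; the numeric values repeat the documented defaults.
# <policy-model>, <reward-model>, and ppo_prompts.jsonl are placeholders.
swift rlhf \
    --rlhf_type ppo \
    --model <policy-model> \
    --reward_model <reward-model> \
    --dataset ppo_prompts.jsonl \
    --num_ppo_epochs 4 \
    --kl_coef 0.05 \
    --cliprange 0.2 \
    --vf_coef 0.1 \
    --gamma 1.0 \
    --lam 0.95 \
    --response_length 512 \
    --temperature 0.7
```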
