```diff
 - **Distributed Training**: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models such as BNB, AWQ, GPTQ, AQLM, HQQ, and EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, and RM for both pure-text and multi-modal large models.
+- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, and PPO for both pure-text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training on different modalities such as images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
 - **Interface Training**: Provides training, inference, evaluation, and quantization through an interface, covering the whole large-model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, and optimizer.
@@ -83,7 +83,7 @@ You can contact us and communicate with us by adding our group:
 - 🎉 2024.08.12: The SWIFT paper has been published on arXiv; you can read it [here](https://arxiv.org/abs/2408.05517).
 - 🔥 2024.08.05: Support for using [evalscope](https://github.com/modelscope/evalscope/) as a backend for evaluating large models and multimodal models.
 - 🔥 2024.07.29: Support for using [vllm](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) to accelerate inference for large models and multimodal models. When performing infer/deploy/eval, you can specify `--infer_backend vllm/lmdeploy`.
-- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM.
+- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM/PPO.
 - 🔥 2024.02.01: Support for Agent training! The training algorithm is derived from [this paper](https://arxiv.org/pdf/2309.00986.pdf).
```
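As a hedged illustration of the `--infer_backend` switch mentioned in the 2024.07.29 entry above, an accelerated inference call might look like the sketch below; only the `--infer_backend` flag is taken from the text, while the subcommand spelling and the model id are assumptions or placeholders.

```bash
# Minimal sketch: accelerated inference with the vLLM backend.
# --infer_backend is the documented flag; the model id is a placeholder.
swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --infer_backend vllm
```

The text states that the same flag applies to deploy and eval as well, so swapping the subcommand, or choosing `lmdeploy` as the backend, should follow the same pattern.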
**docs/source_en/GetStarted/Quick-start.md** (1 addition, 1 deletion)
```diff
@@ -8,7 +8,7 @@ ms-swift is a comprehensive training and deployment framework for large language
 - 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-GaLore, LISA, UnSloth, Liger-Kernel, and more.
 - Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training technologies.
 - Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, and EETQ.
-- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, and RM for both text-based and multimodal large models.
+- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM, and PPO for both text-based and multimodal large models.
 - 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audio; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding.
 - Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models.
 - Plugins and Extensions: Allows customization and extension of models and datasets, and supports customization of components like loss, metric, trainer, loss-scale, callback, optimizer, etc.
```
**docs/source_en/Instruction/Command-line-parameters.md** (30 additions, 6 deletions)
```diff
@@ -50,7 +50,7 @@ The introduction to command line parameters will cover base arguments, atomic ar
 - 🔥max_pixels: Maximum pixel count (H*W) for pre-processing images in multimodal models; the default is no scaling.
 - tools_prompt: The system-prompt format into which the tool list is converted for agent training; refer to [Agent Training](./Agent-support.md). Default is 'react_en'.
 - padding_side: The padding side used when training with `batch_size >= 2`; optional values are 'left' and 'right', defaulting to 'right'. (When the batch_size in `generate` is >= 2, only left padding is applied.)
-- loss_scale: How to weight the token loss during training. Default is `'default'`, meaning all responses (including history) are weighted 1 in the cross-entropy loss. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md).
+- loss_scale: How to weight the token loss during training. Default is `'default'`, meaning all responses (including history) are weighted 1 in the cross-entropy loss. The optional values are 'default', 'last_round', 'all', plus the loss scales required for agent training: 'react', 'agentflan', 'alpha_umi', 'qwen'. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md).
 - sequence_parallel_size: Degree of sequence parallelism. Refer to this [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh).
 - use_chat_template: Use the chat template or the generation template; default is `True`. `swift pt` automatically switches to the generation template.
 - template_backend: Use swift or jinja to render templates. If jinja is used, transformers' `apply_chat_template` is applied. Default is swift.
```
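A minimal sketch of how the template-related options in this hunk could be combined for an agent fine-tuning run; `--loss_scale`, `--padding_side`, and `--tools_prompt` are the flags documented above, while the `swift sft` subcommand and the model and dataset values are assumptions or placeholders.

```bash
# Minimal sketch: agent SFT with a non-default loss_scale.
# Only loss_scale / padding_side / tools_prompt come from the documented list;
# the model id and dataset path are hypothetical placeholders.
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset my_agent_data.jsonl \
    --tools_prompt react_en \
    --loss_scale react \
    --padding_side right
```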
```diff
@@ -311,23 +311,47 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
 RLHF arguments inherit from the [training arguments](#training-arguments).
 - ref_model: The reference model used for comparison in algorithms such as DPO.
 - ref_model_type: Same as model_type.
 - ref_model_revision: Same as model_revision.
 
 - 🔥beta: KL regularization coefficient; default is `None`, i.e. `2.` for the `simpo` algorithm and `0.1` for the other algorithms. Refer to the [documentation](./Human-alignment.md) for specifics.
 - label_smoothing: Whether to use DPO smoothing; default is `0`, generally set between 0 and 0.5.
 
 - 🔥rpo_alpha: Weight of the sft_loss added in DPO; default is `1`. The final loss is `KL_loss + rpo_alpha * sft_loss`.
 
 - cpo_alpha: Coefficient of the NLL loss term in the CPO/SimPO loss; default is `1.`.
 
 - simpo_gamma: Reward margin term in the SimPO algorithm; the paper recommends a value between 0.5 and 1.5. Default is `1.`.
 
 - desirable_weight: Loss weight $\lambda_D$ for desirable responses in the KTO algorithm; default is `1.`.
 - undesirable_weight: Loss weight $\lambda_U$ for undesirable responses in the KTO paper; default is `1.`.
 
+#### PPO Arguments
+
+- reward_model: Defaults to `None`.
+- reward_adapters: Defaults to `[]`.
+- reward_model_type: Defaults to `None`.
+- reward_model_revision: Defaults to `None`.
+
+The meanings of the following parameters can be referenced in the [TRL PPOTrainer documentation](https://huggingface.co/docs/trl/main/ppo_trainer):
+
+- num_ppo_epochs: Defaults to `4`.
+- whiten_rewards: Defaults to `False`.
+- kl_coef: Defaults to `0.05`.
+- cliprange: Defaults to `0.2`.
+- vf_coef: Defaults to `0.1`.
+- cliprange_value: Defaults to `0.2`.
+- gamma: Defaults to `1.0`.
+- lam: Defaults to `0.95`.
+- num_mini_batches: Defaults to `1`.
+- local_rollout_forward_batch_size: Defaults to `64`.
+- num_sample_generations: Defaults to `10`.
+- response_length: Defaults to `512`.
+- temperature: Defaults to `0.7`.
+- missing_eos_penalty: Defaults to `None`.
+
 ### Inference Arguments
 
 Inference arguments include the [base arguments](#base-arguments), [merge arguments](#merge-arguments), [vLLM arguments](#vllm-arguments), [LMDeploy arguments](#LMDeploy-arguments), and also contain the following:
```
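To make the new PPO section concrete, below is a hedged sketch of a PPO run that wires together the arguments added in this hunk; the reward-model and PPO hyperparameter flag names are the documented ones, while the `swift rlhf` / `--rlhf_type ppo` invocation and the model, reward-model, and dataset values are assumptions or placeholders based on ms-swift conventions.

```bash
# Minimal sketch: PPO training with an external reward model.
# Hyperparameter values shown are the documented defaults;
# the model / reward_model / dataset values are placeholders.
swift rlhf \
    --rlhf_type ppo \
    --model Qwen/Qwen2-7B-Instruct \
    --reward_model output/reward-model-checkpoint \
    --dataset my_prompts.jsonl \
    --num_ppo_epochs 4 \
    --kl_coef 0.05 \
    --cliprange 0.2 \
    --vf_coef 0.1 \
    --gamma 1.0 \
    --lam 0.95 \
    --response_length 512 \
    --temperature 0.7 \
    --num_sample_generations 10
```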