Fix qwen3 vl sp (modelscope#6514)

tastelikefeet · tastelikefeet · web-flow · commit e53831c9f847 · 2025-11-10T11:35:09.000+08:00
Co-authored-by: tastelikefeet <yuze.zyz@alibaab-inc.com>
diff --git a/docs/source/Instruction/Command-line-parameters.md b/docs/source/Instruction/Command-line-parameters.md
@@ -843,4 +843,4 @@ qwen2_5_omni除了包含qwen2_5_vl和qwen2_audio的模型特定参数外，还
 - VLLM_USE_V1: 用于切换vLLM使用V0/V1版本。
 - SWIFT_TIMEOUT: (ms-swift>=3.10) 若多模态数据集中存在图像URL，该参数用于控制获取图片的timeout，默认为20s。
 - ROOT_IMAGE_DIR: (ms-swift>=3.8) 图像（多模态）资源的根目录。通过设置该参数，可以在数据集中使用相对于 `ROOT_IMAGE_DIR` 的相对路径。默认情况下，是相对于运行目录的相对路径。
-- SWIFT_SINGLE_DEVICE_MODE: (ms-swift>=3.10) 单设备模式，在此模式下，所有进程只能看到一个设备，目前用于兼容PPU设备
+- SWIFT_SINGLE_DEVICE_MODE: (ms-swift>=3.10) 单设备模式，在此模式下，每个进程只能看到一个设备，目前用于兼容PPU设备
diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
@@ -868,4 +868,4 @@ The meanings of the following parameters can be found in the example code [here]
 - VLLM_USE_V1: Used to switch between V0 and V1 versions of vLLM.
 - SWIFT_TIMEOUT: (ms-swift >= 3.10) If the multimodal dataset contains image URLs, this parameter controls the timeout for fetching images, defaulting to 20 seconds.
 - ROOT_IMAGE_DIR: (ms-swift>=3.8) The root directory for image (multimodal) resources. By setting this parameter, relative paths in the dataset can be interpreted relative to `ROOT_IMAGE_DIR`. By default, paths are relative to the current working directory.
-- SWIFT_SINGLE_DEVICE_MODE: (ms-swift>=3.10) Single device mode. In this mode, all processes can only see one device. Currently used for compatibility with PPU devices.
+- SWIFT_SINGLE_DEVICE_MODE: (ms-swift>=3.10) Single device mode. In this mode, each process can only see one device. Currently used for compatibility with PPU devices.
diff --git a/swift/llm/model/model/qwen.py b/swift/llm/model/model/qwen.py
@@ -936,7 +936,7 @@ def _patch_deepstack_process(model):
     def _deepstack_process(self, hidden_states: torch.Tensor, visual_pos_masks: torch.Tensor,
                            visual_embeds: torch.Tensor):
         from swift.trainers.sequence_parallel import sequence_parallel
-        if sequence_parallel.world_size:
+        if sequence_parallel.world_size and visual_pos_masks is not None:
             visual_pos_masks, visual_embeds = sequence_parallel.pad_and_split_mm_tokens(visual_pos_masks, visual_embeds)
         if visual_pos_masks is None:
             return hidden_states + visual_embeds.mean() * 0
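
The effect of the added `visual_pos_masks is not None` check can be illustrated with a small, torch-free sketch. Everything below is a hypothetical stand-in rather than ms-swift code: `FakeSequenceParallel` only mimics the two attributes used in the hunk, it is assumed that the real `pad_and_split_mm_tokens` cannot accept `None`, and the masked add at the end is an assumed simplification of the part of `_deepstack_process` that the diff does not show.

```python
from typing import List, Optional, Tuple


class FakeSequenceParallel:
    """Hypothetical stand-in for the `sequence_parallel` object used in the hunk."""

    def __init__(self, world_size: Optional[int]):
        self.world_size = world_size

    def pad_and_split_mm_tokens(
        self, masks: Optional[List[bool]], embeds: List[float]
    ) -> Tuple[List[bool], List[float]]:
        # Assumption: the real helper needs an actual mask and fails on None.
        if masks is None:
            raise TypeError("visual_pos_masks must not be None")
        return masks, embeds


def deepstack_process(sp, hidden_states, visual_pos_masks, visual_embeds, guarded=True):
    """Toy version of the patched control flow, using plain lists instead of tensors."""
    needs_split = bool(sp.world_size)
    if guarded:
        # The check added by this commit: skip the split for text-only batches.
        needs_split = needs_split and visual_pos_masks is not None
    if needs_split:
        visual_pos_masks, visual_embeds = sp.pad_and_split_mm_tokens(visual_pos_masks, visual_embeds)
    if visual_pos_masks is None:
        # Mirrors `hidden_states + visual_embeds.mean() * 0`: a zero-valued term
        # that still references the visual embeddings.
        zero = (sum(visual_embeds) / len(visual_embeds)) * 0
        return [h + zero for h in hidden_states]
    # Assumed simplification of the unseen remainder: add embeddings at masked positions.
    remaining = iter(visual_embeds)
    return [h + next(remaining) if m else h for h, m in zip(hidden_states, visual_pos_masks)]


sp = FakeSequenceParallel(world_size=2)
# Text-only batch with sequence parallel enabled: the guarded path skips the split.
text_only = deepstack_process(sp, [1.0, 2.0], None, [0.5, 0.5])
# Batch with visual tokens: the split still runs and the masked add is applied.
with_image = deepstack_process(sp, [1.0, 2.0], [True, False], [0.5])
```

Calling the same function with `guarded=False` and `visual_pos_masks=None` reproduces the old condition and, under the assumption above, raises inside the stand-in helper, which is the situation the one-line change avoids.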