
Commit 7540743

merge main
2 parents: 2aaa1e5 + 82a2b22

File tree: 17 files changed (+195 −85 lines)


docs/source/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 1 addition & 1 deletion
@@ -275,7 +275,7 @@ Megatron training parameters inherit from the Megatron parameters and the basic parameters (**shared with ms-swift
 - gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example, set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. This parameter only takes effect for `vit_gradient_checkpointing`.
 - 🔥packing: Whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes, higher GPU utilization) and stabilize GPU memory usage, at the cost of extra preprocessing time. Defaults to False. Currently supported for CPT/SFT/DPO/KTO/RM.
 - Note: **Sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
-- Note: **Packing reduces the number of dataset samples; please adjust the gradient accumulation steps and learning rate accordingly**.
+- Note: **Packing reduces the number of dataset samples; please adjust global_batch_size and the learning rate accordingly**.
 - packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
 - packing_num_proc: Number of processes used for packing, default 1. Note that different values of `packing_num_proc` produce different packed datasets. (This parameter has no effect for streaming packing.)
 - streaming: Stream reading and processing of the dataset, default False.

docs/source/Megatron-SWIFT/Quick-start.md

Lines changed: 2 additions & 1 deletion
@@ -27,6 +27,7 @@ pip install --no-build-isolation transformer_engine[pytorch]
 # pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5#egg=transformer_engine[pytorch]
 
 # apex
+# Tip: Megatron-SWIFT can also run without apex; just set `--no_gradient_accumulation_fusion true`.
 git clone https://github.com/NVIDIA/apex
 cd apex
 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
@@ -65,7 +66,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2
 | torch | >=2.0 | 2.7.1/2.8.0 | |
 | transformer_engine | >=2.3 | | |
 | apex | | 0.1 | |
-| megatron_core | | 0.14 | |
+| megatron_core | >=0.12 | 0.14 | |
 | flash_attn | | 2.8.1/3.0.0b1 | |
 | transformers | >=4.33 | 4.57.1 | |
 | modelscope | >=1.23 | | |

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 1 addition & 1 deletion
@@ -293,7 +293,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
 - gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to `None`. This parameter only takes effect when `vit_gradient_checkpointing` is enabled.
 - 🔥packing: Whether to use sequence packing to improve computational efficiency (achieving better load balancing across nodes and processes, and higher GPU utilization), at the cost of additional preprocessing time, while also stabilizing GPU memory usage. Defaults to `False`. Currently supported for CPT, SFT, DPO, KTO and RM.
 - Note: **Sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
-- Note: **Packing reduces the number of samples in the dataset; please adjust the gradient accumulation steps and learning rate accordingly**.
+- Note: **Packing will reduce the number of dataset samples. Please adjust global_batch_size and learning rate accordingly**.
 - packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
 - packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` will result in different packed datasets. (This parameter does not take effect during streaming packing)
 - streaming: Stream data loading and processing, default is False.
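
A minimal sketch of how the updated note might be applied in practice (the model, dataset, and hyperparameter values below are illustrative assumptions, not part of this commit): with `--packing true`, several samples are merged into each packed sequence, so fewer optimizer steps are taken per epoch, and `--global_batch_size` and `--lr` are typically retuned together.

# Illustrative sketch only: enable packing and retune global_batch_size / lr accordingly.
megatron sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'swift/self-cognition#1000' \
    --packing true \
    --packing_length 8192 \
    --micro_batch_size 1 \
    --global_batch_size 8 \
    --lr 1e-5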

docs/source_en/Megatron-SWIFT/Quick-start.md

Lines changed: 2 additions & 1 deletion
@@ -26,6 +26,7 @@ pip install --no-build-isolation transformer_engine[pytorch]
 # pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5#egg=transformer_engine[pytorch]
 
 # apex
+# Note: Megatron-SWIFT can run in environments without apex by setting `--no_gradient_accumulation_fusion true`.
 git clone https://github.com/NVIDIA/apex
 cd apex
 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
@@ -65,7 +66,7 @@ Recommended Operating Environment:
 | torch | >=2.0 | 2.7.1/2.8.0 | |
 | transformer_engine | >=2.3 | | |
 | apex | | 0.1 | |
-| megatron_core | | 0.14 | |
+| megatron_core | >=0.12 | 0.14 | |
 | flash_attn | | 2.8.1/3.0.0b1 | |
 | transformers | >=4.33 | 4.57.1 | |
 | modelscope | >=1.23 | | |
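
To illustrate the new apex note, a hedged sketch of a training command in an apex-free environment (the model and dataset names are placeholders, not from this commit):

# Illustrative sketch only: run Megatron-SWIFT without apex installed.
megatron sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'swift/self-cognition#1000' \
    --no_gradient_accumulation_fusion true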

examples/models/qwen3_next/mcore.sh

Lines changed: 11 additions & 2 deletions
@@ -11,7 +11,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=8 \
 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
 megatron sft \
-    --load Qwen3-Next-80B-A3B-Instruct-mcore \
+    --model Qwen/Qwen3-Next-80B-A3B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
+    --merge_lora false \
     --dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT#2000' \
               'swift/self-cognition#1000' \
     --load_from_cache_file true \
@@ -23,7 +26,7 @@ megatron sft \
     --moe_permute_fusion true \
     --moe_grouped_gemm true \
     --moe_shared_expert_overlap true \
-    --moe_aux_loss_coeff 1e-3 \
+    --moe_aux_loss_coeff 1e-6 \
     --micro_batch_size 2 \
     --global_batch_size 16 \
     --recompute_granularity full \
@@ -47,3 +50,9 @@ megatron sft \
     --attention_backend flash \
     --model_author swift \
     --model_name swift-robot
+
+
+# CUDA_VISIBLE_DEVICES=0,1,2,3 \
+# swift infer \
+#     --adapters megatron_output/Qwen3-Next-80B-A3B-Instruct/vx-xxx/checkpoint-xxx \
+#     --stream true
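
Because the updated script keeps `--merge_lora false` and saves adapters in safetensors format, a standalone merged checkpoint can be produced in a separate step if needed. A hedged sketch using the ms-swift `swift export` CLI (the checkpoint path reuses the placeholder above; this usage is an assumption, not part of this commit):

# Illustrative sketch only: merge the trained LoRA adapter after training.
# swift export \
#     --adapters megatron_output/Qwen3-Next-80B-A3B-Instruct/vx-xxx/checkpoint-xxx \
#     --merge_lora true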

swift/megatron/argument/megatron_args.py

Lines changed: 2 additions & 0 deletions
@@ -638,6 +638,8 @@ def __post_init__(self):
         MegatronTunerMixin.__post_init__(self)
         os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = '1'
         self._set_default()
+        if self.optimizer_cpu_offload:
+            require_version('megatron-core>=0.13')
         self.model_info, self.model_meta = get_model_info_meta(
             self.model, model_type=self.model_type, use_hf=self.use_hf, hub_token=self.hub_token)
         self.model_type = self.model_info.model_type
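
For context, the added check makes runs that enable optimizer CPU offloading fail fast on older Megatron-Core versions. A hedged sketch of such an invocation (model and dataset are placeholders, not from this commit):

# Illustrative sketch only: --optimizer_cpu_offload true now requires megatron-core>=0.13.
megatron sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'swift/self-cognition#1000' \
    --optimizer_cpu_offload true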

swift/megatron/init.py

Lines changed: 14 additions & 10 deletions
@@ -66,7 +66,7 @@ def _patch_mla_attention():
         gather_from_tensor_model_parallel_region,
         scatter_to_sequence_parallel_region,
     )
-    megatron_core_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
+    mcore_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
 
     # Code borrowed from NVIDIA/Megatron-LM
     def forward(
@@ -112,7 +112,7 @@ def forward(
         # Adjust key, value for inference
         # ===================================================
         # rotary_pos_emb = None
-        if megatron_core_013:
+        if mcore_013:
             query, key, value, _, attn_mask_type, _ = self._adjust_key_value_for_inference(
                 inference_context, query, key, value, rotary_pos_emb=None)
         else:
@@ -430,7 +430,7 @@ def _patch_TransformerLayer():
     from megatron.training import get_args
     from megatron.core.transformer import TransformerLayer
     _origin_forward = TransformerLayer.forward
-    megatron_core_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
+    mcore_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
 
     def forward(self, *_args, **kwargs):
         """
@@ -439,7 +439,7 @@ def forward(self, *_args, **kwargs):
         This method calls the core computation of a transformer layer, including
         self-attention, cross-attention (if applicable), and feed-forward operations.
         """
-        if not megatron_core_013:
+        if not mcore_013:
             return _origin_forward(self, *_args, **kwargs)
         hidden_states, context = self._forward_attention(*_args, **kwargs)
         args = get_args()
@@ -551,11 +551,14 @@ def build_train_valid_test_datasets(build_train_valid_test_datasets_provider):
 def _patch_mrope():
     from megatron.core.models.common.embeddings.rotary_pos_embedding import MultimodalRotaryEmbedding
     from megatron.core import parallel_state
+    import megatron.core
     from megatron.core.models.common.embeddings.rope_utils import (get_pos_emb_on_this_cp_rank,
                                                                    _apply_rotary_pos_emb_bshd)
     from megatron.core.models.common.embeddings import rope_utils
     from megatron.training import get_args
 
+    mcore_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
+
     # Code borrowed from huggingface/transformers
     def apply_interleaved_mrope(freqs, mrope_section):
         """Apply interleaved MRoPE to 3D rotary embeddings.
@@ -638,24 +641,25 @@ def _apply_rotary_pos_emb_thd(
         Returns:
             Tensor: Shape [t, h, d]. The input tensor after applying RoPE.
         """
-        use_batched_rope = False
         if cp_group is not None:
             cp_size = cp_group.size()
-            cu_seqlens_for_batched = cu_seqlens // cp_size
-            use_batched_rope = (freqs.dim() >= 1 and freqs.shape[0] == cu_seqlens_for_batched[-1]).item()
+        else:
+            args = get_args()
+            cp_size = args.context_parallel_size
+        cu_seqlens_for_batched = cu_seqlens // cp_size
+        use_batched_rope = (freqs.dim() >= 1 and freqs.shape[0] == cu_seqlens_for_batched[-1]).item()
         if not use_batched_rope:
             logger.warning_once('Using non-batched RoPE, which may affect performance.')
+            kwargs = {'cp_group': cp_group} if mcore_013 else {}
             return _origin_apply_rotary_pos_emb_thd(
                 t,
                 cu_seqlens,
                 freqs,
                 rotary_interleaved=rotary_interleaved,
                 multi_latent_attention=multi_latent_attention,
                 mscale=mscale,
-                cp_group=cp_group,
+                **kwargs,
             )
-        if cp_group is None:
-            raise ValueError('cp_group must be provided for THD format RoPE')
 
         return _apply_rotary_pos_emb_bshd(
             t.unsqueeze(1),

swift/megatron/model/gpt/qwen3_next.py

Lines changed: 45 additions & 25 deletions
@@ -2,6 +2,7 @@
 from copy import deepcopy
 from typing import Optional, Tuple, Union
 
+import megatron.core
 import torch
 from megatron.core.extensions.transformer_engine import TEColumnParallelLinear, TENorm, _get_extra_te_kwargs
 from megatron.core.inference.contexts import BaseInferenceContext
@@ -17,13 +18,15 @@
 from megatron.core.transformer.transformer_layer import get_transformer_layer_offset
 from megatron.core.utils import deprecate_inference_params, is_fa_min_version
 from megatron.training import get_args
+from packaging import version
 
 from swift.llm import ModelType
 from swift.utils import get_logger
 from ..constant import MegatronModelType
 from ..gpt_bridge import GPTBridge
 from ..register import MegatronModelMeta, register_megatron_model
 
+mcore_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
 try:
     from flashattn_hopper.flash_attn_interface import _flash_attn_forward
     from flashattn_hopper.flash_attn_interface import flash_attn_with_kvcache as flash_attn3_with_kvcache
@@ -58,6 +61,7 @@ class Qwen3NextSelfAttention(SelfAttention):
 
     def __init__(self, config: TransformerConfig, submodules: SelfAttentionSubmodules, *args, **kwargs):
         super(SelfAttention, self).__init__(config, submodules, *args, attention_type='self', **kwargs)
+        kwargs = {'tp_group': self.model_comm_pgs.tp} if mcore_013 else {}
         self.linear_qkv = build_module(
             submodules.linear_qkv,
             self.config.hidden_size,
@@ -69,7 +73,7 @@ def __init__(self, config: TransformerConfig, submodules: SelfAttentionSubmodule
             skip_bias_add=False,
             is_expert=False,
             tp_comm_buffer_name='qkv',
-            tp_group=self.model_comm_pgs.tp,
+            **kwargs,
         )
 
         if submodules.q_layernorm is not None:
@@ -130,12 +134,22 @@ def forward(
             (Tuple[Tensor, Tensor]) Attention output and bias.
 
         """
-        from megatron.core.utils import nvtx_range_pop, nvtx_range_push
+        try:
+            from megatron.core.utils import nvtx_range_pop, nvtx_range_push
+        except ImportError:
+
+            def nvtx_range_pop(*args, **kwargs):
+                return
+
+            def nvtx_range_push(*args, **kwargs):
+                return
+
         # Check if we need to skip RoPE
         # no_rope is 0-indexed array and self.layer_number is 1-indexed
-        no_rope = (self.config.no_rope_freq[self.layer_number - 1] if self.config.no_rope_freq else False)
-        if no_rope:
-            rotary_pos_emb = None
+        if hasattr(self.config, 'no_rope_freq'):
+            no_rope = (self.config.no_rope_freq[self.layer_number - 1] if self.config.no_rope_freq else False)
+            if no_rope:
+                rotary_pos_emb = None
 
         inference_context = deprecate_inference_params(inference_context, inference_params)
 
@@ -194,17 +208,20 @@ def forward(
         if (in_decode_mode and self.config.enable_cuda_graph and inference_context.is_static_batching()):
             raise ValueError('CUDA graphs must use flash decode with static batching!')
 
-        query, key, value, rotary_pos_emb, attn_mask_type, block_table = (
-            self._adjust_key_value_for_inference(
-                inference_context,
-                query,
-                key,
-                value,
-                rotary_pos_emb,
-                rotary_pos_cos,
-                rotary_pos_sin,
-                sequence_len_offset,
-            ))
+        result = self._adjust_key_value_for_inference(
+            inference_context,
+            query,
+            key,
+            value,
+            rotary_pos_emb,
+            rotary_pos_cos,
+            rotary_pos_sin,
+            sequence_len_offset,
+        )
+        if mcore_013:
+            query, key, value, rotary_pos_emb, attn_mask_type, block_table = result
+        else:
+            query, key, value, rotary_pos_emb, attn_mask_type = result
 
         if packed_seq_params is not None:
             query = query.squeeze(1)
@@ -215,6 +232,7 @@ def forward(
         # ================================================
         # relative positional embedding (rotary embedding)
        # ================================================
+        kwargs = {'cp_group': self.model_comm_pgs.cp} if mcore_013 else {}
         nvtx_range_push(suffix='rotary_pos_emb')
         if rotary_pos_emb is not None and not self.config.flash_decode:
             q_pos_emb, k_pos_emb = rotary_pos_emb
@@ -239,18 +257,18 @@ def forward(
                     q_pos_emb,
                     config=self.config,
                     cu_seqlens=cu_seqlens_q,
-                    cp_group=self.model_comm_pgs.cp,
+                    **kwargs,
                 )
             else:
                 query = inference_context.apply_rotary_emb_query(query, q_pos_emb, self.config, cu_seqlens_q,
-                                                                 self.model_comm_pgs.cp)
+                                                                 **kwargs)
             if k_pos_emb is not None:
                 key = apply_rotary_pos_emb(
                     key,
                     k_pos_emb,
                     config=self.config,
                     cu_seqlens=cu_seqlens_kv,
-                    cp_group=self.model_comm_pgs.cp,
+                    **kwargs,
                 )
 
             # TODO, can apply positional embedding to value_layer so it has
@@ -418,16 +436,17 @@ def forward(self, hidden_states: torch.Tensor, **kwargs):
 
 
 def get_local_layer_specs(config, layer_specs, vp_stage=None):
-    from megatron.core.transformer.enums import LayerType
-    num_layers_to_build = get_num_layers_to_build(config, vp_stage=vp_stage)
+    kwargs = {'vp_stage': vp_stage} if mcore_013 else {}
+    num_layers_to_build = get_num_layers_to_build(config, **kwargs)
 
-    if config.pipeline_model_parallel_layout is not None:
+    if getattr(config, 'pipeline_model_parallel_layout', None) is not None:
+        from megatron.core.transformer.enums import LayerType
         local_layer_specs = [
             layer_specs[layer_id] for layer_id in config.pipeline_model_parallel_layout.get_layer_id_list(
-                layer_type=LayerType.decoder, vp_stage=vp_stage)
+                layer_type=LayerType.decoder, **kwargs)
         ]
     else:
-        offset = get_transformer_layer_offset(config, vp_stage=vp_stage)
+        offset = get_transformer_layer_offset(config, **kwargs)
         local_layer_specs = layer_specs[offset:offset + num_layers_to_build]
     return local_layer_specs
 
@@ -446,13 +465,14 @@ def get_qwen3_next_transformer_layer_spec(config, vp_stage=None):
     config.linear_conv_kernel_dim = args.linear_conv_kernel_dim
 
     layer_norm_impl = TENorm
+    kwargs = {'use_kitchen': config.use_kitchen} if mcore_013 else {}
     moe_layer_spec = get_gpt_layer_with_transformer_engine_spec(
         num_experts=config.num_moe_experts,
         moe_grouped_gemm=config.moe_grouped_gemm,
        qk_layernorm=config.qk_layernorm,
         multi_latent_attention=config.multi_latent_attention,
         moe_use_legacy_grouped_gemm=config.moe_use_legacy_grouped_gemm,
-        use_kitchen=config.use_kitchen,
+        **kwargs,
     )
     layer_specs = []
     for layer_type in args.layer_types:
