[QUESTION]I cannot figure out TE(transformer_engine) #1077
Replies: 12 comments 2 replies
-
You can set the --transformer-impl local flag. There is a --transformer-impl argument defined in megatron/training/arguments.py (line 707, commit b76a7d3).
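For example, something like this in your launch command (just a sketch; the other arguments are placeholders for a typical config, not a complete argument list):

```bash
# Pass --transformer-impl local to fall back to Megatron's local (non-TE)
# transformer implementation. Keep whatever other arguments your script uses;
# the ones below are only illustrative.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --transformer-impl local \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16
```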
-
Thanks!
But it still fails:
-
In your arguments.py, is there a --transformer-impl argument?
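A quick way to check (a sketch, assuming you are in the repository root):

```bash
# Print any lines defining or mentioning the flag, with line numbers.
grep -n "transformer-impl" megatron/training/arguments.py
```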
-
Yes.
-
I have installed:
-
Is Apex needed?
-
I think it's not necessary.
-
OK.
-
You'd better use the PyTorch NGC container as your environment (it already contains TE, Apex, and so on), though I can confirm there are some bugs with Megatron here.
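For example, something along these lines (the image tag is only a placeholder; pick a recent PyTorch NGC release):

```bash
# Start an interactive NGC PyTorch container with GPU access and mount the
# local Megatron-LM checkout. The tag 24.04-py3 is just an example.
docker run --gpus all -it --rm \
    -v "$PWD":/workspace/Megatron-LM \
    nvcr.io/nvidia/pytorch:24.04-py3
```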
-
OK. Thanks.
-
Agreed.
-
After I install Apex, the error changes to:
-
I have not installed NVIDIA transformer_engine. I tried running some example scripts (e.g. train_mixtral_8x7b_distributed.sh) but they fail with the TE error:
I have read Megatron-LM/megatron/core/models/gpt/gpt_layer_specs.py and found that HAVE_TE is set to False:
But that doesn't help, because the code checks:
use_te = args.transformer_impl == "transformer_engine"
or
if args.transformer_impl == "transformer_engine":
I don't know how to set args.transformer_impl or what values it accepts.
Or must Megatron-LM be run with NVIDIA transformer_engine?
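For what it's worth, a quick sanity check of whether Transformer Engine is importable in the current environment (just a sketch):

```bash
# Succeeds silently if Transformer Engine is installed; otherwise it raises
# ModuleNotFoundError, in which case either install TE or pass
# --transformer-impl local.
python -c "import transformer_engine.pytorch"
```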