[QUESTION]I cannot figure out TE(transformer_engine) #1077
Replies: 12 comments 2 replies
-
You can set the --transformer-impl local flag. There is a --transformer-impl argument defined in megatron/training/arguments.py (line 707, commit b76a7d3).
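For example, something like this in your launch command (just a sketch; the other arguments are placeholders for a typical config, not a complete argument list):

```bash
# Pass --transformer-impl local to fall back to Megatron's local (non-TE)
# transformer implementation. Keep whatever other arguments your script uses;
# the ones below are only illustrative.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --transformer-impl local \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16
```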
-
Thanks!
But it still fails:
-
In your arguments.py, is there a --transformer-impl argument?
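A quick way to check (a sketch, assuming you are in the repository root):

```bash
# Print any lines defining or mentioning the flag, with line numbers.
grep -n "transformer-impl" megatron/training/arguments.py
```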
-
Yes.
-
I have installed:
-
Is Apex needed?
-
I think it's not necessary.
-
OK.
-
You'd better use the PyTorch NGC container as your environment (it already contains TE, Apex, and so on), though I can confirm there are some bugs with Megatron here.
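For example, something along these lines (the image tag is only a placeholder; pick a recent PyTorch NGC release):

```bash
# Start an interactive NGC PyTorch container with GPU access and mount the
# local Megatron-LM checkout. The tag 24.04-py3 is just an example.
docker run --gpus all -it --rm \
    -v "$PWD":/workspace/Megatron-LM \
    nvcr.io/nvidia/pytorch:24.04-py3
```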
-
OK. Thanks.
-
Agreed.
-
After I install Apex, the error changes to:
-
I have not installed NVIDIA transformer_engine. I tried running some example scripts (e.g. train_mixtral_8x7b_distributed.sh) but they fail with the TE error:
I have read Megatron-LM/megatron/core/models/gpt/gpt_layer_specs.py and found that HAVE_TE is set to False:
But that doesn't help, because the code checks:
use_te = args.transformer_impl == "transformer_engine"
or
if args.transformer_impl == "transformer_engine":
I don't know how to set args.transformer_impl or what values it accepts.
Or must Megatron-LM be run with NVIDIA transformer_engine?
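For what it's worth, a quick sanity check of whether Transformer Engine is importable in the current environment (just a sketch):

```bash
# Succeeds silently if Transformer Engine is installed; otherwise it raises
# ModuleNotFoundError, in which case either install TE or pass
# --transformer-impl local.
python -c "import transformer_engine.pytorch"
```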