I am trying to train an MoE model from the moe branch on the provided wikitext-103 dataset. The example training script for MoE models does not generate the "checkpoint_best.pt" or "checkpoint_last.pt" files that the evaluation script needs.
I believe these files were missing because of the "--disable-validation" flag in the default script, since the best checkpoint is selected by validation score.
I removed that flag and trained an MoE model with the command listed below.
The model trains to completion without issues, but the validation pass at the end fails with this error:
Traceback (most recent call last):
  File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/meenakshi/MoE/fairseq/fairseq/distributed/utils.py", line 335, in distributed_main
    main(cfg, **kwargs)
  File "/data/meenakshi/MoE/fairseq/fairseq_cli/train.py", line 191, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/meenakshi/MoE/fairseq/fairseq_cli/train.py", line 318, in train
    valid_losses, should_stop = validate_and_save(
  File "/data/meenakshi/MoE/fairseq/fairseq_cli/train.py", line 403, in validate_and_save
    valid_losses = validate(cfg, trainer, task, epoch_itr, valid_subsets)
  File "/data/meenakshi/MoE/fairseq/fairseq_cli/train.py", line 497, in validate
    trainer.valid_step(sample)
  File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/meenakshi/MoE/fairseq/fairseq/trainer.py", line 1031, in valid_step
    assert sample['net_input']['src_tokens'].shape[1] == fixed_src_seq_length,
AssertionError: got src_seq_length 459, expected 512
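The 459-token batch looks consistent with how "--sample-break-mode none" slices the corpus: the valid split's token stream is cut into contiguous blocks of "--tokens-per-sample" tokens, and the final block keeps whatever remainder is left. A minimal sketch (the token count here is made up for illustration, not the real wikitext-103 valid-split size):

```python
# Sketch: with --sample-break-mode none, a token stream is chopped into
# contiguous blocks of tokens-per-sample; the last block holds the remainder.
TOKENS_PER_SAMPLE = 512

def block_lengths(total_tokens: int, block_size: int = TOKENS_PER_SAMPLE) -> list:
    """Lengths of the blocks a stream of total_tokens is split into."""
    n_full, remainder = divmod(total_tokens, block_size)
    return [block_size] * n_full + ([remainder] if remainder else [])

# A hypothetical valid split whose stream ends in a 459-token remainder,
# matching the assertion message above:
lengths = block_lengths(123 * 512 + 459)
print(lengths[-1])  # → 459
```

If that is right, any "--tokens-per-sample" value that does not evenly divide the valid split leaves a short final batch, which would explain why changing the value did not help.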
Things I tried:
- This error does not occur when training a base transformer model. When I reuse all of the parameters/flags from the example transformer training script and add only the MoE flags, the error appears.
- I tried several values for "--tokens-per-sample" (2048, 1024, and 512); the error is the same each time.
- I wrote a padding script that appends pad tokens to each sample until it reaches the "--tokens-per-sample" value. Either the script was flawed or the approach simply doesn't work: I got "exploding loss" errors once training started.
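For completeness, the padding idea was roughly the following (a sketch, not my exact script; pad index 1 is fairseq's default but depends on your dictionary):

```python
PAD_IDX = 1      # assumption: fairseq's default <pad> index; check your dict.txt
FIXED_LEN = 512  # must match --tokens-per-sample

def pad_block(token_ids, fixed_len=FIXED_LEN, pad_idx=PAD_IDX):
    """Right-pad one block of token ids to exactly fixed_len."""
    if len(token_ids) > fixed_len:
        raise ValueError("block longer than fixed_len")
    return list(token_ids) + [pad_idx] * (fixed_len - len(token_ids))

padded = pad_block(list(range(459)))
print(len(padded))  # → 512
```

Note this only fixes the tensor shape; if the pad positions (and the targets shifted onto them) are not excluded from the loss, those positions contribute garbage gradients, which may be why I saw the loss explode.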
Environment details:
- fairseq version: moe branch
- PyTorch version: 2.0.1
- OS: Linux
- How fairseq was installed: from source
- Build command: pip install --editable ./
- Python version: 3.9.19
- CUDA/cuDNN version: 3.7
- GPU models and configuration: 8x Tesla P100-PCIE-16GB
For reference, the full training command:
fairseq-train --task language_modeling \
    data-bin/wikitext-103 \
    --save-dir checkpoints/moe_wikitext-103-best \
    --tokens-per-sample 512 \
    --ddp-backend fully_sharded --memory-efficient-fp16 --checkpoint-activations \
    --arch transformer_lm --share-decoder-input-output-embed \
    --decoder-layers 24 --decoder-embed-dim 1024 --decoder-ffn-embed-dim 4096 \
    --decoder-attention-heads 16 \
    --moe-expert-count 8 --moe-freq 2 \
    --moe-gating-use-fp32 --moe-second-expert-policy all \
    --moe-normalize-expert-grad sqrt_world_size \
    --moe-eval-capacity-token-fraction -1.0 \
    --max-sentences-valid 1 --num-workers-valid 0 \
    --criterion moe_cross_entropy --moe-gate-loss-wt 0.01 \
    --moe-gate-loss-combine-method sum \
    --optimizer adam --fp16 --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --warmup-updates 750 \
    --sample-break-mode none --dropout 0.2 --attention-dropout 0.2 \
    --batch-size 2 --update-freq 2 \
    --max-update 250 --log-format json --log-interval 10