The more I pretrain (SSL), the worse fine-tuned model gets? #9175

riqiang-dp · 2024-05-13T01:31:06Z

riqiang-dp
May 13, 2024

Hi, I'm trying out SSL pretraining for ASR. I have about 50k hours of unlabeled data and 2k hours of transcribed data. With 100k hours of data, first of all I couldn't get the default dataloader to work because CPU runs out of memory within one epoch. Then I train with only contrastive loss, but after about 50k steps SSL contrastive loss kind of starts to stagnate.

I use two checkpoints one from around 40k steps and another from the end of 80k steps to finetune with labeled ASR data. The checkpoint from 80k steps converges much slower.

It seems that the more I pretrain, the worse pretrained model I get. Is it expected? What could I be doing wrong? Furthermore, this is not an isolated case, I've tried other combinations of parameters / models e.g. Conformer, Fast-conformer, and tried smaller dataset for pretraining and with smaller data the pretrained model seems to be better as well.

Answered by nithinraok

May 30, 2024

I am not sure if you are using conformer or fastconformer and the model size. These factors affect the training speed. @pzelasko could you validate lhotse arguments.

I could confirm that for non causal models pretraining definitely helps. With causal models we haven;t experimented much and I am currently training some very large FastConformer models, so I would only know sooner when the training gets finished. But based on my experiments, pretraining will always help for stable training, improved performance and faster convergence.

View full answer

MostafaAhmed98 · 2024-05-22T19:47:21Z

MostafaAhmed98
May 22, 2024

Hello, could you tried to low your learning rate or used another optimizers like AdamW ?

3 replies

riqiang-dp May 24, 2024
Author

The weird thing is, as I lower my learning rate, the loss curve looks like it's overfitting even more quickly. In terms of optimizer I have been using AdamW

MostafaAhmed98 May 25, 2024

what kind of data augmentation you are using ?

riqiang-dp May 27, 2024
Author

I'm using Lhotse dataloader because the default dataloader makes the CPU run out of memory within one epoch. And I'm using min duration of 4, plus

++model.train_ds.cut_into_windows_duration=8 \
++model.train_ds.cut_into_windows_hop=6 \

to constraint the duration to 4-8 secs per utterance.
Otherwise the masaked patch augmentation like the default config for SSL:

  spec_augment:
    _target_: nemo.collections.asr.modules.MaskedPatchAugmentation
    freq_masks: 3
    freq_width: 20
    patch_size: 48
    mask_patches: 0.5

titu1994 · 2024-05-22T20:50:32Z

titu1994
May 22, 2024
Maintainer

@nithinraok can you comment

2 replies

nithinraok May 25, 2024
Maintainer

@riqiang-dp could you please share your training script.

riqiang-dp May 27, 2024
Author

Can't share the exact script but I'm basically using the defaut SSL config and script for conformer, but modified the minimal part to make it a causal cache-aware model with multiple context sizes, and using lhotse data loader. Some specific configs relating to SSL I shared in the other comment. I'm just wondering if the default config in the examples folder has been shown to improve ASR performance? (WER) And if so what kind of data is expected to be used.

riqiang-dp · 2024-05-27T23:13:01Z

riqiang-dp
May 27, 2024
Author

Another problem I ran into is that with the same data sampling and setting, the training phase runs fine but validation fails at the contrastive loss calculation:

NeMo/nemo/collections/asr/losses/ssl_losses/contrastive.py", line 190, in forward
    out_masked_only = out_masked_only.reshape(bs, -1, out_masked_only.shape[-1])
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[20, -1, 128]' is invalid for input of size 175360

2 replies

nithinraok May 30, 2024
Maintainer

I haven;t seen this issue, however I noticed similar issue with contrastive loss and provided a fix for it here: #9259

riqiang-dp May 30, 2024
Author

Thanks! I'll check it out. I got around this by using lhotse

riqiang-dp · 2024-05-29T22:02:35Z

riqiang-dp
May 29, 2024
Author

Lastest status is that I trained 120k steps which took several days and took the 80k step checkpoint and 120k step checkpoint to finetune on ASR data, vs training an ASR model from scratch, and although the pretrained checkpoints are faster to converge, they slow down and get surpassed by the model trained from scratch. Nevertheless the 120k step model is better than 80k. See this plot of WER curve:

So I wouldn't say the more I train the worse weights it gets, but still if it's worse than training from random weights it defeats the purpose.

Some more details of this pretraining run:

    python3 speech_pre_training.py \
        --config-name=conformer_small_streaming_ssl \
        model.train_ds.manifest_filepath=data/train.json \
        model.train_ds.max_duration=4 \
        model.train_ds.min_duration=45 \
        model.train_ds.shuffle=true \
        model.train_ds.shuffle_n=2048 \
        model.train_ds.is_tarred=False \
        model.train_ds.num_workers=8 \
        model.train_ds.batch_size=120 \
        model.train_ds.pin_memory=True \
        model.validation_ds.manifest_filepath="[${dev_manifests}]" \
        model.validation_ds.batch_size=20 \
        ++model.validation_ds.min_duration=4 \
        ++model.validation_ds.max_duration=30 \
        model.validation_ds.num_workers=$nj \
        model.optim.lr=${lr} \
        model.optim.sched.warmup_steps=${warmup} \
        ++model.train_ds.use_lhotse=True \
        ++model.train_ds.batch_duration=1000 \
        ++model.train_ds.quadratic_duration=30 \
        ++model.train_ds.num_buckets=30 \
        ++model.train_ds.bucket_buffer_size=10000 \
        ++model.train_ds.shuffle_buffer_size=10000 \
        ++model.train_ds.use_bucketing=True \
        ++model.train_ds.num_cuts_for_bins_estimate=500 \
        ++model.train_ds.cut_into_windows_duration=8 \
        ++model.train_ds.cut_into_windows_hop=6 \
        ++model.loss_list.contrastive.loss.num_negatives=50 \
        ++trainer.use_distributed_sampler=false \
        ++trainer.limit_train_batches=1000 \
        trainer.val_check_interval=1000 \
        trainer.max_steps=300000 \
        trainer.max_epochs=120 \
        trainer.precision=bf16-mixed

  optim:
    name: adamw
    lr: 4
    betas:
    - 0.9
    - 0.98
    weight_decay: 0.0
    sched:
      name: NoamAnnealing
      d_model: 176
      warmup_steps: 40000
      warmup_ratio: null
      min_lr: 1.0e-06

There's also quite a bit of oscillation in the validation loss:

(weird thing is this is also on an out of domain validation set but with more uniform duration, people's speech duration is around 15s per utterance. With in domain validation set there's even more oscillation from my other training runs.

I guess my question is: is pretraining with this contrastive loss supposed to take a long time to converge to a useful checkpoint? (From literature I've seen people pretrain for relatively short duration. Should I further lower LR / increase warmup to stabilize training?

5 replies

nithinraok May 30, 2024
Maintainer

I am not sure if you are using conformer or fastconformer and the model size. These factors affect the training speed. @pzelasko could you validate lhotse arguments.

I could confirm that for non causal models pretraining definitely helps. With causal models we haven;t experimented much and I am currently training some very large FastConformer models, so I would only know sooner when the training gets finished. But based on my experiments, pretraining will always help for stable training, improved performance and faster convergence.

Answer selected by riqiang-dp

itzsimpl May 30, 2024

@riqiang-dp I noticed this ... it seems you may have min and max duration for train_ds mixed up

        model.train_ds.max_duration=4 \
        model.train_ds.min_duration=45 \
        ...
        ++model.validation_ds.min_duration=4 \
        ++model.validation_ds.max_duration=30 \

riqiang-dp May 30, 2024
Author

@itzsimpl Ah sorry it's actually a typo here since I needed to fill out the variables with actual numbers..

riqiang-dp May 30, 2024
Author

I am not sure if you are using conformer or fastconformer and the model size. These factors affect the training speed. @pzelasko could you validate lhotse arguments.

I could confirm that for non causal models pretraining definitely helps. With causal models we haven;t experimented much and I am currently training some very large FastConformer models, so I would only know sooner when the training gets finished. But based on my experiments, pretraining will always help for stable training, improved performance and faster convergence.

Thanks for the reply. I'm using the small conformer here but also experimenting with fastconformer and different sizes. I do notice it stabilizes training from the more recent training runs. However the results I got from large fastconformer wasn't that good for some reason I'm yet to figure out. I've switched to experimenting with the smaller model so that I can iterate faster for now.

How many steps do I need to typically pretrain using SSL to get good downstream performance?

pzelasko Jun 5, 2024
Collaborator

Regarding lhotse arguments, I have a few suggestions.

        model.train_ds.batch_size=120 \
        ++model.train_ds.batch_duration=1000 \

Specifying both batch size and batch duration is generally not recommended (unless you have a good reason to do this): in this mode you have dynamic batch sizes that are capped by batch_size. Together with bucketing you'd generally under-utilize the GPU for shorter sequence lengths.

        ++model.train_ds.cut_into_windows_duration=8 \
        ++model.train_ds.cut_into_windows_hop=6 \
        ++model.train_ds.use_bucketing=True \

With cut_into_windows, you are converting long utterances to fixed-size chunks (+ left-overs). Since your examples shape is going to be mostly fixed, you could turn off bucketing and batch_duration and just use a static batch size instead. Mind that the left-overs may be shorter (for 15s utterance it's 8 + 8 + 5 seconds). If your intention was to handle that using bucketing I would suggest decreasing num buckets to sth much smaller such as 5 as most data will go into 8s bucket.

        ++model.train_ds.num_cuts_for_bins_estimate=500 \

Your bucket estimate would be sub-optimal with 500 examples; you can run scripts/speech_recognition/estimate_duration_bins.py to get optimal duration bins for your data (mind that this is pre-cutting into windows; see my point above).

nithinraok · 2024-06-04T15:19:04Z

nithinraok
Jun 4, 2024
Maintainer

To leverage the benefit of SSL, you should use atleast XL size model with >300M parameters.
See sizes here:

NeMo/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml

Line 31 in df49143

# | XLarge (616M)| 1024 | 8 | 24 | 9 | 1e-3 | 640 | 2 | True |

1 reply

nithinraok Jun 4, 2024
Maintainer

SSL show maximum and clear benefits of faster convergence, best accuracy and stable training only when you combine large datasets with bigger models. So I advice you increase pretraining model size, you already seem to have very good amount of pretraining data (50k) hrs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The more I pretrain (SSL), the worse fine-tuned model gets? #9175

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 13 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

The more I pretrain (SSL), the worse fine-tuned model gets? #9175

riqiang-dp May 13, 2024

Replies: 5 comments · 13 replies

MostafaAhmed98 May 22, 2024

riqiang-dp May 24, 2024 Author

MostafaAhmed98 May 25, 2024

riqiang-dp May 27, 2024 Author

titu1994 May 22, 2024 Maintainer

nithinraok May 25, 2024 Maintainer

riqiang-dp May 27, 2024 Author

riqiang-dp May 27, 2024 Author

nithinraok May 30, 2024 Maintainer

riqiang-dp May 30, 2024 Author

riqiang-dp May 29, 2024 Author

nithinraok May 30, 2024 Maintainer

itzsimpl May 30, 2024

riqiang-dp May 30, 2024 Author

riqiang-dp May 30, 2024 Author

pzelasko Jun 5, 2024 Collaborator

nithinraok Jun 4, 2024 Maintainer

nithinraok Jun 4, 2024 Maintainer

riqiang-dp
May 13, 2024

Replies: 5 comments 13 replies

MostafaAhmed98
May 22, 2024

riqiang-dp May 24, 2024
Author

riqiang-dp May 27, 2024
Author

titu1994
May 22, 2024
Maintainer

nithinraok May 25, 2024
Maintainer

riqiang-dp May 27, 2024
Author

riqiang-dp
May 27, 2024
Author

nithinraok May 30, 2024
Maintainer

riqiang-dp May 30, 2024
Author

riqiang-dp
May 29, 2024
Author

nithinraok May 30, 2024
Maintainer

riqiang-dp May 30, 2024
Author

riqiang-dp May 30, 2024
Author

pzelasko Jun 5, 2024
Collaborator

nithinraok
Jun 4, 2024
Maintainer

nithinraok Jun 4, 2024
Maintainer