
Extend multimodal/speech_llm with lhotse, t5 and bestow supports #9169

Open · wants to merge 454 commits into base: main

Conversation

zhehuaichen (Collaborator)

What does this PR do?

In multimodal/speech_llm, add lhotse dataloader support and two models, SALM-T5 and Bestow-GPT. Include example configs.

Main features under speech_llm:

  • Lhotse dataloader support for speech SFT in speech_llm
  • SALM-style architecture with T5 LLM backbone
  • Bestow-style architecture (cross-attention based) with GPT LLM backbone (see the sketch after this list contrasting the two conditioning styles)

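To make the difference between the two architectures concrete, here is a minimal, illustrative-only sketch of the two conditioning styles. The tensor shapes, variable names, and the use of nn.MultiheadAttention below are assumptions for illustration, not code from this PR.

import torch
import torch.nn as nn

# Illustrative sketch only; shapes and modules are assumptions, not this PR's classes.
hidden = 512
speech_emb = torch.randn(2, 50, hidden)  # speech encoder output: (batch, audio_frames, hidden)
text_emb = torch.randn(2, 20, hidden)    # LLM token embeddings: (batch, text_len, hidden)

# SALM-style: speech embeddings are concatenated into the LLM input sequence,
# so the T5 backbone attends to them with its ordinary self-attention.
salm_input = torch.cat([speech_emb, text_emb], dim=1)  # (batch, 70, hidden)

# Bestow-style: the LLM input stays text-only; speech is injected through
# cross-attention, with text states as queries and speech states as keys/values.
cross_attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=8, batch_first=True)
bestow_hidden, _ = cross_attn(query=text_emb, key=speech_emb, value=speech_emb)  # (batch, 20, hidden)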
Minor edits in the nlp collection:

  • megatron_base_model.py: handle the case where tokenizer.type is not set
  • megatron_lm_encoder_decoder_model.py: handle the case where encoder_input is used
  • megatron_base_prompt_learning_model.py: group the LLM init code under an init_model function (following the pattern from megatron_gpt_prompt_learning_model.py) so that it can be overridden by subclasses when needed
  • megatron/utils.py: in gradient accumulation, handle the case where the batch size from dynamic bucketing does not split evenly across micro-batches. This happens when using the lhotse dataloader with batch_duration (see the sketch after this list)

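The divisibility issue in the last bullet comes from batch_duration-based dynamic bucketing: each step can yield a different number of examples, so the batch does not always split evenly into micro-batches. The helper below is a hypothetical sketch of one way to handle it (truncating the remainder); it is not the actual change in megatron/utils.py.

import torch

def split_into_micro_batches(batch: torch.Tensor, micro_batch_size: int):
    # Hypothetical helper: with lhotse dynamic bucketing (batch_duration), the number
    # of examples per step varies and may not divide evenly into micro-batches.
    # Here we simply drop the trailing remainder; the PR may instead pad or adjust
    # the number of micro-batches.
    n = batch.size(0)
    usable = (n // micro_batch_size) * micro_batch_size
    if usable == 0:
        return [batch]  # smaller than one micro-batch: keep it whole
    return list(torch.split(batch[:usable], micro_batch_size))

# Example: dynamic bucketing yielded 13 examples with micro_batch_size=4 -> 3 micro-batches of 4.
micro_batches = split_into_micro_batches(torch.randn(13, 80), micro_batch_size=4)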
Collection: [common,nlp,multimodal]

PR Type:

  • New Feature
  • Bugfix
  • Documentation

zhehuaichen and others added 30 commits November 26, 2023 10:01

github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

zhehuaichen marked this pull request as ready for review on May 11, 2024 04:03
vectors = collate_vectors_lhotse(items, padding_value=padding_value)
if max_length > vectors.size(1):
    vectors = torch.cat(
        # pad along the length dimension up to max_length (closing lines restored from the truncated snippet)
        [vectors, padding_value * torch.ones(vectors.size(0), max_length - vectors.size(1), dtype=vectors.dtype)],
        dim=-1,
    )
Collaborator

why do we need to enforce a static shape with padding for every example here?

Collaborator Author

To be consistent with the behavior of the megatron dataloader and AudioTextDataset in nemo/collections/multimodal/speech_llm/data/audio_text_dataset.py.

return (n + m - 1) // m * m


class TextProcessing:
pzelasko (Collaborator) commented on May 15, 2024

This class needs more documentation on what it is doing, how to use its API, and what the expected input and output formats are. Also, it only has private methods right now; the main API method should be public (no underscore at the beginning).

I'd expect a docstring of the kind: this class is used to convert X to Y; in order to do so, it performs A, B, C, and D; the expected format of X is ...; the expected format of Y is ...

Since it's used to convert text to prompts to token IDs, I'd like to see full documentation of the prompt template/schema.

The options to init also need documentation; if some are unused/unnecessary, they may be removed.
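A hedged sketch of the kind of docstring being requested; the prompt template, field names, and return keys below are illustrative assumptions, not the actual schema implemented in this PR.

class TextProcessing:
    """Convert a text example (context + answer) into token IDs for speech/LLM SFT.

    Illustrative sketch only; the real prompt schema lives in this PR's dataset code.
    Given a manifest entry such as {"context": ..., "answer": ...}, the class:
      1. fills a prompt template, e.g. "Q: {context}\nA: {answer}",
      2. tokenizes the filled prompt with the model tokenizer,
      3. builds input ids, answer-only label ids, and a loss mask.

    Expected output: a dict like {"input_ids": List[int], "answer_ids": List[int], "mask": List[bool]}.
    """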

Collaborator Author

Done. Kept the private function to follow the interface of _process_example from nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py.

stevehuang52 (Collaborator) left a comment

Thanks for the great work, please address the CodeQL issues and see the minor comments.

return (n + m - 1) // m * m


class TextProcessing:
Collaborator

Is this a copy or a modified version of the TextProcessing class in `audio_text_dataset`? If it's a modified version, we should inherit from the parent class and only override the necessary functions.

Collaborator Author

I am not sure whether making the lhotse dataloader part inherit from the NeMo megatron dataloader is a good idea for the future, since we have decided to move away from the NeMo one at some point. Added a docstring to the class. Hope it helps.
