
Extend multimodal/speech_llm with lhotse, t5 and bestow supports #9169

Open · wants to merge 454 commits into base: main

Conversation

zhehuaichen (Collaborator)

What does this PR do?

In multimodal/speech_llm, add lhotse dataloader support and two models, SALM-T5 and Bestow-GPT. Include example configs.

Main features under speech_llm:

  • Lhotse dataloader support for speech SFT in speech_llm
  • SALM-style architecture with T5 LLM backbone
  • Bestow-style architecture (cross-attention based) with GPT LLM backbone (see the sketch after this list contrasting the two conditioning styles)

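To make the difference between the two architectures concrete, here is a minimal, illustrative-only sketch of the two conditioning styles. The tensor shapes, variable names, and the use of nn.MultiheadAttention below are assumptions for illustration, not code from this PR.

import torch
import torch.nn as nn

# Illustrative sketch only; shapes and modules are assumptions, not this PR's classes.
hidden = 512
speech_emb = torch.randn(2, 50, hidden)  # speech encoder output: (batch, audio_frames, hidden)
text_emb = torch.randn(2, 20, hidden)    # LLM token embeddings: (batch, text_len, hidden)

# SALM-style: speech embeddings are concatenated into the LLM input sequence,
# so the T5 backbone attends to them with its ordinary self-attention.
salm_input = torch.cat([speech_emb, text_emb], dim=1)  # (batch, 70, hidden)

# Bestow-style: the LLM input stays text-only; speech is injected through
# cross-attention, with text states as queries and speech states as keys/values.
cross_attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=8, batch_first=True)
bestow_hidden, _ = cross_attn(query=text_emb, key=speech_emb, value=speech_emb)  # (batch, 20, hidden)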
Minor edits in the nlp collection:

  • megatron_base_model.py: handle the case where tokenizer.type is not set
  • megatron_lm_encoder_decoder_model.py: handle the case where encoder_input is used
  • megatron_base_prompt_learning_model.py: group the LLM init code under an init_model function (following the pattern from megatron_gpt_prompt_learning_model.py) so that it can be overridden by subclasses when needed
  • megatron/utils.py: in gradient accumulation, handle the case where the batch size from dynamic bucketing does not split evenly across micro-batches. This happens when using the lhotse dataloader with batch_duration (see the sketch after this list)

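The divisibility issue in the last bullet comes from batch_duration-based dynamic bucketing: each step can yield a different number of examples, so the batch does not always split evenly into micro-batches. The helper below is a hypothetical sketch of one way to handle it (truncating the remainder); it is not the actual change in megatron/utils.py.

import torch

def split_into_micro_batches(batch: torch.Tensor, micro_batch_size: int):
    # Hypothetical helper: with lhotse dynamic bucketing (batch_duration), the number
    # of examples per step varies and may not divide evenly into micro-batches.
    # Here we simply drop the trailing remainder; the PR may instead pad or adjust
    # the number of micro-batches.
    n = batch.size(0)
    usable = (n // micro_batch_size) * micro_batch_size
    if usable == 0:
        return [batch]  # smaller than one micro-batch: keep it whole
    return list(torch.split(batch[:usable], micro_batch_size))

# Example: dynamic bucketing yielded 13 examples with micro_batch_size=4 -> 3 micro-batches of 4.
micro_batches = split_into_micro_batches(torch.randn(13, 80), micro_batch_size=4)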
Collection: [common,nlp,multimodal]

PR Type:

  • New Feature
  • Bugfix
  • Documentation

zhehuaichen and others added 30 commits November 26, 2023 10:01

github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

zhehuaichen marked this pull request as ready for review on May 11, 2024 04:03
vectors = collate_vectors_lhotse(items, padding_value=padding_value)
if max_length > vectors.size(1):
    vectors = torch.cat(
        # pad along the length dimension up to max_length (closing lines restored from the truncated snippet)
        [vectors, padding_value * torch.ones(vectors.size(0), max_length - vectors.size(1), dtype=vectors.dtype)],
        dim=-1,
    )
Collaborator

why do we need to enforce a static shape with padding for every example here?

Collaborator Author

To be consistent with the behavior of the megatron dataloader and AudioTextDataset in nemo/collections/multimodal/speech_llm/data/audio_text_dataset.py.

return (n + m - 1) // m * m


class TextProcessing:
pzelasko (Collaborator) commented on May 15, 2024

This class needs more documentation on what it is doing, how to use its API, and what the expected input and output formats are. Also, it only has private methods right now; the main API method should be public (no underscore at the beginning).

I'd expect a docstring of the kind: this class is used to convert X to Y; in order to do so, it performs A, B, C, and D; the expected format of X is ...; the expected format of Y is ...

Since it's used to convert text to prompts to token IDs, I'd like to see full documentation of the prompt template/schema.

The options to init also need documentation; if some are unused/unnecessary, they may be removed.
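A hedged sketch of the kind of docstring being requested; the prompt template, field names, and return keys below are illustrative assumptions, not the actual schema implemented in this PR.

class TextProcessing:
    """Convert a text example (context + answer) into token IDs for speech/LLM SFT.

    Illustrative sketch only; the real prompt schema lives in this PR's dataset code.
    Given a manifest entry such as {"context": ..., "answer": ...}, the class:
      1. fills a prompt template, e.g. "Q: {context}\nA: {answer}",
      2. tokenizes the filled prompt with the model tokenizer,
      3. builds input ids, answer-only label ids, and a loss mask.

    Expected output: a dict like {"input_ids": List[int], "answer_ids": List[int], "mask": List[bool]}.
    """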

Collaborator Author

Done. Kept the private function to follow the interface of _process_example from nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py.

stevehuang52 (Collaborator) left a comment

Thanks for the great work, please address the CodeQL issues and see the minor comments.

return (n + m - 1) // m * m


class TextProcessing:
Collaborator

Is this a copy or a modified version of the TextProcessing class in `audio_text_dataset`? If it's a modified version, we should inherit from the parent class and only override the necessary functions.

Collaborator Author

I am not sure whether making the lhotse dataloader part inherit from the NeMo megatron dataloader is a good idea for the future, since we have decided to move away from the NeMo one at some point. Added a docstring to the class. Hope it helps.
