Extend multimodal/speech_llm with lhotse, t5 and bestow supports #9169
base: main
Conversation
Signed-off-by: zhehuaichen <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
for more information, see https://pre-commit.ci
… feature/lhotse-integration
…sko/nemo into feature/lhotse-integration
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
examples/multimodal/speech_llm/conf/modular_audio_gpt_config_cross_llama_lhotse.yaml (outdated, resolved)
examples/multimodal/speech_llm/conf/modular_audio_gpt_config_llama_lhotse.yaml (outdated, resolved)
examples/multimodal/speech_llm/conf/modular_audio_t5_config.yaml (outdated, resolved)
nemo/collections/multimodal/speech_llm/models/modular_models_t5.py (outdated, resolved)
vectors = collate_vectors_lhotse(items, padding_value=padding_value)
if max_length > vectors.size(1):
    vectors = torch.cat(
        [vectors, padding_value * torch.ones(vectors.size(0), max_length - vectors.size(1), dtype=vectors.dtype)],
why do we need to enforce a static shape with padding for every example here?
To be consistent with the behavior of the megatron dataloader and AudioTextDataset in nemo/collections/multimodal/speech_llm/data/audio_text_dataset.py.
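The padding discussed above can be sketched as follows. This is a hypothetical standalone version, not NeMo's actual API: `collate_and_pad` is an illustrative name, and it first collates variable-length 1-D tensors to the batch maximum, then right-pads the whole batch to a fixed `max_length` so every batch has a static shape.

```python
import torch

def collate_and_pad(items, padding_value=0, max_length=16):
    # collate: pad each item to the longest item in the batch
    lens = [t.size(0) for t in items]
    batch_max = max(lens)
    vectors = torch.full((len(items), batch_max), padding_value, dtype=items[0].dtype)
    for i, t in enumerate(items):
        vectors[i, : t.size(0)] = t
    # then pad the whole batch out to the static max_length, if needed
    if max_length > vectors.size(1):
        pad = padding_value * torch.ones(
            vectors.size(0), max_length - vectors.size(1), dtype=vectors.dtype
        )
        vectors = torch.cat([vectors, pad], dim=-1)
    return vectors

batch = collate_and_pad([torch.tensor([1, 2, 3]), torch.tensor([4])], padding_value=0, max_length=6)
# batch.shape == (2, 6); rows are [1, 2, 3, 0, 0, 0] and [4, 0, 0, 0, 0, 0]
```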
    return (n + m - 1) // m * m

class TextProcessing:
This class needs more documentation on what it is doing, how to use its API, and what the expected input and output formats are. Also, it only has private methods right now; the main API method should be public (no underscore at the beginning).
I'd expect a docstring along the lines of: this class is used to convert X to Y; in order to do so, it performs A, B, C, and D; the expected format of X is ...; the expected format of Y is ...
Since it's used to convert text to prompts to token ids, I'd like to see full documentation of the prompt template/schema.
The options to __init__ also need documentation; if some are unused or unnecessary they may be removed.
Done. Kept the private function to follow the interface of _process_example from nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py.
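The `(n + m - 1) // m * m` helper quoted in the diff above rounds n up to the nearest multiple of m using integer arithmetic only. A minimal standalone version (the function name here is illustrative):

```python
def ceil_to_multiple(n: int, m: int) -> int:
    # adding m - 1 before floor division rounds up instead of down
    return (n + m - 1) // m * m

# e.g. padding a sequence of length 37 to a multiple of 8 gives 40,
# while a length that is already a multiple of 8 is unchanged
assert ceil_to_multiple(37, 8) == 40
assert ceil_to_multiple(40, 8) == 40
```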
examples/multimodal/speech_llm/conf/modular_audio_gpt_config_llama_lhotse.yaml (outdated, resolved)
Thanks for the great work! Please address the CodeQL issues and see the minor comments.
    return (n + m - 1) // m * m

class TextProcessing:
Is this a copy or a modified version of the TextProcessing class in `audio_text_dataset`? If it's a modified version, we should inherit from the parent class and only override the necessary functions.
I am not sure whether making the lhotse dataloader inherit from the nemo megatron dataloader is a good idea going forward, since we have decided to move away from the nemo one at some point. Added a docstring to the class; hope it helps.
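The pattern settled on above (keep the private `_process_example` name for interface compatibility, but document the class) could be sketched roughly as follows. All names, fields, and the toy tokenizer here are hypothetical illustrations, not NeMo's actual implementation.

```python
class TextProcessing:
    """Convert a text example (context + answer) into token ids.

    Expected input: a dict with "context" and "answer" strings.
    Output: a dict with "input_ids" (context tokens followed by answer
    tokens) and "answer_start_idx" (index where the answer begins, so
    the prompt portion can be masked out of the loss).
    """

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def process_example(self, example: dict) -> dict:
        """Public entry point; delegates to the interface-compatible private method."""
        return self._process_example(example)

    def _process_example(self, example: dict) -> dict:
        # tokenize context and answer separately so the boundary is known
        context_ids = self.tokenizer(example["context"])
        answer_ids = self.tokenizer(example["answer"])
        return {
            "input_ids": context_ids + answer_ids,
            "answer_start_idx": len(context_ids),
        }

# usage with a toy whitespace "tokenizer"
tp = TextProcessing(lambda s: s.split())
out = tp.process_example({"context": "transcribe the audio", "answer": "hello world"})
# out["answer_start_idx"] == 3
```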
nemo/collections/multimodal/speech_llm/parts/utils/data_utils.py (outdated, resolved)
Signed-off-by: zhehuaichen <[email protected]>
What does this PR do?
In multimodal/speech_llm, add lhotse dataloader support and two models, SALM-T5 and Bestow-GPT. Example configs are included.
Main features under speech_llm
Minor edit in nlp collection:
Collection: [common,nlp,multimodal]
PR Type: