Refactoring expressivity/predict into ExpressiveTranslator. #292

Draft
wants to merge 3 commits into main
Conversation

kauterry
Contributor

Almost done with the implementation; saving this as a draft since I have yet to test it.

@facebook-github-bot added the CLA Signed label on Dec 20, 2023
@yilinyang7 commented Dec 20, 2023

I'm not quite sure of the intention of this PR; can I ask a few questions:

  • Why have both inference/ and predict/ subdirectories? They seem to serve the same purpose, which could be confusing.
  • This new ExpressiveTranslator only supports single-audio translation; isn't that use case too narrow? Single-audio translation is already well served by our two public demos (internal & HF) plus a gradio app.py script. I'm not convinced we need this.

@kauterry
Contributor Author

> I'm not quite sure of the intention of this PR; can I ask a few questions:
>
>   • Why have both inference/ and predict/ subdirectories? They seem to serve the same purpose, which could be confusing.
>   • This new ExpressiveTranslator only supports single-audio translation; isn't that use case too narrow? Single-audio translation is already well served by our two public demos (internal & HF) plus a gradio app.py script. I'm not convinced we need this.

There is no harm in having this extra way of doing expressivity inference. Developers aren't going to use the demo; they need the code. For M4T we have m4t_predict, which is a thin layer around the Translator class, and we want to stick to the same format for expressivity_predict. The more entry points we offer, the easier we make our models to adopt, which is the whole point of OSS. In this case, an ExpressiveTranslator strips all the unnecessary boilerplate away from the user, so they can call expressive inference easily, in the same format as M4T inference.

I don't think it's confusing: the cli directory holds what the user can invoke directly, while inference acts as a bridge between the CLI and the actual models. If we have a Translator, we definitely need an ExpressiveTranslator, so I strongly disagree with you on this. In fact, this discrepancy between m4t_predict's design and expressivity_predict's design was pointed out by @elbayadm. I feel we should have paid more attention to the design of these interfaces that expose our models to users.
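
For context, the M4T pattern referenced above looks roughly like this (per the seamless_communication README; the audio path and language code are placeholders):

```python
import torch
from seamless_communication.inference import Translator

# m4t_predict is a thin CLI layer over this class; the goal of this PR is to
# give expressive inference the same shape.
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_v2",
    device=torch.device("cuda:0"),
    dtype=torch.float16,
)
text_output, speech_output = translator.predict(
    input="input_audio.wav",  # placeholder path
    task_str="S2ST",
    tgt_lang="fra",
)
```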

@kauterry
Contributor Author

predict falls under the cli directory; inference is a directory of its own. Feel free to suggest a better name to make this clearer.

@yilinyang7 commented Dec 21, 2023

Thanks for the reply.
I agree there's no harm in implementing this extra single-audio inference script.
My point is that when I looked at the ExpressiveTranslator class, it only supports single-audio inference. With such a general class name, isn't that too limited, and could it confuse future users? I believe it should support batching, as Translator and PretsselGenerator do.
As for the directories, I don't understand why you want both cli/expressivity/predict and cli/expressivity/inference. Could we instead put the script files under one directory?

@yilinyang7

For expressivity inference, batched inference plus the automatic evaluation metrics that run after it should be the default point of entry.
I don't disagree with an additional single-audio inference script; let's just align on what the main point of entry is.
If this ExpressiveTranslator class doesn't support batched inference, it'll confuse users.

@kauterry
Contributor Author

This is going to support batched inference, and I will refactor expressivity_evaluate to use it, exactly as Translator is used in m4t_evaluate and m4t_predict. How does that sound?
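
For illustration, the batching step both scripts would share amounts to padding per-utterance features into one tensor; a minimal sketch using torchaudio (the helper name is hypothetical, not code from this PR):

```python
import torch
import torchaudio
from torch.nn.utils.rnn import pad_sequence

def make_fbank_batch(audio_paths: list[str]) -> tuple[torch.Tensor, torch.Tensor]:
    """Pad per-utterance 80-dim fbanks into one (B, T_max, 80) batch."""
    fbanks = []
    for path in audio_paths:
        waveform, _sample_rate = torchaudio.load(path)
        fbanks.append(torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80))
    # Keep true lengths so padded frames can be masked during inference.
    lengths = torch.tensor([f.size(0) for f in fbanks])
    return pad_sequence(fbanks, batch_first=True), lengths
```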

@yilinyang7

> This is going to support batched inference, and I will refactor expressivity_evaluate to use it, exactly as Translator is used in m4t_evaluate and m4t_predict. How does that sound?

yep, that sounds good, thanks!

@elbayadm
Contributor

Thank you @kauterry for drafting this PR.
@yilinyang7, the main thing I thought was missing in SC is this:
speech_output, text_output = expressive_translator.predict(input), where a user wouldn't need to compute fbank and src_gcmvn for prosody outside the call to a predict function. The point is not to add new functionality (e.g. batching) but to provide a better entry point in Python, à la HF's transformers: not Gradio, not a demo, and not the command line.
Functionality can be added later, and refactoring expressivity_evaluate is definitely necessary.
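
For concreteness, a minimal sketch of the entry point being described, with fbank extraction and gcmvn normalization folded into predict. The class layout, the preloaded gcmvn stats, and the downstream predict signatures are assumptions for illustration, not the code in this PR:

```python
import torch
import torchaudio

class ExpressiveTranslator:
    """Sketch: hides fbank and src_gcmvn computation behind a single predict()."""

    def __init__(self, translator, pretssel_generator, gcmvn_mean, gcmvn_std):
        self.translator = translator                  # underlying Translator
        self.pretssel_generator = pretssel_generator  # PretsselGenerator
        self.gcmvn_mean = gcmvn_mean                  # global CMVN stats (assumed preloaded)
        self.gcmvn_std = gcmvn_std

    def predict(self, audio_path, tgt_lang):
        waveform, _sample_rate = torchaudio.load(audio_path)
        # 80-dim log-mel filterbank features for the prosody encoder.
        fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80)
        # Global cepstral mean/variance normalization of the prosody input.
        src_gcmvn = (fbank - self.gcmvn_mean) / self.gcmvn_std
        # Downstream call shapes below are assumptions, shown only to make the
        # "one predict() call" idea concrete.
        text_output, unit_output = self.translator.predict(
            fbank.unsqueeze(0), "S2ST", tgt_lang,
            prosody_encoder_input=src_gcmvn.unsqueeze(0),
        )
        speech_output = self.pretssel_generator.predict(
            unit_output, tgt_lang, src_gcmvn.unsqueeze(0)
        )
        return speech_output, text_output
```

A user would then write speech_output, text_output = expressive_translator.predict("utt.wav", "fra"), matching the one-liner above.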

```diff
@@ -161,7 +161,7 @@ Please check out above [section](#seamlessexpressive-models) on how to acquire `
 ### W2v-BERT 2.0 speech encoder
 | Model Name | #params | checkpoint |
 | ----------------- | ------- | ---------- |
-| W2v-BERT 2.0 | 600M | [🤗 Model card](https://huggingface.co/facebook/conformer-shaw) - [checkpoint](https://huggingface.co/facebook/conformer-shaw/resolve/main/conformer_shaw.pt)
+| W2v-BERT 2.0 | 600M | [🤗 Model card](https://huggingface.co/facebook/w2v-bert-2.0) - [checkpoint](https://huggingface.co/facebook/w2v-bert-2.0/resolve/main/conformer_shaw.pt)
```
Contributor
Can we fix this in another PR?

Contributor Author
Sure

@yilinyang7

> predict falls under the cli directory; inference is a directory of its own. Feel free to suggest a better name to make this clearer.

Sorry, I missed this; I just realized I misread the directory layers.

@yilinyang7

> Thank you @kauterry for drafting this PR. @yilinyang7, the main thing I thought was missing in SC is this: speech_output, text_output = expressive_translator.predict(input), where a user wouldn't need to compute fbank and src_gcmvn for prosody outside the call to a predict function. The point is not to add new functionality (e.g. batching) but to provide a better entry point in Python, à la HF's transformers: not Gradio, not a demo, and not the command line. Functionality can be added later, and refactoring expressivity_evaluate is definitely necessary.

Thanks for the clarification, that makes sense!
I think the current discrepancy is that expressivity_evaluate does batched inference, while the current ExpressiveTranslator class only supports single-audio inference.
If we want a clearer API call, we might need to refactor every script to use this new ExpressiveTranslator class, which should support batched inference for efficiency.
If I misunderstood anything, let me know!

@elbayadm
Contributor

> Thanks for the clarification, that makes sense! I think the current discrepancy is that expressivity_evaluate does batched inference, while the current ExpressiveTranslator class only supports single-audio inference. If we want a clearer API call, we might need to refactor every script to use this new ExpressiveTranslator class, which should support batched inference for efficiency. If I misunderstood anything, let me know!

That's reasonable; if it's already implemented, then the new ExpressiveTranslator should use it as well.

@yilinyang7 left a review comment

After the bug fix, batched ExpressiveTranslator inference is done and replicates the previous results across all metrics.
