
Training Datamake #12

Open

shezanmirzan opened this issue Jul 11, 2018 · 3 comments

@shezanmirzan

As per your comment in one of the closed issues, you mentioned that you concatenate different sound effects into one long sound wave of noise, then pick a random speech utterance and add it to the noise at various SNRs until the end of the long noise wave.

But the datamake you uploaded in the speech enhancement toolkit does something different: it picks random intervals from the long concatenated noise wave and mixes them with different speech files.

So in the second case, one speech utterance doesn't get added along the whole of the long concatenated noise; instead, a random interval of the concatenated noise gets mixed with each speech file.

Can you explain why you took the first approach to create the dataset for training the VAD model? And second, how can I do the same thing you are doing? Should I use FaNT, or does your make_train_noisy.m have options to do so?

@jtkim-kaist
Owner

jtkim-kaist commented Jul 12, 2018

  1. The VAD and speech enhancement (SE) datasets must be different. The reason is that SE assumes the incoming signal is always noisy speech, never noise alone. We also tried using the VAD dataset (the first approach, as you mentioned) to train the SE model, but, as expected, it failed to train (the problem becomes much harder because of the noise-only segments). In contrast, for the VAD dataset, the ratio between noise-only and noisy-speech segments should be almost equal in order to prevent the class imbalance problem. This is why we follow different methods to make the VAD and SE datasets. To make the VAD dataset, just follow the way you described: concatenate different sound effects into one long noise wave, then pick random speech utterances and add them to the noise at various SNRs until the end of the long wave (see the sketch after this list). Additionally, YOU MUST VERIFY THE RATIO BETWEEN NOISE SEGMENTS (labeled 0) AND SPEECH SEGMENTS (labeled 1); ideally, 1 : 1 is best.

  2. Both the FaNT tool and v_addnoise.m in voicebox (implemented in MATLAB) follow the ITU standard, and in my experiments they showed no significant difference, so I prefer voicebox because it is easier to use. Use whichever you want. make_train_noisy.m is ONLY for the speech enhancement toolkit.
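
Here is a minimal MATLAB sketch of the recipe in point 1. The file names, SNR set, random gap lengths, and the hand-written gain computation are all illustrative assumptions, not code from this repo; in practice an ITU-compliant mixer (FaNT or voicebox) would replace the gain computation.

```matlab
% Minimal sketch of the VAD dataset recipe in point 1 above.

% 1) Concatenate the individual noise recordings into one long wave.
noise_files = {'babble.wav', 'factory.wav', 'street.wav'};  % assumed names
noise = [];
for k = 1:numel(noise_files)
    [n, fs] = audioread(noise_files{k});   % assumes all files share one fs
    noise = [noise; n(:, 1)];              % mono concatenation
end

% 2) Walk along the long noise wave, adding a random utterance at a random
%    SNR each step; label every sample (0 = noise only, 1 = speech).
speech_files = {'utt1.wav', 'utt2.wav', 'utt3.wav'};        % assumed names
snrs   = [-5 0 5 10];                                       % assumed SNR set
wave   = noise;
labels = zeros(size(wave));
cursor = 1;
while true
    [s, ~] = audioread(speech_files{randi(numel(speech_files))});
    s   = s(:, 1);
    seg = cursor : cursor + length(s) - 1;
    if seg(end) > length(wave), break; end                  % wave used up
    snr_db = snrs(randi(numel(snrs)));
    g = sqrt(sum(wave(seg).^2) * 10^(snr_db/10) / sum(s.^2));
    wave(seg)   = wave(seg) + g * s;        % speech scaled to target SNR
    labels(seg) = 1;                        % mark speech-present samples
    cursor = seg(end) + 1 + randi(fs);      % leave a noise-only gap
end

% 3) Verify the class ratio; per point 1, it should be close to 1 : 1.
fprintf('speech : noise sample ratio = %.2f\n', sum(labels) / sum(labels == 0));
```

The random noise-only gap after each utterance is what produces the noise (0) segments; shrinking or growing it is the knob for pushing the label ratio toward 1 : 1.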

@shezanmirzan
Author

I have one more doubt; I would be grateful if you could help me out with this. Suppose I create a long noise file of 40 minutes and the speech utterance is 1 minute long. Using v_addnoise, I just get a 1-minute noisy speech output in which a noise interval is randomly picked from the long noise file and added to the speech.

According to you, however, we need to add the same 1-minute file 40 times along the long noise file at different SNRs. How do we do that using v_addnoise? Is it even possible with v_addnoise, or should I try the FaNT tool?

In any case, a big thanks for clearing up my earlier doubt.

@jtkim-kaist
Owner

The inputs for v_addnoise.m should be:

```
step 1: noise(1 : length(speech1)), speech1
step 2: noise(length(speech1)+1 : length(speech1)+length(speech2)), speech2
```
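
Generalizing those two steps: keep a running cursor into the long noise wave and let utterance k consume the next length(speeches{k}) samples. A sketch under the same assumptions as the one above; the explicit gain computation again stands in for v_addnoise / FaNT, since I haven't verified v_addnoise's exact argument order here.

```matlab
% Cursor bookkeeping for steps 1 and 2 above: utterance k is mixed with
% the next length(speeches{k}) samples of the long noise wave.
% `speeches` (assumed cell array of utterance vectors, same sample rate
% as `noise`) and the target SNR are illustrative inputs.
snr_db = 5;                               % assumed target SNR in dB
cursor = 0;
noisy  = cell(numel(speeches), 1);
for k = 1:numel(speeches)
    s   = speeches{k};
    seg = cursor + 1 : cursor + length(s);         % step k's noise slice
    g   = sqrt(sum(s.^2) / (sum(noise(seg).^2) * 10^(snr_db/10)));
    noisy{k} = s + g * noise(seg);                 % noise scaled to snr_db
    cursor   = seg(end);                           % advance past the slice
end
```

Because the cursor only moves forward, every utterance sees a fresh stretch of noise, which is exactly what the two indexing steps above describe.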
