If you train TTS with the LJSpeech dataset, you start to hear reasonable results after a modest amount of training.
- Phoneme-based training is enabled for easier learning and robust pronunciation. It also makes it easier to adapt TTS to most languages without worrying about language-specific characters.
- Configurable attention windowing at inference time for robust alignment. It forces the network to consider only a limited window of encoder steps per decoder iteration (see the sketch after this list).
- Detailed Tensorboard stats for activation, weight and gradient values per layer. They are useful for detecting defects and comparing networks (a logging sketch follows this list).
- Constant history window. Instead of using only the last frame of predictions, define a constant-size history queue. It enables training with a gradually decreasing prediction frame (r=5 --> r=1) by changing only the last layer. For instance, you can train the model with r=5 and then fine-tune it with r=1 without any performance loss. It also solves the well-known PreNet problem [#50](https://github.com/mozilla/TTS/issues/50) (see the memory-queue sketch after this list).
- Initialization of hidden decoder states with Embedding layers instead of zero initialization (see the sketch after this list).
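
As a rough illustration of the windowing idea, here is a minimal PyTorch sketch (the function name, window sizes, and tensor layout are assumptions for illustration, not the exact implementation in this repo): scores outside a band around the previously attended encoder step are masked out before the softmax, so the alignment cannot jump arbitrarily far.

```python
import torch

def windowed_attention(scores, prev_focus, win_back=3, win_front=6):
    """Mask raw attention scores so only encoder steps inside a window
    around the previously attended position can receive weight.

    scores:     (batch, encoder_steps) pre-softmax alignment scores
    prev_focus: (batch,) index of the most-attended encoder step at the
                previous decoder iteration
    """
    batch, enc_steps = scores.shape
    positions = torch.arange(enc_steps, device=scores.device).unsqueeze(0)
    low = (prev_focus - win_back).unsqueeze(1)
    high = (prev_focus + win_front).unsqueeze(1)
    outside = (positions < low) | (positions > high)
    # -inf outside the window -> zero attention weight after the softmax
    masked = scores.masked_fill(outside, float("-inf"))
    weights = torch.softmax(masked, dim=1)
    next_focus = weights.argmax(dim=1)  # carried over to the next decoder step
    return weights, next_focus
```

As the bullet above says, the window only needs to be enabled at inference time, where it keeps long inputs from derailing the alignment.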
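
Per-layer logging can be done with `torch.utils.tensorboard`; below is a minimal sketch for weight and gradient histograms (the helper name is hypothetical and the repo's own logging hooks may differ; activation stats would additionally need forward hooks, omitted here).

```python
from torch.utils.tensorboard import SummaryWriter

def log_layer_stats(writer: SummaryWriter, model, step):
    """Write weight and gradient histograms for every named parameter."""
    for name, param in model.named_parameters():
        writer.add_histogram(f"weights/{name}", param.detach().cpu(), step)
        if param.grad is not None:
            writer.add_histogram(f"grads/{name}", param.grad.detach().cpu(), step)

# usage sketch: writer = SummaryWriter("runs/tts"), then call
# log_layer_stats(writer, model, step) after loss.backward() at each logging interval.
```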
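
A minimal sketch of the constant history window, assuming a Tacotron-style decoder where the r predicted frames come out concatenated along the feature dimension (class and method names are illustrative, not this repo's API): the decoder always reads the last `memory_size` frames, so its input size does not change when r does.

```python
from collections import deque

import torch

class FrameHistoryQueue:
    """Fixed-length queue of past predicted frames; its read-out size is
    memory_size * frame_dim regardless of the current reduction factor r."""

    def __init__(self, memory_size, frame_dim):
        self.memory_size = memory_size
        self.frame_dim = frame_dim
        self.frames = deque(maxlen=memory_size)

    def reset(self, batch_size, device):
        # Start each utterance from a history of zero frames.
        self.frames.clear()
        for _ in range(self.memory_size):
            self.frames.append(torch.zeros(batch_size, self.frame_dim, device=device))

    def push(self, prediction):
        # prediction: (batch, r * frame_dim) -> enqueue each of the r frames
        for frame in prediction.split(self.frame_dim, dim=1):
            self.frames.append(frame)

    def read(self):
        # Always (batch, memory_size * frame_dim), independent of r.
        return torch.cat(list(self.frames), dim=1)
```

Because the queue length, not r, fixes the decoder input size, an r=5 checkpoint can be fine-tuned with r=1 by replacing only the output projection.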
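
The learned initial decoder state can be sketched as a one-row `nn.Embedding` whose single trainable vector is looked up for every batch element (an illustrative module, not necessarily how this repo wires it in).

```python
import torch
from torch import nn

class LearnedInitialState(nn.Module):
    """Trainable initial hidden state, replacing zero initialization."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.state = nn.Embedding(1, hidden_dim)  # one learnable vector

    def forward(self, batch_size):
        idx = torch.zeros(batch_size, dtype=torch.long,
                          device=self.state.weight.device)
        return self.state(idx)  # (batch, hidden_dim)
```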
A common question is why we don't use the Tacotron2 architecture. According to our ablation experiments, nothing except Location Sensitive Attention improves performance, given the increase in model size.
Please feel free to offer new changes and open pull requests. We are happy to discuss.
- Punctuation at the end of a sentence sometimes affects the pronunciation of the last word, because the punctuation sign is attended to by the attention module, which forces the network to create a voice signal, or at least modify the signal being generated, for the neighboring frames.
- ~~Simpler stop-token prediction. Right now we use an RNN to keep the history of the previous frames. However, we never tested whether something simpler would work as well.~~ The RNN-based model gives more stable predictions.
- Train for better mel-specs. Mel-spectrograms are not good enough to be fed to a Neural Vocoder. An easy solution is to train the model with r=1; however, in that case the model struggles to align the attention.
- Irregular words: "minute", "focus", "aren't", etc. Even though a larger or better dataset (Nancy delivers much better quality than LJSpeech) or phoneme-based training mitigates the problem, some irregular words still cause the network to mispronounce them.
## Major TODOs
- [x] Implement the model.
- [x] Generate human-like speech on LJSpeech dataset.
- [x] Generate human-like speech on a different dataset (Nancy) (TWEB).
- [x] Train TTS with r=1 successfully.
- [ ] Enable process based distributed training. Similar [to](https://github.com/fastai/imagenet-fast/).
- [ ] Adapting Neural Vocoder. The most active work is [here](https://github.com/erogol/WaveRNN).