# Voice Conversion with Non-Parallel Data
## Subtitle: Speaking like Kate Winslet
> Authors: Dabi Ahn([email protected]), [Kyubyong Park](https://github.com/Kyubyong)([email protected])

## Samples
https://soundcloud.com/andabi/sets/voice-style-transfer-to-kate-winslet-with-deep-neural-networks

## Intro
What if you could imitate a famous celebrity's voice or sing like a famous singer?
This project started with the goal of converting someone's voice to a specific target voice, so-called voice style transfer.
We worked on this project, which aims to convert someone's voice to the [voice](https://soundcloud.com/andabi/sets/voice-style-transfer-to-kate-winslet-with-deep-neural-networks) of the English actress [Kate Winslet](https://en.wikipedia.org/wiki/Kate_Winslet).
We implemented deep neural networks to achieve this, and more than 2 hours of audiobook sentences read by Kate Winslet are used as the dataset.

<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/title.png" width="50%"></p>

## Model Architecture
This is a many-to-one voice conversion system.
The main significance of this work is that we can generate a target speaker's utterances without parallel data such as <source's wav, target's wav>, <wav, text>, or <wav, phone> pairs; only waveforms of the target speaker are needed.
(Building such parallel datasets takes a lot of effort.)
All we need in this project is a number of waveforms of the target speaker's utterances and only a small set of <wav, phone> pairs from a number of anonymous speakers.

<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/architecture.png" width="85%"></p>

The model architecture consists of two modules:
1. Net1 (phoneme classification) classifies someone's utterances into one of the phoneme classes at every timestep.
    * Phonemes are speaker-independent, while waveforms are speaker-dependent.
2. Net2 (speech synthesis) synthesizes speech of the target speaker from the phonemes.

We applied CBHG (1-D convolution bank + highway network + bidirectional GRU) modules, as described in [Tacotron](https://arxiv.org/abs/1703.10135).
CBHG is known to be good at capturing features from sequential data.

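For illustration, a CBHG-style block could be sketched roughly as follows with `tf.keras` layers. This is not the repo's implementation (which targets TensorFlow 1.x as listed in the requirements); the layer widths, the number of highway layers, the omission of batch normalization, and the 40-dimensional input are simplifications and assumptions.

```python
import tensorflow as tf  # TF 2.x assumed for this sketch


def highway(x, units):
    # Highway layer: a learned gate t mixes a transformed signal h with the input x.
    h = tf.keras.layers.Dense(units, activation="relu")(x)
    t = tf.keras.layers.Dense(units, activation="sigmoid")(x)
    return h * t + x * (1.0 - t)


def cbhg(inputs, num_banks=8, proj_units=128, num_highway=4):
    # 1-D convolution bank: parallel convolutions with kernel widths 1..num_banks.
    bank = tf.keras.layers.Concatenate()([
        tf.keras.layers.Conv1D(proj_units, k, padding="same", activation="relu")(inputs)
        for k in range(1, num_banks + 1)
    ])
    pooled = tf.keras.layers.MaxPool1D(pool_size=2, strides=1, padding="same")(bank)
    # Projection back to the input width, plus a residual connection.
    proj = tf.keras.layers.Conv1D(proj_units, 3, padding="same", activation="relu")(pooled)
    proj = tf.keras.layers.Conv1D(int(inputs.shape[-1]), 3, padding="same")(proj)
    x = tf.keras.layers.Add()([proj, inputs])
    # Highway layers followed by a bidirectional GRU over the sequence.
    for _ in range(num_highway):
        x = highway(x, int(x.shape[-1]))
    return tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(proj_units, return_sequences=True))(x)


# Example: sequences of 40-dimensional MFCC frames in, sequence features out.
frames = tf.keras.Input(shape=(None, 40))
model = tf.keras.Model(frames, cbhg(frames))
model.summary()
```
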
### Net1 is a classifier.
* Process: wav -> spectrogram -> mfccs -> phoneme dist.
* Net1 classifies the spectrogram into one of 60 English phoneme classes at every timestep.
    * For each timestep, the input is a log magnitude spectrogram and the target is a phoneme distribution.
* The objective function is cross-entropy loss.
* The [TIMIT dataset](https://catalog.ldc.upenn.edu/ldc93s1) is used.
    * It contains utterances from 630 speakers reading similar sentences, along with the corresponding phone labels.
* Over 70% test accuracy.

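To make the wav -> spectrogram -> MFCC step concrete, here is a minimal feature-extraction sketch with librosa, using the frame settings listed in the Settings section below. The FFT size, the MFCC order (40), and the file name are assumptions rather than values taken from the repo.

```python
import numpy as np
import librosa

SR = 16000          # sample rate (see Settings below)
WIN = 400           # 25 ms window at 16 kHz
HOP = 80            # 5 ms hop at 16 kHz
N_FFT = 512         # FFT size: an assumed power of two >= WIN

# "speech.wav" is a placeholder for any mono utterance.
wav, _ = librosa.load("speech.wav", sr=SR)

# Log magnitude spectrogram: Net1's per-timestep input.
mag = np.abs(librosa.stft(wav, n_fft=N_FFT, win_length=WIN, hop_length=HOP))
log_mag = np.log(mag + 1e-5)

# MFCCs from the same waveform at the same hop length; 40 coefficients is an assumed order.
mfccs = librosa.feature.mfcc(y=wav, sr=SR, n_mfcc=40, n_fft=N_FFT, hop_length=HOP)

# Net1 maps each of the T frames to a distribution over the 60 phoneme classes.
print(log_mag.shape, mfccs.shape)   # (1 + N_FFT // 2, T) and (40, T)
```
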
### Net2 is a synthesizer.
Net2 contains Net1 as a sub-network.
* Process: Net1 (wav -> spectrogram -> mfccs -> phoneme dist.) -> spectrogram -> wav
* Net2 synthesizes the target speaker's speech.
    * The input/target is a set of the target speaker's utterances.
* Since Net1 is already trained in the previous step, only the remaining part needs to be trained in this step.
* The loss is the reconstruction error between input and target (L2 distance).
* Datasets
    * Target1 (anonymous female): [Arctic](http://www.festvox.org/cmu_arctic/) dataset (public)
    * Target2 (Kate Winslet): over 2 hours of audiobook sentences read by her (private)
* Griffin-Lim reconstruction is used when reverting from spectrogram to waveform (see the sketch below).

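Griffin-Lim is only named above, so here is a generic minimal sketch of the algorithm in numpy/librosa to show what the reconstruction loop does; the iteration count and FFT parameters are assumptions, not the repo's values.

```python
import numpy as np
import librosa


def griffin_lim(mag, n_fft=512, win_length=400, hop_length=80, n_iters=50):
    """Estimate a waveform from a magnitude spectrogram by iterative phase refinement."""
    # Start from random phase, then repeatedly enforce the given magnitude.
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    stft = mag * phase
    for _ in range(n_iters):
        wav = librosa.istft(stft, win_length=win_length, hop_length=hop_length)
        rebuilt = librosa.stft(wav, n_fft=n_fft, win_length=win_length, hop_length=hop_length)
        stft = mag * np.exp(1j * np.angle(rebuilt))
    return librosa.istft(stft, win_length=win_length, hop_length=hop_length)
```

In practice, the magnitude spectrogram predicted by Net2 would be passed in; more iterations generally trade speed for fewer artifacts.
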
## Implementations
### Requirements
* python 2.7
* tensorflow >= 1.1
* numpy >= 1.11.1
* librosa == 0.5.1

### Settings
* sample rate: 16,000Hz
* window length: 25ms
* hop length: 5ms

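In samples, these settings translate to the STFT parameters below (the FFT size is an assumption; the repo may use a different value):

```python
sr = 16000                    # sample rate
win_length = int(0.025 * sr)  # 25 ms window -> 400 samples
hop_length = int(0.005 * sr)  # 5 ms hop     -> 80 samples
n_fft = 512                   # assumed: next power of two >= win_length
```
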
### Procedure
* Train phase: Net1 and Net2 should be trained sequentially.
    * Train1 (training Net1)
        * Run `train1.py` to train and `eval1.py` to test.
    * Train2 (training Net2)
        * Run `train2.py` to train and `eval2.py` to test.
        * Train2 should only be trained after Train1 is done!
* Convert phase: feed forward through Net2.
    * Run `convert.py` to get result samples.
    * Check TensorBoard's audio tab to listen to the samples.
    * Take a look at the phoneme dist. visualization on TensorBoard's image tab.
        * The x-axis represents phoneme classes and the y-axis represents timesteps.
        * The first class on the x-axis means silence.

<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/phoneme_dist.png" width="30%"></p>

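If you want to render a plot like the one above outside TensorBoard, a rough matplotlib sketch (not part of the repo, which relies on TensorBoard) could look like this, with random probabilities standing in for Net1's softmax output:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for Net1's per-timestep phoneme distribution: shape (timesteps, classes),
# with the first class standing for silence, as noted above.
probs = np.random.dirichlet(np.ones(60), size=200)

plt.imshow(probs, aspect="auto", origin="lower", cmap="viridis")
plt.xlabel("phoneme class (class 0 = silence)")
plt.ylabel("timestep")
plt.colorbar(label="probability")
plt.show()
```
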
## Tips (lessons we've learned from this project)
* The window length and hop length have to be small enough to fit within a single phoneme.
* Obviously, the sample rate, window length, and hop length should be the same in both Net1 and Net2.
* Before ISTFT (spectrogram to waveform), emphasizing the predicted spectrogram by raising it to a power of 1.0~2.0 helps remove noisy sound (sketched after this list).
* Applying temperature to the softmax in Net1 did not seem to make a meaningful difference (also sketched below).
* In our opinion, the accuracy of Net1 (phoneme classification) does not need to be perfect.
    * Net2 can get close to optimal as long as Net1 is reasonably accurate.

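Two of the tips above are easy to express in code. The sketch below shows the spectrogram emphasis applied before Griffin-Lim/ISTFT and a temperature-scaled softmax; the exponent 1.3 and the array shapes are arbitrary choices, and `griffin_lim` refers to the sketch in the Net2 section.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    z = logits / temperature
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Stand-in for Net2's predicted magnitude spectrogram, shape (1 + n_fft // 2, timesteps).
predicted_mag = np.abs(np.random.randn(257, 200))

# Emphasis: raising the magnitude to a power in 1.0~2.0 suppresses low-level noise.
emphasized = predicted_mag ** 1.3
# wav = griffin_lim(emphasized)   # as sketched in the Net2 section above
```
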
## References
* ["Phonetic posteriorgrams for many-to-one voice conversion without parallel data training"](https://www.researchgate.net/publication/307434911_Phonetic_posteriorgrams_for_many-to-one_voice_conversion_without_parallel_data_training), 2016 IEEE International Conference on Multimedia and Expo (ICME)
* ["Tacotron: Towards End-to-End Speech Synthesis"](https://arxiv.org/abs/1703.10135), Submitted to Interspeech 2017