
Commit ca7e02f

Commit message: 'a';
1 parent f943e95 commit ca7e02f


53 files changed, +213 -2269 lines changed

.gitignore

Lines changed: 1 addition & 3 deletions
@@ -8,9 +8,7 @@ __*
 tools
 
 models/
-data/
-datasets/
 logs/
 venv/
 raw/
-images/
+images/

Attention.py

Lines changed: 0 additions & 76 deletions
This file was deleted.

LICENSE

Lines changed: 0 additions & 19 deletions
This file was deleted.

README.md

Lines changed: 1 addition & 91 deletions
@@ -1,93 +1,3 @@
-# Voice Conversion with Non-Parallel Data
-## Subtitle: Speaking like Kate Winslet
-> Authors: Dabi Ahn([email protected]), [Kyubyong Park](https://github.com/Kyubyong)([email protected])
 
-## Samples
-https://soundcloud.com/andabi/sets/voice-style-transfer-to-kate-winslet-with-deep-neural-networks
 
-## Intro
-What if you could imitate a famous celebrity's voice or sing like a famous singer?
-This project started with the goal of converting someone's voice to a specific target voice.
-In other words, it is voice style transfer.
-This project aims to convert someone's voice to the English actress [Kate Winslet](https://en.wikipedia.org/wiki/Kate_Winslet)'s
-[voice](https://soundcloud.com/andabi/sets/voice-style-transfer-to-kate-winslet-with-deep-neural-networks).
-We implemented deep neural networks to achieve that, and more than 2 hours of audiobook sentences read by Kate Winslet are used as the dataset.
-
-<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/title.png" width="50%"></p>
-
-## Model Architecture
-This is a many-to-one voice conversion system.
-The main significance of this work is that we could generate a target speaker's utterances without parallel data such as <source's wav, target's wav>, <wav, text>, or <wav, phone> pairs, using only waveforms of the target speaker.
-(Building such parallel datasets takes a lot of effort.)
-All we need in this project is a number of waveforms of the target speaker's utterances and a small set of <wav, phone> pairs from a number of anonymous speakers.
-
-<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/architecture.png" width="85%"></p>
-
-The model architecture consists of two modules:
-1. Net1 (phoneme classification) classifies someone's utterances into one of the phoneme classes at every timestep.
-    * Phonemes are speaker-independent, while waveforms are speaker-dependent.
-2. Net2 (speech synthesis) synthesizes the target speaker's speech from the phonemes.
-
-We applied the CBHG (1-D convolution bank + highway network + bidirectional GRU) modules described in [Tacotron](https://arxiv.org/abs/1703.10135).
-CBHG is known to be good at capturing features from sequential data.
-
-### Net1 is a classifier.
-* Process: wav -> spectrogram -> mfccs -> phoneme dist.
-* Net1 classifies the spectrogram into one of 60 English phoneme classes at every timestep.
-* For each timestep, the input is the log magnitude spectrogram and the target is the phoneme distribution.
-* The objective function is cross-entropy loss (see the sketch below).
-* The [TIMIT dataset](https://catalog.ldc.upenn.edu/ldc93s1) is used.
-    * It contains utterances of 630 speakers reading similar sentences, together with the corresponding phones.
-    * Test accuracy is over 70%.
-
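A minimal sketch of the per-timestep phoneme classification Net1 performs, as described above. This is an illustration rather than the repository's code: the MFCC dimension, the class count, and the single dense layer standing in for the CBHG stack are assumptions.

```python
# Illustration only: per-timestep phoneme classification with cross-entropy,
# in the TensorFlow 1.x style the requirements imply.
import tensorflow as tf

N_MFCC = 40       # assumed MFCC dimension (not stated in the README)
N_CLASSES = 61    # assumed: 60 English phonemes + 1 silence class

mfccs = tf.placeholder(tf.float32, [None, None, N_MFCC])  # (batch, time, features)
phones = tf.placeholder(tf.int32, [None, None])           # (batch, time) class ids

logits = tf.layers.dense(mfccs, N_CLASSES)                # (batch, time, N_CLASSES)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=phones, logits=logits))
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```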
-### Net2 is a synthesizer.
-Net2 contains Net1 as a sub-network.
-* Process: net1(wav -> spectrogram -> mfccs -> phoneme dist.) -> spectrogram -> wav
-* Net2 synthesizes the target speaker's speech.
-* The input/target is a set of the target speaker's utterances.
-* Since Net1 is already trained in the previous step, only the remaining part is trained in this step.
-* The loss is the reconstruction error (L2 distance) between input and target.
-* Datasets
-    * Target1 (anonymous female): [Arctic](http://www.festvox.org/cmu_arctic/) dataset (public)
-    * Target2 (Kate Winslet): over 2 hours of audiobook sentences read by her (private)
-* Griffin-Lim reconstruction is used when reverting the spectrogram to a waveform (see the sketch below).
-
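A rough sketch of Griffin-Lim phase reconstruction of the kind mentioned above, not the repository's implementation. The 400-sample window and 80-sample hop follow from the 25 ms / 5 ms settings at 16 kHz listed later; n_fft and the iteration count are assumptions.

```python
# Illustration: iteratively re-estimate the phase for a predicted magnitude
# spectrogram, then invert with the ISTFT.
import numpy as np
import librosa

def griffin_lim(mag, n_fft=512, hop_length=80, win_length=400, n_iter=50):
    """mag: magnitude spectrogram of shape (1 + n_fft // 2, frames)."""
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))  # random initial phase
    spec = mag * phase
    for _ in range(n_iter):
        wav = librosa.istft(spec, hop_length=hop_length, win_length=win_length)
        rebuilt = librosa.stft(wav, n_fft=n_fft, hop_length=hop_length,
                               win_length=win_length)
        phase = rebuilt / np.maximum(1e-8, np.abs(rebuilt))  # keep phase, drop magnitude
        spec = mag * phase
    return librosa.istft(spec, hop_length=hop_length, win_length=win_length)
```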
-## Implementations
-### Requirements
-* python 2.7
-* tensorflow >= 1.1
-* numpy >= 1.11.1
-* librosa == 0.5.1
-
-### Settings
-* sample rate: 16,000 Hz
-* window length: 25 ms
-* hop length: 5 ms
-
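In samples, those settings work out as shown below. The librosa calls, file name, n_fft, and MFCC count are illustrative assumptions rather than values taken from the repository.

```python
# Convert the settings above into frame parameters (simple arithmetic),
# then run a hypothetical wav -> spectrogram -> MFCC extraction.
import numpy as np
import librosa

sr = 16000                     # 16,000 Hz sample rate
win_length = int(0.025 * sr)   # 25 ms window -> 400 samples
hop_length = int(0.005 * sr)   # 5 ms hop     -> 80 samples

wav, _ = librosa.load("utterance.wav", sr=sr)        # placeholder file name
spec = librosa.stft(wav, n_fft=512, hop_length=hop_length, win_length=win_length)
log_mag = np.log(np.abs(spec) + 1e-8)                # log magnitude spectrogram
mfccs = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=40,
                             n_fft=512, hop_length=hop_length)
```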
-### Procedure
-* Train phase: Net1 and Net2 should be trained sequentially.
-    * Train1 (training Net1)
-        * Run `train1.py` to train and `eval1.py` to test.
-    * Train2 (training Net2)
-        * Run `train2.py` to train and `eval2.py` to test.
-        * Train2 must be run after Train1 is done!
-* Convert phase: feed-forward through Net2.
-    * Run `convert.py` to get result samples.
-    * Check Tensorboard's audio tab to listen to the samples.
-    * Take a look at the phoneme dist. visualization on Tensorboard's image tab.
-        * The x-axis represents phoneme classes and the y-axis represents timesteps.
-        * The first class on the x-axis is silence.
-
-<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/phoneme_dist.png" width="30%"></p>
-
-## Tips (lessons we've learned from this project)
-* Window length and hop length have to be small enough to fit within a single phoneme.
-* Obviously, the sample rate, window length, and hop length should be the same in both Net1 and Net2.
-* Before the ISTFT (spectrogram to waveform), emphasizing the predicted spectrogram by raising it to a power of 1.0~2.0 helps remove noisy sound (see the sketch below).
-* Applying temperature to the softmax in Net1 does not seem to make much difference.
-* In our experience, the accuracy of Net1 (phoneme classification) does not need to be perfect.
-* Net2 can reach near-optimal results once Net1's accuracy is reasonably good.
-
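A tiny illustration of the emphasis tip; the exponent 1.5 is just one choice within the suggested 1.0~2.0 range, and `predicted_mag` is a hypothetical predicted magnitude spectrogram.

```python
import numpy as np

# Hypothetical predicted magnitude spectrogram (freq bins x frames).
predicted_mag = np.abs(np.random.randn(257, 100)).astype(np.float32)

# Raise to a power in the 1.0~2.0 range before Griffin-Lim / ISTFT to suppress noise.
emphasized = predicted_mag ** 1.5
```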
-## References
-* ["Phonetic posteriorgrams for many-to-one voice conversion without parallel data training"](https://www.researchgate.net/publication/307434911_Phonetic_posteriorgrams_for_many-to-one_voice_conversion_without_parallel_data_training), 2016 IEEE International Conference on Multimedia and Expo (ICME)
-* ["Tacotron: Towards End-to-End Speech Synthesis"](https://arxiv.org/abs/1703.10135), submitted to Interspeech 2017
+Run train_rnn_lstm_3x_final_model.py to train.

a-input.png (-12.9 KB): Binary file not shown.

a-label.png (-13.8 KB): Binary file not shown.

(-17.4 KB): Binary file not shown.

(-17.3 KB): Binary file not shown.

(-16.1 KB): Binary file not shown.

(-25.1 KB): Binary file not shown.
