Minimal TensorFlow 2.0 implementation of Listen, Attend and Spell (https://arxiv.org/abs/1508.01211). For a better understanding of the naming of the model's variables, please see the paper above.
- The model architecture looks right to me. If you find an error in the code, please don't hesitate to open an issue 😊
- Implement data handling for easier training of the model.
- Train on LibriSpeech 100h.
- Implement SpecAugment features (previous SOTA on LibriSpeech) (https://arxiv.org/abs/1904.08779); a sketch of the masking idea is shown below.
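For reference on the SpecAugment item above, here is a minimal NumPy sketch of its frequency- and time-masking idea. The mask widths are placeholder values, not the paper's exact policies, and the paper's time-warping step is omitted:

```python
import numpy as np

def spec_augment(mel, max_freq_mask=27, max_time_mask=40):
    """Minimal SpecAugment sketch: one frequency mask and one time mask.
    `mel` has shape (timesteps, n_mels); masked regions are set to zero.
    Mask widths are illustrative, not the paper's exact policies."""
    mel = mel.copy()
    timesteps, n_mels = mel.shape

    # Frequency mask: zero out f consecutive mel channels
    f = np.random.randint(0, max_freq_mask + 1)
    f0 = np.random.randint(0, n_mels - f + 1)
    mel[:, f0:f0 + f] = 0.0

    # Time mask: zero out t consecutive frames
    t = np.random.randint(0, max_time_mask + 1)
    t0 = np.random.randint(0, timesteps - t + 1)
    mel[t0:t0 + t, :] = 0.0
    return mel
```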
The file `model.py` contains the architecture of the model. Example usage below.
"""
def LAS(dim, f_1, no_tokens):
dim: Number of hidden neurons for most LSTM's.
f_1: pBLSTM takes (Batch, timesteps, f_1) as input, f_1 is number of features of the mel spectrogram
per timestep. Timestep is the width of the spectrogram.
No_tokens: Number of unique tokens for input and output vector.
"""
```python
import numpy as np

from model import LAS

model = LAS(256, 256, 16)
model.compile(loss="mse", optimizer="adam")

# x_1 should have shape (batch_size, timesteps, f_1)
x_1 = np.random.random((1, 550, 256))

# x_2 should have shape (batch_size, no_prev_tokens, no_tokens).
# Each token vector should be one-hot encoded.
x_2 = np.zeros((1, 12, 16))
for n in range(12):
    x_2[0, n, np.random.randint(1, 16)] = 1

# Given x_1 and x_2, the model predicts the next token from the
# spectrogram and the previously predicted tokens.
model.predict([x_1, x_2])
```
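To spell out a full transcript rather than a single next token, the prediction above can be run autoregressively. Below is a minimal greedy-decoding sketch under stated assumptions: token index 1 serves as a hypothetical start-of-sequence symbol, and `model.predict` is assumed to return a `(batch, steps, no_tokens)` distribution whose last step scores the next token:

```python
import numpy as np

def greedy_decode(model, x_1, no_tokens=16, max_len=50, sos=1):
    """Greedy autoregressive decoding sketch. `sos` is a hypothetical
    start-of-sequence token index; the real token inventory depends on
    how the data is prepared."""
    tokens = [sos]
    for _ in range(max_len):
        # One-hot encode the tokens predicted so far
        x_2 = np.zeros((1, len(tokens), no_tokens))
        for i, t in enumerate(tokens):
            x_2[0, i, t] = 1
        probs = model.predict([x_1, x_2])
        # Assumes the last output step scores the next token
        tokens.append(int(np.argmax(probs[0, -1])))
    return tokens
```

Beam search, as used in the paper, would keep several hypotheses at each step instead of the single argmax.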