
bayesVec2Midi


Use Bayesian optimisation to convert an image's main color into music, Python 2020.


Table of Contents

- Description
- Sources

Description

We can encode any dataset into a lower-dimensional space with the help of metric learning approaches such as Siamese learning, self-supervised learning and triplet learning. To simplify the research, a dataset of images labelled with the color that attracts the most attention was constructed and trained with triplets. The resulting embedding space of images, projected to 3D with UMAP, is shown here:

[Figure: image embeddings projected to 3D with UMAP]

The network was trained in the Image2Vec_learning.ipynb notebook.
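The notebook implements the triplet training; as a rough, minimal sketch of the idea (the margin value and the toy batch of embeddings below are assumptions, not values from the repository):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull the anchor towards the positive embedding and
    push it away from the negative one by at least `margin`."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(pos_dist - neg_dist + margin, 0.0).mean()

# toy batch of 9-dimensional embeddings (the embedding size used in this project)
rng = np.random.default_rng(0)
anchor, positive, negative = (rng.normal(size=(4, 9)) for _ in range(3))
print(triplet_loss(anchor, positive, negative))
```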

You can find the Image2Vec conversion in the Image2Vec_prediction.ipynb notebook.

A Bayesian search is used to find the weights of the transition matrix that maps the embedding space to Google's musicVAE latent space. Matching embeddings to the musicVAE latent space poses several difficulties. To start with, we want to meaningfully compare melodies not in the latent space but in their final MIDI form.

That implies we can't calculate gradients with respect to the weights, so the family of gradient-descent optimisation algorithms is unavailable. Instead, we can either randomly pick weights until the metric score is good enough or use Bayesian optimisation, which takes previous attempts into account and generally converges faster than random search.
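A minimal sketch of such a gradient-free search, here using scikit-optimize's gp_minimize rather than whichever optimiser the repository actually uses; the objective body and the search bounds are placeholders:

```python
from skopt import gp_minimize

def objective(coeffs):
    """Build the transition matrix from the candidate coefficients, decode a
    melody through musicVAE and score it against the target MIDI sequence.
    The body below is a placeholder so the sketch runs on its own."""
    return sum(c ** 2 for c in coeffs)

result = gp_minimize(
    objective,
    dimensions=[(-1.0, 1.0)] * 100,  # one bound per searched coefficient
    n_calls=200,                     # evaluations of the black-box objective
    random_state=42,
)
print(result.fun)
```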

It also means we need to compare sequences of unequal length. Existing approaches either calculate the area between melodies interpolated to the same length or use variants of the Dynamic Time Warping algorithm. Our approach uses its own MIDI sequence similarity function. This function takes into account the difference in pitches, the difference in durations and the difference in the first notes, and uses the Manhattan distance to get the final score. The Manhattan distance is often used in practice to compare signals, and melody sequences are a kind of signal.
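The exact similarity function lives in the repository; a simplified Manhattan-distance sketch along these lines (the (pitch, duration) representation, the prefix truncation and the length penalty are assumptions):

```python
import numpy as np

def midi_similarity(seq_a, seq_b):
    """Compare two note sequences given as lists of (pitch, duration) pairs.
    The shorter sequence is compared against the prefix of the longer one;
    pitch, duration and first-note differences are combined with the
    Manhattan (L1) distance, plus a penalty for the length mismatch."""
    n = min(len(seq_a), len(seq_b))
    a = np.asarray(seq_a[:n], dtype=float)
    b = np.asarray(seq_b[:n], dtype=float)
    pitch_diff = np.abs(a[:, 0] - b[:, 0]).sum()
    dur_diff = np.abs(a[:, 1] - b[:, 1]).sum()
    first_diff = np.abs(a[0] - b[0]).sum()
    length_penalty = abs(len(seq_a) - len(seq_b))
    return pitch_diff + dur_diff + first_diff + length_penalty

print(midi_similarity([(60, 0.5), (62, 0.5), (64, 1.0)],
                      [(60, 0.5), (65, 0.25)]))
```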

Frankly speaking, using Bayesian search to find the full transition matrix directly is a bad idea. There are latent_space_size × embedding_size parameters to optimise, i.e. 512 × 9 ≈ 5k parameters. A rule of thumb is to run 12-16 iterations per parameter. Considering that Bayesian optimisation scales as O(n^3) because of the matrix inversion it requires, where n is the number of evaluations (which itself grows with the number of parameters), this optimisation could take years. To mitigate this problem, the module reduces the number of parameters through Gaussian random projection: instead of searching for the matrix weights directly, we search for 100 coefficients of a linear combination of randomly generated matrices. The number of parameters is thus dropped fiftyfold.
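A minimal sketch of that reduction (the dimensions follow the text; the fixed random seed and the zero test vector are assumptions):

```python
import numpy as np

LATENT_SIZE, EMBEDDING_SIZE, N_COEFFS = 512, 9, 100

# fixed Gaussian basis: N_COEFFS random matrices of the full transition shape
rng = np.random.default_rng(seed=0)
basis = rng.normal(size=(N_COEFFS, LATENT_SIZE, EMBEDDING_SIZE))

def build_transition_matrix(coeffs):
    """Combine the random basis matrices with the 100 searched coefficients
    instead of optimising all 512 * 9 ~= 5k weights directly."""
    return np.tensordot(coeffs, basis, axes=1)  # -> (512, 9) matrix

W = build_transition_matrix(np.zeros(N_COEFFS))
print(W.shape)  # (512, 9)
```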

You can find the Vec2Midi learning in the music_learning.py script.

It's worth mentioning that the musicVAE latent space has some garbage dimensions which affect the generated melodies only by a small amount.

[Figure: influence of individual musicVAE latent dimensions on the generated melodies]

The research showed that only 10-20 dimensions of the musicVAE latent space are vital for melody generation.

[Figure: only 10-20 musicVAE latent dimensions are vital for melody generation]
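A hedged way to reproduce this kind of per-dimension check; decode and similarity are placeholders for the musicVAE decoder and the similarity function above, not the real Magenta API:

```python
import numpy as np

def dimension_sensitivity(decode, similarity, latent_size=512, eps=2.0):
    """Perturb each latent dimension of a zero vector by `eps`, decode the
    melody and measure how far it moves from the unperturbed melody.
    Large scores mark the "vital" dimensions."""
    base = decode(np.zeros(latent_size))
    scores = np.zeros(latent_size)
    for i in range(latent_size):
        z = np.zeros(latent_size)
        z[i] = eps
        scores[i] = similarity(decode(z), base)
    return scores

# dummy stand-ins so the sketch runs: decode just keeps the first 16 values
scores = dimension_sensitivity(lambda z: z[:16],
                               lambda a, b: float(np.abs(a - b).sum()))
print(scores.argsort()[-10:])  # indices of the 10 most influential dimensions
```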

After training, we use the nearest cluster center in the embedding space to generate melodies. You can see an example of the pipeline in Pipeline.ipynb. The results, visualised with Synthesia, are shown below:

[Figure: generated melodies visualised with Synthesia]
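A schematic version of that pipeline; every callable and array here is a placeholder for the corresponding component in Pipeline.ipynb:

```python
import numpy as np

def image_to_melody(image, image2vec, cluster_centers, transition_matrix, decode):
    """Schematic end-to-end pipeline: image -> embedding -> nearest cluster
    center -> musicVAE latent vector -> decoded melody."""
    embedding = image2vec(image)                               # (9,) vector
    dists = np.linalg.norm(cluster_centers - embedding, axis=1)
    center = cluster_centers[np.argmin(dists)]                 # nearest center
    latent = transition_matrix @ center                        # (512,) musicVAE latent
    return decode(latent)                                      # MIDI melody

# dummy stand-ins so the sketch runs end to end
rng = np.random.default_rng(0)
melody = image_to_melody(
    image=None,
    image2vec=lambda img: rng.normal(size=9),
    cluster_centers=rng.normal(size=(5, 9)),
    transition_matrix=rng.normal(size=(512, 9)),
    decode=lambda z: z[:8],  # stands in for the musicVAE decoder
)
print(melody.shape)
```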


Sources

For triplet loss training, this work uses a modified version of the implementation created by @CrimyTheBold.

For MIDI generation, this work uses Google's Magenta project, specifically musicVAE, with the pretrained cat-mel_2bar_big architecture's weights.

To reduce parameters, this work uses the concept from Wu C. W., "ProdSumNet: Reducing model parameters in deep neural networks via product-of-sums matrix decompositions," arXiv preprint arXiv:1809.02209, 2018.