comparison #6
WojciechMat wants to merge 15 commits into MIDI-120/unsupervised-training from MIDI-120/comparison
roszcz approved these changes on Nov 26, 2023
Comparison of midi-translation and midi-hf-transformers projects. Here is the pathetique.md:
Analysis of differences in data and training in midi-hf-transformers and midi-translation projects.
Abstract
I will outline the data flow within both projects, point out similarities in architecture, data, and training
hyperparameters, voice my expectations, and show how they have not been met.
Data
In both projects, the data processed by the encoder objects is identical at the input to these encoders.
They use precisely the same functions, particularly when using "dstart" as the time-quantization method.
To be specific, the data at this stage is a table with 9 columns and 128 rows, each row describing a note.
The quantization bins remain the same; everything about the inputs remains the same.
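For illustration, here is a minimal sketch of what the "dstart" quantization step boils down to; the column names and bin edges are placeholders I made up, not the actual values shared by the two projects.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the note table: 128 rows, one note per row
# (the real table has 9 columns; the names here are illustrative only).
notes = pd.DataFrame(
    {
        "pitch": np.random.randint(21, 109, size=128),
        "velocity": np.random.randint(1, 128, size=128),
        "dstart": np.random.exponential(scale=0.1, size=128),   # time since previous note start
        "duration": np.random.exponential(scale=0.3, size=128),
    }
)

# Hypothetical quantization bin edges in seconds (the real bins are shared by both projects).
dstart_bins = np.array([0.0, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0])

# Assign each dstart to a bin index, which later becomes part of a source token.
notes["dstart_bin"] = np.digitize(notes["dstart"], dstart_bins) - 1
print(notes[["dstart", "dstart_bin"]].head())
```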
Encoders
What changed in the encoders?
HuggingFace transformers use the same input and output layer sizes, which means they use the same
vocabulary for source and target tokens. Staying true to the HF implementation, encoders in the
midi_hf_transformers project have the ability to encode both source and target sequences with one vocabulary.
midi-translation encoders
midi-hf-transformers encoders
The HuggingFace transformer has to learn to use only a specific type of token from its vocabulary, yes, but
it seems to grasp this very quickly and reliably. I have never seen the transformer use a token that was not present
among the training target tokens. I do not think that the vocabulary size is an issue.
The tokens produced by these two pairs of encoders look exactly the same. For example:
They do differ in token ids (obviously):
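As a hypothetical illustration (the token strings and ids below are made up, not taken from either project), the same token ends up with a different id depending on whether the encoders keep two separate vocabularies or one shared vocabulary:

```python
# Separate vocabularies (midi-translation style) vs. one shared vocabulary
# (midi-hf-transformers style). Token strings here are invented examples.
src_tokens = ["60-3-2", "64-1-4"]                    # e.g. pitch-dstart_bin-duration_bin
tgt_tokens = ["<velocity_80>", "<velocity_92>"]

# Separate vocabularies: ids assigned independently for source and target tokens.
src_vocab = {tok: i for i, tok in enumerate(sorted(set(src_tokens)))}
tgt_vocab = {tok: i for i, tok in enumerate(sorted(set(tgt_tokens)))}

# Shared vocabulary: source and target tokens live in the same id space.
shared_vocab = {tok: i for i, tok in enumerate(sorted(set(src_tokens + tgt_tokens)))}

print([src_vocab[t] for t in src_tokens], [tgt_vocab[t] for t in tgt_tokens])
print([shared_vocab[t] for t in src_tokens], [shared_vocab[t] for t in tgt_tokens])
```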
Architecture
The creators of the T5 project, in
"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer",
claim the T5 architecture closely follows the one described in "Attention is All You Need".
That means it is essentially the same as our midi-translation model.
In both the midi-translation and midi-hf-transformers projects we are able to change some of the model hyperparameters,
such as:
Training
Learning rate schedule
midi-translation uses a learning rate schedule given by the equation:
lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))
This is the exact learning rate schedule used in the "Attention is All You Need" paper.
In the midi-hf-transformers project, the learning rate remains constant throughout training.
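A minimal sketch of both regimes, assuming d_model=512 and warmup_steps=4000 (those values reproduce the 2e-7 to 1e-3 range mentioned below; the actual project configs may differ):

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Warmup-then-inverse-sqrt-decay schedule used by midi-translation."""
    step = max(step, 1)  # avoid 0 ** (-0.5) on the very first step
    return d_model ** (-0.5) * min(step ** (-0.5), step * warmup_steps ** (-1.5))


def constant_lr(step: int, lr: float = 3e-6) -> float:
    """Constant learning rate used in the midi-hf-transformers runs."""
    return lr


for step in (1, 1_000, 4_000, 20_000):
    print(step, f"{noam_lr(step):.2e}", f"{constant_lr(step):.2e}")
```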
Loss function
The midi-translation transformer uses a label smoothing loss, described in detail in the paper
"Rethinking the Inception Architecture for Computer Vision".
HuggingFace transformers use regular cross-entropy loss.
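A minimal sketch of the two objectives, using PyTorch's built-in label smoothing rather than the projects' own loss implementations; the smoothing value of 0.1 is an assumption on my part:

```python
import torch
import torch.nn as nn

vocab_size, batch = 1000, 8
logits = torch.randn(batch, vocab_size)
targets = torch.randint(0, vocab_size, (batch,))

# midi-translation style: cross-entropy with label smoothing,
# as in "Rethinking the Inception Architecture for Computer Vision".
smoothed_loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)

# midi-hf-transformers style: plain cross-entropy, as used by HuggingFace models.
plain_loss = nn.CrossEntropyLoss()(logits, targets)

print(smoothed_loss.item(), plain_loss.item())
```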
Experiment
Metric
Because the loss functions are different, the average distance between predicted velocity and ground truth
is used to compare the performance of the two models.
Because velocity tokens appear in order and next to each other in the vocabularies,
the average distance between velocities is equal to the average distance between token ids, so that is what is
actually being calculated.
At the beginning of training, the T5 model, which has a much larger output vocabulary, shows a greater distance,
but once it learns to use only velocity tokens it should behave like the midi-translation model.
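A minimal sketch of this metric (the function and variable names are mine, not the projects'):

```python
import torch


def average_velocity_distance(pred_token_ids: torch.Tensor, true_token_ids: torch.Tensor) -> float:
    """Mean |predicted - target| over token ids, which equals the mean velocity error
    because velocity tokens occupy consecutive vocabulary positions."""
    return (pred_token_ids - true_token_ids).abs().float().mean().item()


pred = torch.tensor([70, 71, 68, 90])
true = torch.tensor([72, 65, 70, 88])
print(average_velocity_distance(pred, true))  # 3.0
```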
Expectations
Ideally, the T5 model will learn to predict velocities for a sequence of notes as well as (or better than)
our midi-translation model.
That is unfortunately not the case.
Parameters and learning rate
Two models with the same hyperparameters have been trained on the same data.
When using warmup_steps=4000, the midi-translation learning rate starts at 2e-7 and rises up to 1e-3 before starting to drop.

Constant learning rate of 3e-6 was used in midi-hf-transformers model.
I also let the HF model run for 25 epochs instead of 5, considering that maybe all
the learning rate schedule does is speed up training...
Results
midi-translation
check out wandb
The predictions look like the model knows what it is doing, as if it were really trying to play some
emotional music:
midi-hf-transformers
wandb
It is a 2.5 times larger distance than midi-translation, and it is still one of the best reached by the HF models.
The results look really flat:

After looking through predictions on the train split, it seems like the model is learning to predict the mean
of the velocities and spam it for every note instead of producing actual, more interesting results.
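A quick sanity check I would use here (my own diagnostic, not code from either project): if the model collapses to the mean, the spread of its predicted velocities is much smaller than the spread of the ground truth.

```python
import numpy as np


def spread_report(predicted: np.ndarray, target: np.ndarray) -> None:
    """Compare the standard deviation of predicted vs. ground-truth velocities."""
    print(f"pred std: {predicted.std():.2f}, target std: {target.std():.2f}")


# A collapsed model predicts roughly the same velocity everywhere:
spread_report(
    predicted=np.full(128, 70.0),
    target=np.random.randint(30, 110, size=128).astype(float),
)
```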
Conclusions
Both models are fed the same data, use the same dashboard logic, and theoretically have the same architecture.
Because of the differences in lr schedule and loss function, the HuggingFace model is unexpectedly pathetic at learning
dynamic expression in music.
Perhaps cross-entropy, while working fine in the language domain, fails miserably as an effective
training objective for musical models, and that is where the issue comes from.
I guess that is what I will explore in my next experiment...