
comparison #6


Open · wants to merge 15 commits into MIDI-120/unsupervised-training

Conversation


@WojciechMat WojciechMat commented Nov 23, 2023

Comparison of midi-translation and midi-hf-transformers projects...
Here is the pathetique.md:

An analysis of the differences in data and training between the midi-hf-transformers and midi-translation projects.

Abstract

I will outline the data flow within both projects, point out similarities in architecture, data, and training
hyperparameters, voice my expectations, and show how they have not been met.

Data

In both projects, the data processed by the encoder objects is identical at the input to those encoders.
They use precisely the same preprocessing functions, particularly when using "dstart" as the time-quantization method.
To be specific, the data at this point is a table with 128 rows, each row describing a single note, with the following columns:

pitch dstart_bin duration_bin velocity_bin start end quant_start quant_duration velocity source

The quantization bins remain the same. Everything else about the data remains the same as well.
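
To make the structure concrete, here is a minimal sketch of what such a record could look like as a pandas DataFrame. The values are made up for illustration; only the column names come from the projects.

```python
import pandas as pd

# Hypothetical values -- only the column layout reflects the actual data.
record = pd.DataFrame(
    [
        {
            "pitch": 42, "dstart_bin": 0, "duration_bin": 1, "velocity_bin": 0,
            "start": 0.00, "end": 0.52, "quant_start": 0.0, "quant_duration": 0.5,
            "velocity": 67, "source": "maestro-v1",
        },
        {
            "pitch": 67, "dstart_bin": 0, "duration_bin": 1, "velocity_bin": 1,
            "start": 0.01, "end": 0.60, "quant_start": 0.0, "quant_duration": 0.5,
            "velocity": 82, "source": "maestro-v1",
        },
    ]
)
print(record.shape)  # a full training example has 128 such rows
```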

Encoders

What changed in the encoders?

HuggingFace transformers use the same input and output layer sizes, which means they use the same
vocabulary for source and target tokens. Staying true to the HF implementation, the encoders in the
midi-hf-transformers project are able to encode both source and target sequences with a single vocabulary.

midi-translation encoders

```mermaid
classDiagram
class QuantizedMidiEncoder{
    specials: Dict[str]  # CLS token only
    vocab: list[str]  # containing all possible note tokens
    encode() from record to token ids
    decode() from token ids to a dataframe
}
class VelocityEncoder{
    specials: Dict[str]  # CLS token only
    vocab: list[str]  # containing 128 velocity tokens
    encode() from record to token ids
    decode() from token ids to a list[int]
}
```

midi-hf-transformers encoders

```mermaid
classDiagram
class QuantizedMidiEncoder{
    specials: Dict[str]  # CLS and PAD tokens
    vocab: list[str]  # containing all possible note tokens
    encode_src() from record to token ids
    decode_src() from token ids to a dataframe
}
class VelocityEncoder{
    specials: Dict[str]  # CLS and PAD tokens
    _src_encoder: QuantizedMidiEncoder  # encoder for source data
    vocab: list[str]  # containing src_encoder vocab and 128 velocity tokens
    encode_src()
    decode_src()
    encode_tgt()
    decode_tgt()
}
```

The HuggingFace transformer does have to learn to use only a specific subset of tokens from its vocabulary, but
it seems to grasp this very quickly and reliably. I have never seen a transformer use a token that was not present
among the training target tokens, so I do not think the vocabulary size is an issue.

The tokens produced by these two pairs of encoders look exactly the same. For example:

tokens from midi-translation encoders:
src: ['<CLS>', '42-0-1-0', '67-0-1-1', '55-0-1-0', '59-0-1-0', '49-0-1-0', '53-1-1-1', '63-0-2-1', ... ]
tgt: ['<CLS>', '6', '73', '53', '42', '42', '68', '69', '78', '62', '79', '60', '55', '84', '75', ... ]
tokens from midi-hf-transformers encoder:
src: ['<CLS>', '42-0-1-0', '67-0-1-1', '55-0-1-0', '59-0-1-0', '49-0-1-0', '53-1-1-1', '63-0-2-1', ... ]
tgt: ['<CLS>', '6', '73', '53', '42', '42', '68', '69', '78', '62', '79', '60', '55', '84', '75', ... ]

They do differ in token ids (obviously):

midi-translation:
src: [ 0,  571, 1247,  922, 1030,  760,  878, 1142, 1386,  548, 1287,  719, ... ]
tgt: [ 0,   7,  74,  54,  43,  43,  69,  70,  79,  63,  80,  61,  56,  85, ... ]

midi-hf-transformers:
src: [ 0,  572, 1248,  923, 1031,  761,  879, 1143, 1387,  549, 1288,  720, ... ]
tgt: [ 0, 2384, 2451, 2431, 2420, 2420, 2446, 2447, 2456, 2440, 2457, 2438, ... ]
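
The shift in target ids follows from the shared vocabulary described above: in midi-hf-transformers the 128 velocity tokens sit after all of the note tokens, so their ids land in the thousands rather than right after `<CLS>`. Below is a minimal, hypothetical sketch of such a layout; the actual pitch range, bin counts, ordering, and special tokens in the project may differ.

```python
# Hypothetical construction: only meant to show why velocity token ids in the
# shared vocabulary land in the thousands instead of right after <CLS>.
specials = ["<CLS>", "<PAD>"]

# stand-in for all "pitch-dstart_bin-duration_bin-velocity_bin" note tokens
note_tokens = [
    f"{pitch}-{d}-{dur}-{vel}"
    for pitch in range(21, 109)   # assumed piano pitch range
    for d in range(5)             # dstart bins
    for dur in range(5)           # duration bins
    for vel in range(3)           # velocity bins
]
velocity_tokens = [str(v) for v in range(128)]

shared_vocab = specials + note_tokens + velocity_tokens
token_to_id = {tok: i for i, tok in enumerate(shared_vocab)}

# Velocity "6" gets id len(specials) + len(note_tokens) + 6, far from 0,
# whereas in midi-translation's separate target vocabulary it sits near the front.
print(token_to_id["6"])
```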

Architecture

The creators of T5, in
"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer",
state that the T5 architecture closely follows the encoder-decoder Transformer described in "Attention Is All You Need".
That means it is, for our purposes, the same architecture as our midi-translation model.

In both the midi-translation and midi-hf-transformers projects we are able to change some of the model hyperparameters,
such as:

  d_model
  d_kv
  d_ff
  num_layers
  num_heads
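
For the HF side, these names map directly onto `T5Config`. A hypothetical sketch of how such a model could be instantiated follows; the vocabulary size and token-id settings are placeholders, not the project's actual values.

```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=2458,           # placeholder: size of the shared src+tgt vocabulary
    d_model=256,
    d_kv=64,
    d_ff=1024,
    num_layers=4,
    num_heads=4,
    pad_token_id=1,            # placeholder: id of the <PAD> token
    decoder_start_token_id=0,  # placeholder: id of the <CLS> token
)
model = T5ForConditionalGeneration(config)
```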

Training

Learning rate schedule

midi-translation uses a learning rate schedule given by the equation:

lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))

This is exactly the lr schedule used in the "Attention Is All You Need" paper.
In the midi-hf-transformers project, the learning rate remains constant throughout training.
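
As a sanity check, here is a minimal Python sketch of the midi-translation schedule above; the default d_model and warmup_steps values mirror the experiment configuration further down.

```python
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    # lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


print(noam_lr(1))     # ~2.5e-7 -- the "starts from 2e-7" value reported below
print(noam_lr(4000))  # ~9.9e-4 -- the peak, the "rises up to 1e-3" value reported below
```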

Loss function

The midi-translation transformer uses the label smoothing loss described in detail in the paper "Rethinking the Inception
Architecture for Computer Vision".

HuggingFace transformers use regular cross-entropy loss.
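
For reference, a minimal sketch of the difference using PyTorch's built-in label smoothing option; the actual midi-translation criterion is a custom implementation, so this only approximates the idea.

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 2458)            # (batch, vocab_size), placeholder shapes
targets = torch.randint(0, 2458, (8,))

plain_ce = nn.CrossEntropyLoss()                         # regular cross-entropy
smoothed_ce = nn.CrossEntropyLoss(label_smoothing=0.1)   # spreads 0.1 of the target mass uniformly

print(plain_ce(logits, targets), smoothed_ce(logits, targets))
```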

Experiment

Metric

Because the loss functions are different, the average distance between the predicted velocity and the ground truth
is used to compare the performance of the two models.
Because the velocity tokens appear in order and next to each other in both vocabularies,
the average distance between velocities is equal to the average distance between token ids, so that is what is
really being calculated.
At the beginning of training, the T5 model, which has a much larger output vocabulary, shows a greater distance,
but once it learns to use only velocity tokens it should behave like the midi-translation model.
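
A minimal sketch of that metric (the function name and the padding handling are assumptions, not the projects' exact code):

```python
from typing import Optional

import torch


def average_velocity_distance(
    pred_ids: torch.Tensor, tgt_ids: torch.Tensor, pad_id: Optional[int] = None
) -> float:
    # Velocity tokens are contiguous and ordered in both vocabularies, so the
    # absolute difference between token ids equals the difference in velocity.
    mask = torch.ones_like(tgt_ids, dtype=torch.bool) if pad_id is None else tgt_ids != pad_id
    return (pred_ids - tgt_ids).abs().float()[mask].mean().item()
```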

Expectations

Ideally, the T5 model would learn to predict velocities for a sequence of notes as well as (or better than)
our midi-translation model.
That is unfortunately not the case.

Parameters and learning rate

Two models with the same hyperparameters have been trained.

model:
  d_model: 256
  d_ff: 1024
  num_layers: 4
  num_heads: 4

On the same data:

dataset_name: 'roszcz/maestro-v1-sustain'
dataset:
  sequence_len: 128
  sequence_step: 42

  quantization:
    dstart: 5
    duration: 5
    velocity: 3

When using warmup_steps=4000, the midi-translation learning rate starts from 2e-7 and rises up to 1e-3 before starting to drop.
[figure: learning rate schedule]

A constant learning rate of 3e-6 was used for the midi-hf-transformers model.

I also let the HF model run for 25 epochs instead of 5, in case all the learning rate schedule really does is
speed up training...

Results

midi-translation

check out wandb

val_loss: 2.486
val_dist: 4.682

The predictions look like the model knows what it is doing, as if it were really trying to play some
emotional music:
[figure: midi-translation velocity predictions]

midi-hf-transformers

wandb

val_loss: 4.445
val_dist: 11.724

That is a 2.5 times larger distance than midi-translation, and it is still one of the best results reached by the HF models.

The results look reaaaally flat:
[figure: midi-hf-transformers velocity predictions]

After looking through predictions on the train split, it seems like the model is learning to predict the mean
of the velocities and repeat it for every note instead of producing actual, more interesting results.
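
A quick, hypothetical way to check for that collapse (function and variable names are mine, not the projects'): compare the spread of the predicted velocities with the spread of the ground truth.

```python
import numpy as np


def flatness_report(pred_velocities, true_velocities) -> None:
    # A model that has collapsed to predicting (roughly) the mean velocity
    # shows a much smaller standard deviation than the ground truth.
    pred, true = np.asarray(pred_velocities), np.asarray(true_velocities)
    print(f"pred: mean={pred.mean():.1f}, std={pred.std():.1f}")
    print(f"true: mean={true.mean():.1f}, std={true.std():.1f}")
```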

Conclusions

Both models are fed the same data, use the same dashboard logic and theoretically have the same architecture.

Because of the differences in lr schedule and loss function, the HuggingFace model is unexpectedly pathetic at learning
dynamic expression in music.
Perhaps cross-entropy, while working fine in the language domain, fails miserably as an effective
objective for musical models, and that is where the issue comes from.

I guess that is what I will explore in my next experiment ...

@WojciechMat WojciechMat changed the base branch from master to MIDI-120/unsupervised-training November 23, 2023 21:13
Co-authored-by: Tomek Roszczynialski <[email protected]>
@WojciechMat WojciechMat force-pushed the MIDI-120/unsupervised-training branch from 6abb305 to 5f4b377 Compare December 26, 2023 16:44
@roszcz roszcz added the documentation Improvements or additions to documentation label Dec 28, 2023