Problematic spectrograms #6

Open
DeivisonJordao opened this issue Nov 27, 2024 · 3 comments

@DeivisonJordao

I am currently experiencing an issue during the preprocessing step that seems to affect the resulting metrics. Specifically, when I generate spectrograms from scratch (instead of using the spectrograms you provided), something goes awry, which negatively impacts the training process.
To elaborate, I ran some tests where I downloaded audio from the Harmonix dataset (Hung et al. version), which provides specific URLs to ensure that the song versions are consistent. I then used your annotations, but the spectrograms I generated differed from those you provided, and I believe this discrepancy should not be occurring. Moreover, the issue arises with every dataset I try to train the model on, suggesting it is not a data problem. To illustrate, here are the spectrogram I generated with preprocess_audio.py and the spectrogram you provided, respectively:
Generated:
[[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
...
[0.01976 0.02803 0.02625 ... 0.1076 0.07574 0.05112 ]
[0.005005 0.008064 0.01441 ... 0.05185 0.03864 0.02725 ]
[0.0004032 0.0003197 0.0002983 ... 0.004044 0.004147 0.004253 ]]
Provided:
[[9.7046e-03 1.0124e-02 4.0359e-03 ... 8.3984e-02 1.0144e-01 6.7139e-02]
[1.8213e-01 3.6499e-01 5.6641e-01 ... 6.5735e-02 7.7759e-02 6.1279e-02]
[1.8584e+00 2.1426e+00 2.4414e+00 ... 1.3447e+00 8.3691e-01 4.7192e-01]
...
[6.7344e+00 6.2734e+00 6.8320e+00 ... 4.3516e+00 4.3516e+00 3.0762e+00]
[6.7617e+00 5.9805e+00 7.2148e+00 ... 4.4766e+00 3.9160e+00 2.4297e+00]
[6.1719e+00 6.1953e+00 6.8672e+00 ... 3.5137e+00 3.2461e+00 1.8340e+00]]
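
To compare beyond these printed excerpts, a quick numeric check says more than eyeballing the output. A rough sketch of what I mean, assuming both arrays are available as .npy files (the file names here are placeholders):

import numpy as np

# Placeholder file names; substitute however the two spectrograms are stored.
generated = np.load("generated_spectrogram.npy")
provided = np.load("provided_spectrogram.npy")

print(generated.shape, provided.shape)   # a shape mismatch would already be a red flag
diff = np.abs(generated.astype(np.float32) - provided.astype(np.float32))
print("max abs diff:", diff.max(), "mean abs diff:", diff.mean())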
What I find puzzling is that I have not modified the preprocessing code at all. I am wondering whether a different version of the code was perhaps used to generate the spectrograms you provided?
Any guidance or insight you could provide on this matter would be greatly appreciated.
Thank you for your assistance.

@f0k
Member

f0k commented Nov 28, 2024

Dear Deivison, could you possibly repeat this experiment with one of the publicly available datasets, to ensure we have the same audio? If you obtain the Harmonix dataset via the YouTube URLs, it's possible we end up with different files (at the very least because YouTube provides different formats to download). So please pick one of the datasets that has a source linked on the description at Zenodo. Then let us know the file you picked and again share if and how the spectrograms differ so we can look into it. It's possible we need to pin the torchaudio and/or ffmpeg version to ensure perfect reproduction.

@DeivisonJordao
Author

I was able to obtain the Harmonix version used by BeatThis (Hung et al.) and generated the spectrogram for the song 0001_12step. Here's what I got:

Generated:
[[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
...
[6.95 6.37 6.254 ... 4.707 4.28 2.604 ]
[4.625 4.562 4.55 ... 2.867 2.422 1.055 ]
[0.01285 0.01888 0.02885 ... 0.0394 0.01982 0.00802]]

Expected:

[[0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
[0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
[1.069e-02 8.430e-03 6.313e-03 ... 4.517e-03 4.044e-03 1.930e-03]
...
[6.949e+00 6.371e+00 6.254e+00 ... 4.707e+00 4.281e+00 2.604e+00]
[4.625e+00 4.562e+00 4.551e+00 ... 2.867e+00 2.422e+00 1.053e+00]
[4.932e-02 4.547e-02 4.092e-02 ... 4.922e-02 2.989e-02 3.122e-02]]

To confirm that this actually affects the training results, I ran a training with python launch_scripts/train.py --compile and let it run for 30 epochs; I stopped early because the loss was not changing significantly and the other metrics were not improving.
I strongly suspect it is something related to the number notation in the generated spectrograms; it looks to me as if the values are being truncated. Let me know your thoughts!
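
To check whether this is merely a difference in how numpy prints the values rather than a real one, I would force both arrays into the same notation and compare the raw values. A rough sketch (the file names here are made up):

import numpy as np

gen = np.load("generated_0001_12step.npy")   # placeholder file names
exp = np.load("expected_0001_12step.npy")

print(gen.dtype, exp.dtype)   # a dtype mismatch could explain seemingly truncated values

# Force identical scientific notation for both arrays, so differences in the
# printed representation cannot be mistaken for differences in the data.
np.set_printoptions(formatter={'float_kind': lambda v: f'{v:.3e}'})
print(gen[:3])
print(exp[:3])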

@f0k
Member

f0k commented Nov 29, 2024

> for the song 0001_12step

Ok, to be sure we have the same file, I get the following checksum:

$ md5sum 0001_12step.mp3 
0da672091abf01a1a191d452318f3590  0001_12step.mp3
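
If md5sum is not available on your platform, the same checksum can be computed in Python; it should print the same digest if we have the same file:

>>> import hashlib
>>> hashlib.md5(open('0001_12step.mp3', 'rb').read()).hexdigest()
'0da672091abf01a1a191d452318f3590'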

But let's walk this through with a public file that is in .wav format, to rule out some of the possible distractors. I'll pick 00_BN1-129-Eb_comp_mix.wav from guitarset, from the audio_mono-pickup_mix.zip.

First, loading the file:

>>> from beat_this.preprocessing import load_audio
>>> wav, sr = load_audio('00_BN1-129-Eb_comp_mix.wav')
>>> wav.shape
(984506,)
>>> wav[11000:11005]
array([-0.0390625 , -0.0390625 , -0.0385437 , -0.03781128, -0.03753662])

Resampling:

>>> import soxr
>>> wav = soxr.resample(wav, sr, 22050)
>>> wav.shape
(492253,)
>>> wav[11000:11005]
array([-0.0120177 , -0.01190949, -0.01210577, -0.01253254, -0.0126941 ])

Quantizing to 16 bit and back (because the preprocessing script goes via 16-bit .wav files):

>>> import numpy as np
>>> wav = np.round(wav * 32768).astype(int) / 32768
>>> wav[11000:11005]
array([-0.01202393, -0.01190186, -0.01211548, -0.01254272, -0.01269531])
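
As an aside, this 16-bit round trip changes each sample by at most half a quantization step, 0.5 / 32768 ≈ 1.5e-5, so it cannot by itself cause a large mismatch. A quick standalone check (using a random stand-in signal, since wav was overwritten above):

>>> x = np.random.uniform(-1, 1, 100000)
>>> x_q = np.round(x * 32768).astype(int) / 32768
>>> np.abs(x_q - x).max() <= 0.5 / 32768
True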

Spectrogram:

>>> import torch
>>> from beat_this.preprocessing import LogMelSpect
>>> lms = LogMelSpect()
>>> spect = lms(torch.from_numpy(wav).float())
>>> spect.shape
torch.Size([1117, 128])
>>> spect.numpy().astype(np.float16)
array([[1.535 , 0.877 , 0.3423, ..., 0.8037, 0.7173, 0.7153],
       [4.043 , 4.066 , 4.05  , ..., 0.9434, 0.706 , 0.2277],
       [4.1   , 4.46  , 4.71  , ..., 0.8335, 0.664 , 0.2898],
       ...,
       [1.48  , 1.095 , 0.5312, ..., 0.735 , 0.7686, 0.2986],
       [1.042 , 0.9995, 1.0205, ..., 0.7456, 0.658 , 0.2261],
       [2.154 , 0.8213, 1.528 , ..., 1.37  , 1.393 , 0.68  ]],
      dtype=float16)

Comparing to the distributed data:

>>> from beat_this.dataset.mmnpz import MemmappedNpzFile
>>> spect2 = MemmappedNpzFile('data/audio/spectrograms/guitarset.npz')['00_BN1-129-Eb_comp_mix/track']
>>> spect2.shape
(1117, 128)
>>> spect2
memmap([[1.538 , 0.879 , 0.3528, ..., 0.8276, 0.7295, 0.707 ],
        [4.043 , 4.066 , 4.05  , ..., 0.9385, 0.699 , 0.2213],
        [4.1   , 4.46  , 4.71  , ..., 0.8345, 0.654 , 0.264 ],
        ...,
        [1.481 , 1.093 , 0.5317, ..., 0.7446, 0.7705, 0.309 ],
        [1.041 , 0.999 , 1.02  , ..., 0.7407, 0.6333, 0.2134],
        [2.156 , 0.8257, 1.53  , ..., 1.358 , 1.397 , 0.675 ]],
       dtype=float16)
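
To put a number on the remaining difference rather than eyeballing it, one can continue the session like this (outputs omitted here):

>>> diff = np.abs(spect.numpy().astype(np.float32) - spect2.astype(np.float32))
>>> diff.max(), diff.mean()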

This is as close as I can get. There is still a small mismatch between the distributed files for guitarset (which were computed in June) and the one I get with the current code and dependencies, which are:

>>> import torchaudio
>>> torchaudio.__version__
'2.3.1'
>>> torchaudio.utils.ffmpeg_utils.get_versions()
{'libavcodec': (58, 91, 100),
 'libavdevice': (58, 10, 100),
 'libavfilter': (7, 85, 100),
 'libavformat': (58, 45, 100),
 'libavutil': (56, 51, 100)}
>>> soxr.__version__
'0.3.7'
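
For comparing your environment against mine, a small check like this should do (the versions above are just what I happen to have; others may work equally well):

>>> import torchaudio, soxr
>>> print('torchaudio', torchaudio.__version__)   # mine: 2.3.1
>>> print('soxr', soxr.__version__)               # mine: 0.3.7
>>> torchaudio.utils.ffmpeg_utils.get_versions()  # compare the ffmpeg libraries too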

But I strongly doubt that these differences have any effect on training or evaluation.

> To confirm that this actually affects the training results, I ran a training with python launch_scripts/train.py --compile and let it run for 30 epochs

After 30 epochs, you should see an F1 score above 0.90 on the validation set (if training on all datasets). Can you try a training run with only the datasets from Zenodo first?
