Audio range out of (-1,+1) #1254

KarelVesely84 · 2024-01-04T13:25:16Z

Hello @pzelasko, @csukuangfj,
I just identified an open question related to the audio transforms.

In lhotse, the Resample class wraps torchaudio.transforms.Resample().

When resampling the utterance common_voice_cs_26209290 from 32 kHz to 16 kHz, audio.max() becomes 1.0079.
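One plausible explanation for a peak above 1.0 (an assumption, not something confirmed in this thread) is that the low-pass filtering inside a resampler can overshoot near sharp transitions, even when every input sample lies within [-1, +1]. The sketch below illustrates the effect with an idealized FFT brick-wall filter in NumPy; it is not torchaudio's actual resampling kernel, and the helper name `brickwall_downsample_by_2` is hypothetical.

```python
import numpy as np


def brickwall_downsample_by_2(x):
    """Illustrative 2x downsampler: ideal low-pass via FFT, then keep every 2nd sample.

    Assumption: this is NOT how torchaudio.transforms.Resample works internally;
    it only shows that band-limiting alone can raise the peak of a signal.
    """
    spectrum = np.fft.rfft(x)
    cutoff = len(x) // 4  # bin index of the new Nyquist frequency after 2x decimation
    spectrum[cutoff:] = 0.0  # remove everything at or above the new Nyquist
    filtered = np.fft.irfft(spectrum, n=len(x))
    return filtered[::2]


# A square wave that exactly touches +/-1.0, a worst case for filter overshoot.
n = np.arange(1024)
square = np.where((n % 128) < 64, 1.0, -1.0)

resampled = brickwall_downsample_by_2(square)
print("input peak: ", np.abs(square).max())
print("output peak:", np.abs(resampled).max())
```

The input peak is exactly 1.0, while the band-limited output exceeds it. Real speech has softer transitions than a square wave, so a small excess such as 1.0079 would be consistent with this kind of overshoot.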

In streaming_decode.py in Icefall, there is a check that the maximum audio sample must be <= 1.0.

What would be the cleanest solution to this?
a) Stop checking for audio.abs().max()<=1.0 in Icefall.
b) Introduce an audio clipping AudioTransform to lhotse.
c) Introduce an audio Limiter AudioTransform (something like: if audio.abs().max() > 0.99: rescale_to_099(...)) to lhotse.
d) Try to add a check to torchaudio, into torchaudio.transforms.Resample().

I guess a similar issue would appear also for volume perturbation, but I did not check that specifically.
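For illustration, option (b) could look roughly like the following. This is a minimal sketch on a plain NumPy array; the function name `clip_audio` is hypothetical, and it does not implement lhotse's AudioTransform interface.

```python
import numpy as np


def clip_audio(audio, limit=1.0):
    """Hard-clip samples to [-limit, +limit].

    Hypothetical helper, not an existing lhotse transform. Only the
    out-of-range samples change; all other samples are left untouched.
    """
    return np.clip(audio, -limit, limit)


# Example: one sample overshoots slightly, as observed after resampling.
audio = np.array([0.25, -0.5, 1.0079, -1.0])
clipped = clip_audio(audio)
print(clipped)
```

Hard clipping is cheap, but it distorts the waveform at the clipped samples, which is the trade-off against the rescaling in option (c).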

Best regards
Karel

P.S.: All the best in the new "western" year!!

csukuangfj · 2024-01-04T13:34:14Z

The -1 to 1 check in icefall is to avoid the issue when users pass samples in the range -32768 to 32767 to the model, which is the default behavior in Kaldi.

I think it is safe to enlarge the range as long as we can achieve the same goal.

- some AudioTransform classes produce audio signals out of range [-1,+1]
- Resample produced 1.0079
- The range [-10,+10] was chosen to still be able to reliably distinguish from the [-32k,+32k] signal...
- this is related to: lhotse-speech/lhotse#1254
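The idea behind the relaxed range can be sketched as below. This is an illustration only, not the actual code from streaming_decode.py; the function name `check_sample_range` and the error message are assumptions.

```python
import numpy as np


def check_sample_range(samples, limit=10.0):
    """Reject audio that looks like int16-scaled samples (roughly [-32768, 32767]).

    Hypothetical helper: the limit of 10.0 mirrors the [-10, +10] range named
    in the commit message. Slight overshoots such as 1.0079 pass the check.
    """
    peak = float(np.abs(samples).max())
    if peak > limit:
        raise ValueError(
            f"Peak sample {peak} exceeds {limit}; expected float audio "
            "roughly in [-1, 1], not int16-scaled samples."
        )
    return peak


# Slightly out-of-range float audio is accepted.
peak_ok = check_sample_range(np.array([0.1, -0.7, 1.0079]))

# Kaldi-style int16-scaled samples are rejected.
try:
    check_sample_range(np.array([1200.0, -32768.0, 845.0]))
    rejected = False
except ValueError:
    rejected = True
print(peak_ok, rejected)
```

Any threshold well above 1 and well below typical int16 magnitudes serves the same purpose; 10 is the value chosen in the linked change.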

pzelasko · 2024-01-04T21:14:55Z

Hi Karel! We had another issue related to this somewhere. Technically we could either add conditional rescaling (if np.max(np.abs(audio)) > 1.0, then divide audio by maxabs value) or a limiter (I have one in a separate pip package https://github.com/pzelasko/cylimiter), but I'm just not sure if it's worth paying the runtime cost. If it's not a strict requirement in Icefall I think it's OK to leave it as it is.
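The conditional rescaling mentioned above could be sketched like this. It is a minimal NumPy version with a hypothetical function name (`rescale_if_needed`) and parameter (`max_peak`), not an existing lhotse API and not the cylimiter package.

```python
import numpy as np


def rescale_if_needed(audio, max_peak=1.0):
    """Scale the whole signal down only when its peak exceeds max_peak.

    Hypothetical helper. In-range audio is returned unchanged; otherwise
    all samples are scaled by the same factor so the peak equals max_peak.
    """
    peak = np.abs(audio).max()
    if peak > max_peak:
        return audio * (max_peak / peak)
    return audio


loud = np.array([0.5, -1.0079, 0.25])
quiet = np.array([0.5, -0.9, 0.25])

rescaled = rescale_if_needed(loud)
untouched = rescale_if_needed(quiet)
print(np.abs(rescaled).max(), np.abs(untouched).max())
```

The runtime cost is one pass over the samples to find the peak, plus one multiplication pass when rescaling is triggered. Unlike a limiter, this changes the gain of the entire signal rather than only the loud parts.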

KarelVesely84 mentioned this issue Jan 4, 2024

streaming_decode.py, relax the audio range from [-1,+1] to [-10,+10] k2-fsa/icefall#1448

Merged
