Audio range out of (-1,+1) #1254

Open
KarelVesely84 opened this issue Jan 4, 2024 · 2 comments

Comments

@KarelVesely84
Contributor

KarelVesely84 commented Jan 4, 2024

Hello @pzelasko, @csukuangfj,
I just identified an open question related to the audio transforms.

In lhotse, there is the Resample class wrapping the torchaudio.transforms.Resample().

When resampling common_voice_cs_26209290 from 32 kHz to 16 kHz, audio.max() becomes 1.0079.
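
A likely explanation, not confirmed in this thread, is that the low-pass (anti-aliasing) filtering inside a resampler can overshoot near sharp transitions, so the output peak can exceed the input peak. The sketch below illustrates the effect with plain NumPy rather than torchaudio; the truncated-sinc filter, the square-wave input, and all variable names are assumptions chosen for illustration and do not reproduce what torchaudio.transforms.Resample() does internally.

```python
import numpy as np

# Hypothetical stand-in for a resampler's anti-aliasing filter: a truncated
# sinc low-pass with its cutoff at half of the original Nyquist frequency.
taps = np.arange(-64, 65)
lowpass = 0.5 * np.sinc(taps / 2.0)
lowpass /= lowpass.sum()  # normalize to unit gain at DC

# Full-scale square wave: every input sample is exactly -1.0 or +1.0.
half_period = 100
square = np.tile(
    np.concatenate([np.ones(half_period), -np.ones(half_period)]), 5
)

# Low-pass filter, then keep every second sample (2:1 rate reduction).
filtered = np.convolve(square, lowpass, mode="same")
decimated = filtered[::2]

print(f"input peak:  {np.abs(square).max():.4f}")
print(f"output peak: {np.abs(decimated).max():.4f}")
```

The input never leaves [-1, +1], yet the filtered output does, which is the same kind of behavior as the 1.0079 peak reported above.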

In Icefall's streaming_decode.py, there is a check that the maximum audio sample must be <= 1.0.

What would be the cleanest solution to this?
a) Stop checking for audio.abs().max()<=1.0 in Icefall.
b) Introduce an audio-clipping AudioTransform in lhotse.
c) Introduce an audio Limiter AudioTransform in lhotse (something like: if audio.abs().max() > 0.99: rescale_to_099(...)).
d) Try to add a check to torchaudio, in torchaudio.transforms.Resample().
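
For comparison, options b) and c) could look roughly like the sketch below. It uses plain NumPy functions rather than lhotse's AudioTransform interface, and the names clip_audio and limit_peak are hypothetical; neither exists in lhotse.

```python
import numpy as np


def clip_audio(samples: np.ndarray, limit: float = 1.0) -> np.ndarray:
    """Option b): hard-clip every sample into [-limit, +limit]."""
    return np.clip(samples, -limit, limit)


def limit_peak(samples: np.ndarray, threshold: float = 0.99) -> np.ndarray:
    """Option c): rescale the whole signal, but only if its peak exceeds threshold."""
    peak = np.abs(samples).max()
    if peak > threshold:
        return samples * (threshold / peak)
    return samples


# Toy signal whose peak matches the value reported for the resampled file.
audio = np.array([0.2, -0.5, 1.0079, -0.9])
clipped = clip_audio(audio)   # only the offending sample changes
limited = limit_peak(audio)   # every sample is scaled by the same factor
```

Clipping leaves in-range samples untouched but distorts the peaks, while peak rescaling preserves the waveform shape at the cost of changing the overall level of that recording.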

I guess a similar issue would appear also for volume perturbation, but I did not check that specifically.

Best regards
Karel

// PS: All the best in the new "western" year!!

@csukuangfj
Contributor

The -1 to 1 check in icefall is to avoid the issue when users pass samples in the range -32768 to 32767 to the model, which is the default behavior in Kaldi.

I think it is safe to enlarge the range as long as we can achieve the same goal.

KarelVesely84 added a commit to KarelVesely84/icefall that referenced this issue Jan 4, 2024
- some AudioTransform classes produce audio signals out of range [-1,+1]
   - Resample produced 1.0079
   - The range [-10,+10] was chosen to still be able to reliably
     distinguish from the [-32k,+32k] signal...
- this is related to : lhotse-speech/lhotse#1254
@pzelasko
Collaborator

pzelasko commented Jan 4, 2024

Hi Karel! We had another issue related to this somewhere. Technically we could either add conditional rescaling (if np.max(np.abs(audio)) > 1.0, then divide audio by maxabs value) or a limiter (I have one in a separate pip package https://github.com/pzelasko/cylimiter), but I'm just not sure if it's worth paying the runtime cost. If it's not a strict requirement in Icefall I think it's OK to leave it as it is.

csukuangfj pushed a commit to k2-fsa/icefall that referenced this issue Jan 5, 2024
…1448)

- some AudioTransform classes produce audio signals out of range [-1,+1]
   - Resample produced 1.0079
   - The range [-10,+10] was chosen to still be able to reliably
     distinguish from the [-32k,+32k] signal...
- this is related to : lhotse-speech/lhotse#1254
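
The relaxed check described in this commit message can be sketched as follows. This is not the Icefall code; check_audio_range is a hypothetical helper, and the default limit of 10.0 simply mirrors the [-10, +10] range named above.

```python
import numpy as np


def check_audio_range(samples: np.ndarray, limit: float = 10.0) -> None:
    """Reject audio that looks like raw 16-bit integer values.

    Hypothetical helper: small overshoots above 1.0 (e.g. 1.0079 after
    resampling) pass, while samples on the [-32768, 32767] scale fail.
    """
    peak = float(np.abs(samples).max())
    if peak > limit:
        raise ValueError(
            f"Peak sample magnitude {peak:.1f} exceeds {limit}; "
            "expected float audio scaled to roughly [-1, 1], "
            "not 16-bit integer values in [-32768, 32767]."
        )


# A slight overshoot is tolerated and the call returns without raising.
check_audio_range(np.array([0.3, -1.0079]))
```

One caveat of any fixed threshold: a very quiet recording stored on the integer scale with a peak below the limit would still pass, so the check catches typical misuse rather than every case.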