Multi channel audio merging #320


Merged
merged 4 commits into main on Apr 3, 2025
Conversation

@flashno (Contributor) commented on Mar 28, 2025

Previous versions of WhisperKit used only the first channel of audio, regardless of whether the file contained two or more channels. This update allows developers to combine all available channels, or choose specific channels to combine, before feeding the audio into transcription. To set the configuration when loading WhisperKit, pass it through WhisperKitConfig similar to this:

        let config = WhisperKitConfig(
            ...
            audioInputConfig: AudioInputConfig(channelMode: .sumChannels([1, 3, 5]))
        )

This will sum the 2nd, 4th, and 6th channels (the first channel is index 0).
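For reference, the channel selection mode is roughly an enum of the following shape (a minimal sketch based on the cases shown in this PR; the exact declaration in WhisperKit may differ):

```swift
// Sketch of the channel selection options (case names taken from the examples
// in this PR; whether "sum all channels" is expressed as sumChannels(nil) or a
// separate case is an assumption).
public enum ChannelMode {
    /// Use a single channel, identified by its zero-based index.
    case specificChannel(Int)
    /// Sum the given channel indices into mono; nil sums every channel.
    case sumChannels([Int]?)
}
```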
Another way to pass the channelMode is to feed it directly into the loadAudio function like this:

        let audioArray = try AudioProcessor.loadAudioAsFloatArray(
            fromPath: audioPath,
            channelMode: .specificChannel(0)
        )

This option has been added to all audio processing APIs that read audio.

⚠️ Important: This PR changes the default behavior: if the audio is not already mono, all channels are now summed and normalized.
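For example (a hedged sketch; it assumes the channelMode parameter has a default that sums all channels, and the file path is hypothetical):

```swift
// After this PR, loading a multi-channel file with no explicit channelMode
// yields a single summed-and-normalized mono array by default.
// ("interview_stereo.wav" is a hypothetical path; the default-argument
// behavior shown here is an assumption based on the description above.)
let monoSamples = try AudioProcessor.loadAudioAsFloatArray(
    fromPath: "interview_stereo.wav"
)
```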
The audio merging algorithm works like this: we find the peak of each individual channel, sum all channels into a mono track, and check whether the peak of the summed version is higher than the peak of any individual channel; if it is, we scale the whole track so that the peak of the mono track matches the peak of the loudest channel.
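A minimal sketch of that merge-and-normalize step, assuming the channels have already been decoded into per-channel Float arrays (this illustrates the approach described above, not the exact WhisperKit implementation):

```swift
// Sum channels into mono, then rescale so the mono peak never exceeds the
// peak of the loudest individual channel.
func mergeChannels(_ channels: [[Float]]) -> [Float] {
    guard let frameCount = channels.first?.count, frameCount > 0 else { return [] }

    // Peak of the loudest individual channel.
    let channelPeak = channels
        .map { $0.map(abs).max() ?? 0 }
        .max() ?? 0

    // Sum all channels sample by sample.
    var summed = [Float](repeating: 0, count: frameCount)
    for channel in channels {
        for i in 0..<min(frameCount, channel.count) {
            summed[i] += channel[i]
        }
    }

    // Peak of the summed (mono) track.
    let summedPeak = summed.map(abs).max() ?? 0

    // If summing made the track louder than any single channel, scale it back
    // down so its peak matches the loudest channel's peak.
    if summedPeak > channelPeak, summedPeak > 0 {
        let gain = channelPeak / summedPeak
        for i in summed.indices {
            summed[i] *= gain
        }
    }
    return summed
}
```

The net effect is that the summed track is never louder than the loudest original channel, which is why the merged waveform below keeps the same loudness as the source file.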

Example: top shows the mono (merged) buffer, bottom shows the individual channels pre-merge:
[image]

Here you can see that the merged audio maintains the same loudness as the original multi-channel audio file, and that the merged waveform reflects the contribution of all channels.

Resolves these two issues:
#134
#313

Rik Basu and others added 4 commits on March 27, 2025:

- Developers can pick any individual audio channel, sum all channels, or sum specific channels.
- Set channelMode in WhisperKitConfig and pass the AudioInputConfig.
- This maintains the integrity of the file.
- Changed unit test to use 10 mins of audio.
@ZachNagengast (Contributor) left a comment:


Looks good 👍

@ZachNagengast ZachNagengast merged commit 0a9bc42 into main Apr 3, 2025
19 checks passed
Successfully merging this pull request may close these issues:

- support for multi channel audio
- Audio input captures only the first channel