Multi channel audio merging #320


Merged
merged 4 commits into main on Apr 3, 2025
Conversation

@flashno (Contributor) commented on Mar 28, 2025

Previous versions of WhisperKit used only the first channel of audio, regardless of whether the file contained two or more channels. This update allows developers to combine all available channels, or choose specific channels to combine, before feeding the audio into transcription. To set the configuration when loading WhisperKit, pass it through WhisperKitConfig similar to this:

        let config = WhisperKitConfig(
            ...
            audioInputConfig: AudioInputConfig(channelMode: .sumChannels([1, 3, 5]))
        )

This will sum the 2nd, 4th, and 6th channels (the first channel is index 0).
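For reference, the channel selection mode is roughly an enum of the following shape (a minimal sketch based on the cases shown in this PR; the exact declaration in WhisperKit may differ):

```swift
// Sketch of the channel selection options (case names taken from the examples
// in this PR; whether "sum all channels" is expressed as sumChannels(nil) or a
// separate case is an assumption).
public enum ChannelMode {
    /// Use a single channel, identified by its zero-based index.
    case specificChannel(Int)
    /// Sum the given channel indices into mono; nil sums every channel.
    case sumChannels([Int]?)
}
```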
Another way to pass the channelMode is to feed it directly into the loadAudio function like this:

        let audioArray = try AudioProcessor.loadAudioAsFloatArray(
            fromPath: audioPath,
            channelMode: .specificChannel(0)
        )

This option has been added to all audio processing APIs that read audio.

⚠️ Important: This PR changes the default behavior: if the audio is not already mono, all channels are now summed and normalized.
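For example (a hedged sketch; it assumes the channelMode parameter has a default that sums all channels, and the file path is hypothetical):

```swift
// After this PR, loading a multi-channel file with no explicit channelMode
// yields a single summed-and-normalized mono array by default.
// ("interview_stereo.wav" is a hypothetical path; the default-argument
// behavior shown here is an assumption based on the description above.)
let monoSamples = try AudioProcessor.loadAudioAsFloatArray(
    fromPath: "interview_stereo.wav"
)
```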
The audio merging algorithm works like this: we find the peak of each individual channel, sum all channels into a mono track, and check whether the peak of the summed version is higher than the peak of any individual channel; if it is, we scale the whole track so that the peak of the mono track matches the peak of the loudest channel.
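A minimal sketch of that merge-and-normalize step, assuming the channels have already been decoded into per-channel Float arrays (this illustrates the approach described above, not the exact WhisperKit implementation):

```swift
// Sum channels into mono, then rescale so the mono peak never exceeds the
// peak of the loudest individual channel.
func mergeChannels(_ channels: [[Float]]) -> [Float] {
    guard let frameCount = channels.first?.count, frameCount > 0 else { return [] }

    // Peak of the loudest individual channel.
    let channelPeak = channels
        .map { $0.map(abs).max() ?? 0 }
        .max() ?? 0

    // Sum all channels sample by sample.
    var summed = [Float](repeating: 0, count: frameCount)
    for channel in channels {
        for i in 0..<min(frameCount, channel.count) {
            summed[i] += channel[i]
        }
    }

    // Peak of the summed (mono) track.
    let summedPeak = summed.map(abs).max() ?? 0

    // If summing made the track louder than any single channel, scale it back
    // down so its peak matches the loudest channel's peak.
    if summedPeak > channelPeak, summedPeak > 0 {
        let gain = channelPeak / summedPeak
        for i in summed.indices {
            summed[i] *= gain
        }
    }
    return summed
}
```

The net effect is that the summed track is never louder than the loudest original channel, which is why the merged waveform below keeps the same loudness as the source file.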

Example: top shows the mono (merged) buffer, bottom shows the individual channels pre-merge:
[image]

Here you can see that the merged audio maintains the same loudness as the original multi-channel audio file, and that the merged waveform reflects the contribution of all channels.

Resolves these two issues:
#134
#313

Rik Basu and others added 4 commits on March 27, 2025:

- Developers can pick any individual audio channel, sum all channels, or sum specific channels.
- Set channelMode in WhisperKitConfig and pass the AudioInputConfig.
- This maintains the integrity of the file.
- Changed unit test to use 10 mins of audio.
@ZachNagengast (Contributor) left a comment:


Looks good 👍

@ZachNagengast ZachNagengast merged commit 0a9bc42 into main Apr 3, 2025
19 checks passed
Successfully merging this pull request may close these issues:

- support for multi channel audio
- Audio input captures only the first channel