automatic-speech-recognition transcribe phonemes #1173

IanSweeneyAC · 2025-01-29T15:34:22Z

Feature request

Example: Transcribe English with word-level timestamps.

It would be nice to be able to transcribe audio and also get phoneme-level timestamps, something like:

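import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
// Proposed API: the 'phonemes' option is the requested feature and does not exist yet.
const output = await transcriber(url, { return_timestamps: ['word', 'phonemes'] });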

Motivation

Having phoneme-level timestamps would allow in-browser audio processing to drive the lip sync of a 3D avatar. This assumes that phoneme data is available as a building block of words in speech recognition.

Your contribution

I can animate a 3D character (e.g. a Ready Player Me avatar) in three.js from audio samples if phoneme timestamps are available.

xenova · 2025-02-08T11:57:57Z

Unfortunately, I don't believe Whisper has this functionality. However, you could use a separate library, such as https://www.npmjs.com/package/phonemizer, to extract phonemes from the input text.
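As a rough illustration of how the two pieces might be combined, the sketch below spreads each word's time span evenly across its phonemes. It assumes word-level chunks shaped like the `chunks` array returned with `return_timestamps: 'word'` (objects with `text` and `timestamp`). The function `estimatePhonemeTimestamps` and its `toPhonemes` callback are hypothetical names introduced here; the callback stands in for whatever phonemizer is used. Even spacing is only an approximation, since real phoneme durations vary.

```javascript
// Sketch only: Whisper provides word-level timestamps, not phoneme timings.
// Assumption: each word's duration is divided evenly among its phonemes.

/**
 * @param {{text: string, timestamp: [number, number|null]}[]} chunks word-level chunks
 * @param {(word: string) => string[]} toPhonemes hypothetical word -> phoneme lookup
 * @returns {{phoneme: string, timestamp: [number, number]}[]}
 */
function estimatePhonemeTimestamps(chunks, toPhonemes) {
  const result = [];
  for (const { text, timestamp: [start, end] } of chunks) {
    const phonemes = toPhonemes(text.trim());
    // Skip words with no phonemes or with a missing end time.
    if (phonemes.length === 0 || end == null) continue;
    const duration = end - start;
    phonemes.forEach((phoneme, i) => {
      result.push({
        phoneme,
        timestamp: [
          start + (i * duration) / phonemes.length,
          start + ((i + 1) * duration) / phonemes.length,
        ],
      });
    });
  }
  return result;
}

// Hand-written sample data, not real model output.
const chunks = [
  { text: ' And', timestamp: [0, 1.5] },
  { text: ' so', timestamp: [1.5, 2.5] },
];
// Hypothetical lookup table standing in for a real phonemizer.
const lookup = { And: ['æ', 'n', 'd'], so: ['s', 'oʊ'] };
const phonemeChunks = estimatePhonemeTimestamps(chunks, (word) => lookup[word] ?? []);
console.log(phonemeChunks);
```

Passing the phonemizer in as a callback keeps the timing logic independent of any particular library, so the estimate could later be replaced by a forced-alignment step without changing the caller.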

IanSweeneyAC added the "enhancement" (New feature or request) label on Jan 29, 2025
