❓ How can I deal with long audio in Speech-to-text? (e.g., 60 - 120 min) #102
-
Hi, I am trying to run the speech-to-text model on GPU/CPU on a large audio file, but I get an out-of-memory error on both. Is there an iterable, lazy dataloader that can feed the audio file in 10-minute pieces? I have tried silence-based audio segmentation, but the results are not at the same level as Silero.
Replies: 1 comment 1 reply
-
Hi, we do not provide a public built-in streaming interface for our models, for simplicity reasons. You can use our VAD (https://github.com/snakers4/silero-vad) to split the audio into chunks. STT works best on 5-15 s audio chunks anyway. If a chunk is longer than that, you can use an align method in the decoder: apply it to a fixed-length chunk, split at some word, and run STT a second time on the sub-chunks.
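The chunking step can be sketched roughly as follows. This is a minimal illustration, not part of the Silero API: `iter_chunks` is a hypothetical helper, and it assumes the VAD output is a list of dicts with `start` and `end` given in samples (the shape `get_speech_timestamps` in silero-vad is expected to return by default). It merges neighbouring speech segments lazily into chunks of at most 15 s. A single segment longer than that is cut at a fixed length, which is a cruder stand-in for the word-level split via decoder alignment mentioned above.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_chunks(
    speech_timestamps: Iterable[Dict[str, int]],
    sampling_rate: int = 16000,
    max_chunk_s: float = 15.0,
) -> Iterator[Tuple[int, int]]:
    """Lazily yield (start, end) sample ranges no longer than max_chunk_s.

    Hypothetical helper: assumes each item has 'start' and 'end' in samples.
    """
    max_len = int(max_chunk_s * sampling_rate)
    cur_start = cur_end = None  # chunk currently being accumulated

    for seg in speech_timestamps:
        start, end = seg["start"], seg["end"]

        # A single speech segment longer than the limit: flush what we have,
        # then cut fixed-length pieces off its front (crude fallback).
        while end - start > max_len:
            if cur_start is not None:
                yield cur_start, cur_end
                cur_start = cur_end = None
            yield start, start + max_len
            start += max_len

        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            # Segment still fits: extend the current chunk.
            cur_end = end
        else:
            # Segment does not fit: emit the chunk and start a new one.
            yield cur_start, cur_end
            cur_start, cur_end = start, end

    if cur_start is not None:
        yield cur_start, cur_end
```

Because this is a generator, only one chunk needs to be sliced out of the waveform (for example `wav[start:end]`) and sent to the STT model at a time, which keeps memory use bounded for 60-120 min recordings.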