Streaming ASR model latency issue #5729
Hi @DinnoKoluh, the larger latency you see for the first chunk might come from the block processing done by the streaming Conformer encoder: it needs to fill the entire initial block with (downsampled) input frames before it can compute any encoder output.
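A rough way to see the size of that initial delay, as a sketch assuming a 10 ms feature frame shift and 4x convolutional subsampling (assumed typical values, not read from any config; check the model's yaml for the actual ones):

```python
# Back-of-the-envelope initial latency of block processing. The frame
# shift and subsampling factor are assumptions, not ESPnet constants;
# substitute the values from your model config.

FRAME_SHIFT_MS = 10   # assumed feature frame shift
SUBSAMPLING = 4       # assumed convolutional subsampling factor

def initial_block_latency_ms(block_size):
    """Audio needed before the encoder can emit its first output:
    block_size downsampled frames, each covering SUBSAMPLING raw frames."""
    return block_size * SUBSAMPLING * FRAME_SHIFT_MS

print(initial_block_latency_ms(40))  # hypothetical block_size=40 -> 1600 ms
```

Under those assumptions a hypothetical block_size of 40 already means 1.6 s of audio must arrive before the first encoder output, which would explain an empty transcript for the first couple of seconds.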
I understand; it's unfortunate that the delay is fairly large.
AFAIU, the decoding process will always return the best full-length hypothesis (or n-best) for the entire available encoded input sequence. The (label-synchronous) beam search might pick a different path when more encoder input becomes available, and thus change words that appeared in earlier results.
I haven't run any inference on GPU, so I can't really comment on that.
I understand, but it is supposed to be a streaming model, so I am just mimicking the streaming by chunking a long audio file. In practice I would expect to have a stream of audio chunks of some fixed length, and that stream could last, for example, for hours. And doesn't the "is_false" parameter reset the history (buffer), which I already mentioned? Maybe I should ask the main question with an example: is the ESPnet streaming model capable, on a live audio stream (say, listening to a live news channel on YouTube), of producing a live transcript that lags behind the audio stream by at most 500 ms (or some other fixed amount)?
As is, it won't be able to work on a live audio stream from YouTube that runs for hours, as the decoder keeps the entire history. You would need to implement some simple endpointing in combination with the "is_false" parameter to cut the audio at appropriate times (pauses etc.) and reset the internal buffer. You can control the delay via the encoder's block_size, hop_size and look_ahead parameters in the model config; 500 ms (except for the initial phase) will be challenging but should be possible.
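The endpointing idea can be sketched roughly like this; everything here (the energy threshold, the chunk grouping, all names) is a made-up illustration rather than ESPnet API — the point is only that each utterance boundary is where you would send the reset flag so the recognizer clears its internal buffer:

```python
# Toy energy-based endpointing: group incoming chunks into utterances
# separated by a run of silent chunks. At each boundary a real system
# would pass the final-chunk/reset flag to the streaming recognizer.
# All names and thresholds here are hypothetical.

def is_silent(chunk, threshold=1e-3):
    # A chunk counts as silence if its mean absolute amplitude is tiny.
    return sum(abs(s) for s in chunk) / max(len(chunk), 1) < threshold

def segment_stream(chunks, max_silent=3):
    # Close an utterance after `max_silent` consecutive silent chunks.
    utterances, current, silent_run = [], [], 0
    for chunk in chunks:
        current.append(chunk)
        silent_run = silent_run + 1 if is_silent(chunk) else 0
        if silent_run >= max_silent and len(current) > silent_run:
            utterances.append(current[:-silent_run])  # drop trailing silence
            current, silent_run = [], 0
    if current:
        utterances.append(current)
    return utterances
```

With this kind of segmentation the decoder history never grows without bound, at the cost of losing cross-utterance context.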
Okay, thank you for the info. |
Hi. Firstly, thank you for open-sourcing the model and the libraries.
I tried testing the streaming ASR Jupyter notebook from your repository. I was running this script on a remote machine with GPU access, and since I don't have a microphone on that machine I simulated the streaming capability by splitting an audio file into chunks of fixed size and feeding them to the model inside a loop. You can see the updated notebook and the audio file here.
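The simulation loop can be sketched like this; it is a self-contained stand-in where the actual streaming recognizer call from the notebook is omitted, and the sample rate and chunk length are assumptions chosen to match the 0.256 s steps in the output:

```python
# Simulate a live stream by slicing a waveform into fixed-size chunks.
# In the notebook each chunk is fed to the streaming model; here that
# call is left out so the sketch runs standalone.

SAMPLE_RATE = 16000                            # assumed sample rate
CHUNK_SECONDS = 0.256                          # matches the log's 0.256 s steps
CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_SECONDS)  # 4096 samples per chunk

def chunk_audio(samples, chunk_size=CHUNK_SIZE):
    # Yield fixed-size chunks; the last one may be shorter.
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

audio = [0.0] * (SAMPLE_RATE * 2)   # 2 s of dummy audio
chunks = list(chunk_audio(audio))
print(len(chunks), len(chunks[-1])) # 8 chunks, the last one 3328 samples
```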
If you look at the last cell of the notebook, you can see the issues I am having (I kept the output). I am also pasting the output down below for a quick look.
If you have any idea what could be the reasons for the three points I put above, I would appreciate it.
Thanks!
THE NOTEBOOK OUTPUT:

```
Segment: 0; Finished transcription in 0.004 seconds. Current timestamp of the file: 0.256 seconds.
Segment: 1; Finished transcription in 0.004 seconds. Current timestamp of the file: 0.512 seconds.
Segment: 2; Finished transcription in 0.003 seconds. Current timestamp of the file: 0.768 seconds.
Segment: 3; Finished transcription in 0.003 seconds. Current timestamp of the file: 1.024 seconds.
Segment: 4; Finished transcription in 0.003 seconds. Current timestamp of the file: 1.28 seconds.
Segment: 5; Finished transcription in 0.018 seconds. Current timestamp of the file: 1.536 seconds.
Segment: 6; Finished transcription in 0.003 seconds. Current timestamp of the file: 1.792 seconds.
Segment: 7; Finished transcription in 0.039 seconds. Current timestamp of the file: 2.048 seconds.
Segment: 7: Text:
Segment: 8; Finished transcription in 0.003 seconds. Current timestamp of the file: 2.304 seconds.
Segment: 8: Text:
Segment: 9; Finished transcription in 0.093 seconds. Current timestamp of the file: 2.56 seconds.
Segment: 9: Text: their s
Segment: 10; Finished transcription in 0.003 seconds. Current timestamp of the file: 2.816 seconds.
Segment: 10: Text: their s
Segment: 11; Finished transcription in 0.213 seconds. Current timestamp of the file: 3.072 seconds.
Segment: 11: Text: their surface are
Segment: 12; Finished transcription in 0.004 seconds. Current timestamp of the file: 3.328 seconds.
Segment: 12: Text: their surface are
Segment: 13; Finished transcription in 0.172 seconds. Current timestamp of the file: 3.584 seconds.
Segment: 13: Text: their surface areas at close
Segment: 14; Finished transcription in 0.003 seconds. Current timestamp of the file: 3.84 seconds.
Segment: 14: Text: their surface areas at close
Segment: 15; Finished transcription in 0.220 seconds. Current timestamp of the file: 4.096 seconds.
Segment: 15: Text: their surface areas at close to some of the
Segment: 16; Finished transcription in 0.004 seconds. Current timestamp of the file: 4.352 seconds.
Segment: 16: Text: their surface areas at close to some of the
Segment: 17; Finished transcription in 0.200 seconds. Current timestamp of the file: 4.608 seconds.
Segment: 17: Text: their surface areas at close to some of the smaller st
Segment: 18; Finished transcription in 0.004 seconds. Current timestamp of the file: 4.864 seconds.
Segment: 18: Text: their surface areas at close to some of the smaller st
Segment: 19; Finished transcription in 0.290 seconds. Current timestamp of the file: 5.12 seconds.
Segment: 19: Text: their surface areas at close to some of the smaller states on ear
Segment: 20; Finished transcription in 0.003 seconds. Current timestamp of the file: 5.376 seconds.
Segment: 20: Text: their surface areas at close to some of the smaller states on ear
****************************************************************************************************
```