
Streaming ASR model latency issue #5729

Open · DinnoKoluh opened this issue Apr 2, 2024 · 6 comments

@DinnoKoluh
Hi. Firstly, thank you for open-sourcing the model and the libraries.

I tried testing the streaming ASR Jupyter notebook from your repository. I was running the script on a remote machine with GPU access, and since I don't have a microphone on that machine, I simulated the streaming capability by splitting an audio file into chunks of a fixed size and feeding them to the model inside a loop (roughly as in the sketch below). You can see the updated notebook and the audio file here.
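Roughly, the simulation loop looks like this (a minimal sketch of what the notebook does; `speech2text` stands for the streaming `Speech2TextStreaming` instance built earlier in the notebook, and `audio.wav` is a placeholder file name):

```python
import time

import soundfile as sf

# Simulated streaming: split the file into fixed-size chunks and feed them
# to the streaming recognizer in a loop.
speech, rate = sf.read("audio.wav")
chunk_len = int(rate * 0.256)  # 256 ms per chunk

for i, start in enumerate(range(0, len(speech), chunk_len)):
    chunk = speech[start:start + chunk_len]
    is_final = start + chunk_len >= len(speech)

    t0 = time.time()
    results = speech2text(speech=chunk, is_final=is_final)
    elapsed = time.time() - t0

    timestamp = (start + len(chunk)) / rate
    print(f"Segment: {i}; Finished transcription in {elapsed:.3f} seconds. "
          f"Current timestamp of the file: {timestamp} seconds.")
    if results:
        print(f"Segment: {i}: Text: {results[0][0]}")
```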

If you look at the last cell of the notebook, you can see the issues I am having (I kept the output). I am also pasting the output down below for a quick look.

  1. Firstly, it takes around $2.56$ seconds ($10$ chunks of $256$ ms) for the first transcript to appear, even though speech starts near the beginning of the file (maybe at $350$ ms). After that, I get text updates only for every second chunk, which is also visible in the time it took to transcribe each chunk. When I split the initial file into more, shorter chunks, I got similar behaviour; there were just more skips between text updates.
  2. I also noticed that the time (latency) it takes to transcribe a chunk increases as I add new chunks. I tried setting the is_final flag to true for every tenth chunk in the hope that it would reset the buffer, but I didn't see any effect. My guess for the increase in latency is that for each update of the transcription I get the whole transcription back instead of just the update. I am not sure if that is the desired behaviour, but it seems to me that the more natural behaviour would be to return transcription updates for, say, the last $5$ words instead of the whole transcript.
  3. I would also like to add that changing the device from GPU to CPU or vice versa doesn't have any effect on the latency, which is odd; I would expect switching to a GPU to decrease the latency by a lot.

If you have any idea what could be causing the three points above, I would appreciate it.
Thanks!

THE NOTEBOOK OUTPUT:

```
Segment: 0; Finished transcription in 0.004 seconds. Current timestamp of the file: 0.256 seconds.


Segment: 1; Finished transcription in 0.004 seconds. Current timestamp of the file: 0.512 seconds.


Segment: 2; Finished transcription in 0.003 seconds. Current timestamp of the file: 0.768 seconds.


Segment: 3; Finished transcription in 0.003 seconds. Current timestamp of the file: 1.024 seconds.


Segment: 4; Finished transcription in 0.003 seconds. Current timestamp of the file: 1.28 seconds.


Segment: 5; Finished transcription in 0.018 seconds. Current timestamp of the file: 1.536 seconds.


Segment: 6; Finished transcription in 0.003 seconds. Current timestamp of the file: 1.792 seconds.


Segment: 7; Finished transcription in 0.039 seconds. Current timestamp of the file: 2.048 seconds.
Segment: 7: Text:


Segment: 8; Finished transcription in 0.003 seconds. Current timestamp of the file: 2.304 seconds.
Segment: 8: Text:


Segment: 9; Finished transcription in 0.093 seconds. Current timestamp of the file: 2.56 seconds.
Segment: 9: Text: their s


Segment: 10; Finished transcription in 0.003 seconds. Current timestamp of the file: 2.816 seconds.
Segment: 10: Text: their s


Segment: 11; Finished transcription in 0.213 seconds. Current timestamp of the file: 3.072 seconds.
Segment: 11: Text: their surface are


Segment: 12; Finished transcription in 0.004 seconds. Current timestamp of the file: 3.328 seconds.
Segment: 12: Text: their surface are


Segment: 13; Finished transcription in 0.172 seconds. Current timestamp of the file: 3.584 seconds.
Segment: 13: Text: their surface areas at close


Segment: 14; Finished transcription in 0.003 seconds. Current timestamp of the file: 3.84 seconds.
Segment: 14: Text: their surface areas at close


Segment: 15; Finished transcription in 0.220 seconds. Current timestamp of the file: 4.096 seconds.
Segment: 15: Text: their surface areas at close to some of the


Segment: 16; Finished transcription in 0.004 seconds. Current timestamp of the file: 4.352 seconds.
Segment: 16: Text: their surface areas at close to some of the


Segment: 17; Finished transcription in 0.200 seconds. Current timestamp of the file: 4.608 seconds.
Segment: 17: Text: their surface areas at close to some of the smaller st


Segment: 18; Finished transcription in 0.004 seconds. Current timestamp of the file: 4.864 seconds.
Segment: 18: Text: their surface areas at close to some of the smaller st


Segment: 19; Finished transcription in 0.290 seconds. Current timestamp of the file: 5.12 seconds.
Segment: 19: Text: their surface areas at close to some of the smaller states on ear


Segment: 20; Finished transcription in 0.003 seconds. Current timestamp of the file: 5.376 seconds.
Segment: 20: Text: their surface areas at close to some of the smaller states on ear
****************************************************************************************************
```

@espnetUser (Contributor) commented Apr 15, 2024

Hi @DinnoKoluh, the larger latency you see for the first chunk might come from the block processing done by the streaming Conformer encoder as it needs to fill the entire initial block with (downsampled) input frames before it can actually compute any encoder output.
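As a back-of-the-envelope illustration (the block_size, hop_size and look_ahead values below are placeholders; take the real ones from your model config):

```python
# Rough estimate of the blockwise encoder's delays (illustrative values only).
frame_shift_ms = 10   # feature frame shift
subsampling = 4       # Conv2d subsampling factor in the encoder front-end
block_size = 40       # encoder frames per block (placeholder)
hop_size = 16         # frames the block advances per step (placeholder)
look_ahead = 16       # look-ahead frames (placeholder)

ms_per_frame = frame_shift_ms * subsampling
first_output_ms = (block_size + look_ahead) * ms_per_frame   # audio needed before any text
update_interval_ms = hop_size * ms_per_frame                 # audio between later updates
print(first_output_ms, update_interval_ms)  # -> 2240 640 with these placeholders
```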

@DinnoKoluh (Author)

I understand; it's too bad that the delay is fairly large.
Do you have any idea about the other issues I mentioned?

@espnetUser (Contributor)

> My guess for the increase in latency is that for each update of the transcription I get the whole transcription back instead of just the update. I am not sure if that is the desired behaviour, but it seems to me that the more natural behaviour would be to return transcription updates for, say, the last 5 words instead of the whole transcript.

AFAIU, the decoding process will always return the best (full-length) hypothesis (or n-best list) for the entire available encoded input sequence. The (label-synchronous) beam search might pick a different path when more encoder input becomes available, thus changing words that appeared in earlier results.
Also, for very long audio the current decoder implementation will slow down noticeably because it keeps the whole "history" of encoded inputs, so for very long audio it is better to use VAD to split it into smaller segments before decoding.
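If you only want to display the new part of each result, you can diff consecutive hypotheses on the caller side; a minimal sketch, assuming the full text is returned on every call as described above:

```python
def incremental_update(prev_text: str, new_text: str) -> str:
    """Return only the part of new_text that extends prev_text.

    If the new hypothesis no longer starts with the previous one (the beam
    search revised earlier words), fall back to returning the full text.
    """
    if new_text.startswith(prev_text):
        return new_text[len(prev_text):]
    return new_text


prev = ""
for text in ["their s", "their s", "their surface are"]:
    delta = incremental_update(prev, text)
    if delta:
        print(repr(delta))   # "their s", then "urface are"
    prev = text
```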

> I would also like to add that changing the device from GPU to CPU or vice versa doesn't have any effect on the latency, which is odd; I would expect switching to a GPU to decrease the latency by a lot.

I haven't run any inference on GPU so can't really comment on that.

@DinnoKoluh (Author)

I understand, but it is supposed to be a streaming model, so I am just mimicking streaming by chunking a long audio file. In practice, though, I would expect to have a stream of audio chunks of some fixed length, and that stream could last for, say, $1$ hour. So the inference should be done only on the incoming chunk, or on chunks near the incoming one, since it may need context to update the transcript.

And doesn't the is_final parameter, which I already mentioned, reset the history (buffer)?

Maybe I should ask the main question with an example: is the ESPnet streaming model capable, on a live audio stream (say, listening to a live news channel on YouTube), of producing a live transcript that lags behind the audio stream by at most $500$ ms (or some other fixed amount)?

@espnetUser (Contributor)

> Maybe I should ask the main question with an example: is the ESPnet streaming model capable, on a live audio stream (say, listening to a live news channel on YouTube), of producing a live transcript that lags behind the audio stream by at most 500 ms (or some other fixed amount)?

As is, it won't be able to work on a live audio stream from YouTube that runs for hours (as the decoder code keeps the entire history). You would need to implement some simple endpointing in combination with the is_final parameter to cut the audio at appropriate times (pauses etc.) and reset the internal buffer. The delay can be controlled via the encoder block_size, hop_size and look_ahead parameters in the model config; 500 ms (except for the initial phase) will be challenging but should be possible.
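A rough sketch of what such an endpointing loop could look like (same `Speech2TextStreaming`-style call as in the notebook; `audio_chunk_stream` and the threshold/pause values are placeholders, not tuned):

```python
import numpy as np

PAUSE_CHUNKS = 3          # ~3 x 256 ms of low energy counts as a pause (placeholder)
ENERGY_THRESHOLD = 1e-4   # naive energy check standing in for a real VAD (placeholder)

silent_chunks = 0
for chunk in audio_chunk_stream():   # hypothetical generator yielding 256 ms numpy chunks
    if np.mean(chunk ** 2) < ENERGY_THRESHOLD:
        silent_chunks += 1
    else:
        silent_chunks = 0

    # Finalize at a pause so the recognizer can emit a final result and
    # reset its internal buffer before the next segment starts.
    is_final = silent_chunks >= PAUSE_CHUNKS
    results = speech2text(speech=chunk, is_final=is_final)
    if results:
        print(results[0][0])
    if is_final:
        silent_chunks = 0
```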

@DinnoKoluh (Author)

Okay, thank you for the info.
