accuracy of speaker detection #142

Paranoidal97 · 2023-12-05T20:53:26Z

Is it possible to improve the accuracy of speaker detection? I am not talking about overlapping voices, but about cases such as a long speech by one speaker during which another speaker briefly says something like "yeah".

filmo · 2024-01-28T18:55:30Z

You can try integrating pyannote as an alternative to NeMo.

In general, I've found that speaker diarization significantly lags the improvements in transcription. In other words, while Whisper transcription is pretty amazing and almost a 'solved problem', the same is not true of diarization.

In particular, small interjections and quick changes of turn are often not recognized by either NeMo (the default for this repository) or pyannote. They use orthogonal methods, so I have found that when using both, one will often be better than the other depending on the source audio quality. NeMo is provided with the 'telephonic' model, and while the package shows 'general' and 'meeting', Nvidia has not released those models for whatever reason. ("They're not ready" was the last response I got about those a while back.) Pyannote seems to do better when there's a big discrepancy in the relative volumes of the speakers' voices. (Example: one speaker materially louder than the other.)

NeMo has some settings that you can try tweaking. I'm not sure how they interact with the telephonic model that Nvidia has released. Meaning: I don't know whether changing the preset telephonic.yaml actually adjusts the NeMo model or just messes it up. Worth a shot. There are some 'minimum' length settings in that file.
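To illustrate why a 'minimum' length setting matters for short interjections, here is a minimal sketch of what a minimum-duration filter does to a list of speaker segments. This is plain Python for illustration only, not NeMo code: the function name `drop_short_segments`, the segment values, and the thresholds are all assumptions, and the real parameter names in telephonic.yaml may differ.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, speaker_label); made-up example data
Segment = Tuple[float, float, str]


def drop_short_segments(segments: List[Segment], min_duration: float) -> List[Segment]:
    """Keep only segments lasting at least `min_duration` seconds (hypothetical helper)."""
    return [s for s in segments if (s[1] - s[0]) >= min_duration]


segments = [
    (0.0, 12.4, "speaker_0"),   # long speech
    (12.5, 12.8, "speaker_1"),  # a quick "yeah", roughly 0.3 s
    (12.9, 30.0, "speaker_0"),  # long speech continues
]

# A stricter threshold discards the interjection; a more lenient one keeps it.
strict = drop_short_segments(segments, min_duration=0.5)
lenient = drop_short_segments(segments, min_duration=0.2)
```

With a threshold above the length of the "yeah", the second speaker disappears from the result entirely, which is roughly the symptom described in the original question.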

The point being, if you want better accuracy of speaker detection, investigate various speaker diarization projects. Fortunately, they all seem to output the same RTTM files, so once you have that, you could integrate it into the rest of this project if you find something useful.
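As a rough sketch of how RTTM output from any diarizer could be read, the snippet below parses the standard `SPEAKER` lines (file id, channel, onset, duration, speaker name) into simple records. The `Turn` class, the `parse_rttm` function, and the sample data are made up for illustration; how the result would plug into this project is not shown and would need to be worked out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    file_id: str
    start: float     # onset in seconds
    duration: float  # length in seconds
    speaker: str

    @property
    def end(self) -> float:
        return self.start + self.duration


def parse_rttm(text: str) -> List[Turn]:
    """Parse RTTM text, keeping only well-formed SPEAKER lines (hypothetical helper)."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Layout: SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(
            Turn(
                file_id=fields[1],
                start=float(fields[3]),
                duration=float(fields[4]),
                speaker=fields[7],
            )
        )
    return sorted(turns, key=lambda t: t.start)


# Made-up sample: a long speech with a short interjection in the middle.
sample = """\
SPEAKER call_01 1 0.00 12.40 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER call_01 1 12.50 0.30 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER call_01 1 12.90 17.10 <NA> <NA> speaker_0 <NA> <NA>
"""

turns = parse_rttm(sample)
# Short turns are the ones diarizers most often miss, so they are worth inspecting.
short_turns = [t for t in turns if t.duration < 0.5]
```

Running the same parser over RTTM files from two different diarizers would make it easy to compare which one picks up the short turns on a given recording.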

kc01-8 · 2024-09-08T21:47:05Z

On the note of accuracy, are there any options to disable or prevent overlap?

1 reply

MahmoudAshraf97 Sep 8, 2024
Maintainer

The only possible solution is to isolate each speaker first and then transcribe, because Whisper doesn't handle overlapping speech either.
