accuracy of speaker detection #142
-
You can try integrating pyannote as an alternative to NeMo. In general, I've found that speaker diarization lags well behind the improvements in transcription. In other words, while Whisper transcription is pretty amazing and almost a 'solved problem', the same is not true of diarization. In particular, small interjections and quick changes of turn are often not recognized by either NeMo (the default for this repository) or pyannote.

The two use orthogonal methods, so I have found that if you run both, one will often do better than the other depending on the source audio quality. Pyannote seems to do better when there's a big discrepancy in the relative volumes of the speakers' voices (for example, one speaker materially louder than the other).

NeMo is provided with the 'telephonic' model, and while the package shows 'general' and 'meeting', Nvidia has not released those models for whatever reason. ("They're not ready" was the last response I got about those a while back.)

NeMo has some settings that you can try tweaking. I'm not sure how they interact with the telephonic model that Nvidia has released. Meaning: I don't know whether changing the preset telephonic.yaml actually adjusts the NeMo model or just messes it up. Worth a shot. There are some 'minimum' length settings in that file.

The point being, if you want better accuracy of speaker detection, investigate the various speaker diarization projects. Fortunately they all seem to output the same RTTM files, so once you have one, you could integrate it into the rest of this project if you find something useful.
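Since RTTM is the common output format mentioned above, here is a minimal sketch of reading one in plain Python, so output from different diarization tools can be inspected the same way. It assumes the usual space-separated `SPEAKER` record layout (file ID, channel, start time, duration, then the speaker label in the eighth field). The names `Turn`, `parse_rttm`, `talk_time`, and `short_turns` are hypothetical helpers made up for illustration and are not part of this repository, NeMo, or pyannote. The sample data is invented too.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Turn:
    """One speaker turn taken from a SPEAKER record of an RTTM file."""
    file_id: str
    start: float      # seconds from the start of the recording
    duration: float   # seconds
    speaker: str

    @property
    def end(self) -> float:
        return self.start + self.duration


def parse_rttm(text: str) -> list[Turn]:
    """Parse RTTM text into speaker turns sorted by start time."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(
            Turn(
                file_id=fields[1],
                start=float(fields[3]),
                duration=float(fields[4]),
                speaker=fields[7],
            )
        )
    return sorted(turns, key=lambda t: t.start)


def talk_time(turns: list[Turn]) -> dict[str, float]:
    """Total speaking time in seconds per speaker label."""
    totals: dict[str, float] = defaultdict(float)
    for turn in turns:
        totals[turn.speaker] += turn.duration
    return dict(totals)


def short_turns(turns: list[Turn], max_duration: float = 0.5) -> list[Turn]:
    """Turns no longer than max_duration, e.g. a quick "yeah"."""
    return [t for t in turns if t.duration <= max_duration]


# Made-up sample: a long turn, a brief interjection, then a reply.
SAMPLE_RTTM = """\
SPEAKER call1 1 0.00 12.50 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER call1 1 6.20 0.40 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER call1 1 12.50 3.00 <NA> <NA> speaker_1 <NA> <NA>
"""

if __name__ == "__main__":
    turns = parse_rttm(SAMPLE_RTTM)
    print(talk_time(turns))
    for t in short_turns(turns):
        print(f"short turn: {t.speaker} at {t.start:.2f}s for {t.duration:.2f}s")
```

Running the same helpers over the RTTM from NeMo and the RTTM from pyannote for one recording is a quick way to see which tool picked up the short interjections. Keep in mind that speaker labels are arbitrary per tool, so `speaker_0` in one file is not necessarily `speaker_0` in the other.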
-
On the note of accuracy, are there any options to disable or prevent overlap?
-
Is it possible to improve the accuracy of speaker detection? I am not talking about overlapping voices, but about cases such as a long stretch of speech by one speaker during which another speaker briefly says something like "yeah".