
Dealing with constant hallucinations #121

Open
J-Korn opened this issue Sep 11, 2024 · 4 comments

Comments

@J-Korn

J-Korn commented Sep 11, 2024

Using the large-v3 model to transcribe Greek audio from a live stream, I am often met with continuous results reading "Υπότιτλοι AUTHORWAVE".

It seems the model is bugged in a way that it outputs that phrase when it does not understand the input.

Setting vac and vad to True does not seem to reduce that occurrence.

Is there some way I can discard this specific phrase, or similar ones, so they do not get confirmed and sent to the client?

@Gldkslfmsd
Collaborator

Gldkslfmsd commented Sep 12, 2024

Hi, you can check whether the same thing happens with the offline Whisper model with VAD on. If yes, then a better model can help; also make sure the sound quality is good enough.

Alternatively, just remove that phrase from all transcripts before searching for the longest common prefix. But beware: it then won't be output when you actually need it, and this won't help when Whisper hallucinates anything else.
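That removal could be sketched as a small filter applied to the timestamped word list before the longest-common-prefix step. This is a minimal sketch, assuming (beg, end, word) tuples in the style of ts_words(); the phrase list and function name are illustrative, not part of the library:

```python
# Known hallucinated boilerplate phrases to strip before the
# longest-common-prefix matching (contents are an assumption
# based on this thread).
HALLUCINATED_PHRASES = ("AUTHORWAVE",)

def strip_hallucinations(ts_words):
    """Drop timestamped words that contain a known hallucinated phrase.

    ts_words: list of (beg, end, word) tuples, as produced by ts_words().
    """
    return [
        (beg, end, word)
        for beg, end, word in ts_words
        if not any(p in word for p in HALLUCINATED_PHRASES)
    ]
```

Filtering both hypotheses consistently matters here: if the phrase survives in one transcript and not the other, it can never be part of the common prefix anyway, but stripping it everywhere avoids it blocking confirmation of surrounding words.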

@J-Korn
Author

J-Korn commented Sep 12, 2024

This is not an actual Greek phrase. My best guess is that the model was partially trained on Greek community-generated subtitles for TV shows and the like, and those had the creator's name as an advertisement during moments of silence, where actual captioning was not needed. "Υπότιτλοι" translates to "Subtitles", and "AUTHORWAVE" is not a Greek word, or any word that means anything for that matter.

This is using the large-v3 model, and I cannot find any model that does Greek better than this. Do note that the phrase also shows up when transcribing videos with the base Whisper.

For now I am attempting to remove it like this:


if self.contains_unwanted_word(tsw, "AUTHORWAVE"):
    logger.debug("Discarding transcription result due to unwanted word 'AUTHORWAVE'")
    return None, None, ""

Where I check the words retrieved from self.asr.ts_words(res) during process_iter and return early if this is found.
I am not sure this is the correct way to go about it though.
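For reference, the contains_unwanted_word helper used above might look like this minimal sketch (the tuple layout follows ts_words() as used in the snippet; the helper itself is not part of the library):

```python
def contains_unwanted_word(ts_words, phrase):
    """Return True if any timestamped word contains the given phrase.

    ts_words: list of (beg, end, word) tuples from self.asr.ts_words(res).
    """
    return any(phrase in word for _beg, _end, word in ts_words)
```

Note that returning (None, None, "") early, as above, discards the entire transcription pass, including any legitimate words in the same chunk; filtering only the offending words (as suggested earlier in the thread) is gentler.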

Edit:
Related to this, I am also trying to add a check: if 5 seconds have passed without a chunk being confirmed, just confirm everything in the buffer, in an attempt to improve latency at the cost of accuracy.
This is my current process_iter, but I feel this particular change needs to be done someplace else:

def process_iter(self):
        """Runs on the current audio buffer.
        Returns: a tuple (beg_timestamp, end_timestamp, "text"), or (None, None, "").
        The non-empty text is the confirmed (committed) partial transcript.
        """
        prompt, non_prompt = self.prompt()
        logger.debug(f"PROMPT: {prompt}")
        logger.debug(f"CONTEXT: {non_prompt}")
        logger.debug(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}")
        res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)

        tsw = self.asr.ts_words(res)

        # Check if 'AUTHORWAVE' is in the transcription result
        if self.contains_unwanted_word(tsw, "AUTHORWAVE"):
            logger.debug("Discarding transcription result due to unwanted word 'AUTHORWAVE'")
            return None, None, ""

        self.transcript_buffer.insert(tsw, self.buffer_time_offset)
        o = self.transcript_buffer.flush()
        if o:
            self.commited.extend(o)
            self.last_confirmed_time = time.time()
            completed = self.to_flush(o)
            logger.debug(f">>>>COMPLETE NOW: {completed}")
        else:
            completed = None

        current_time = time.time()
        if current_time - self.last_confirmed_time > self.confirmation_timeout:
            logger.debug("Timeout exceeded. Forcing confirmation of available text.")
            self.force_confirm_text()

        the_rest = self.to_flush(self.transcript_buffer.complete())
        logger.debug(f"INCOMPLETE: {the_rest}")

        # there is a newly confirmed text
        if o and self.buffer_trimming_way == "sentence":  # trim the completed sentences
            if len(self.audio_buffer)/self.SAMPLING_RATE > self.buffer_trimming_sec:  # longer than this
                self.chunk_completed_sentence()


        if self.buffer_trimming_way == "segment":
            s = self.buffer_trimming_sec  # trim the completed segments longer than s,
        else:
            s = 30 # if the audio buffer is longer than 30s, trim it

        if len(self.audio_buffer)/self.SAMPLING_RATE > s:
            self.chunk_completed_segment(res)

        logger.debug(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}")
        return self.to_flush(o)
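The timeout-based force_confirm_text referenced above could live in a small helper like the following sketch. Everything here is an assumption based on the snippet, not the library's actual interface: it presumes the processor exposes transcript_buffer.complete() returning unconfirmed (beg, end, word) tuples, plus the commited, last_confirmed_time, and confirmation_timeout attributes used in process_iter:

```python
import time

def force_confirm(processor, now=None):
    """Hypothetical helper: commit everything still held in the hypothesis
    buffer once confirmation_timeout seconds pass without a confirmed chunk.

    Returns the list of force-confirmed (beg, end, word) tuples, or []
    if the timeout has not elapsed yet.
    """
    now = time.time() if now is None else now
    if now - processor.last_confirmed_time <= processor.confirmation_timeout:
        return []  # nothing to force yet
    pending = processor.transcript_buffer.complete()
    if pending:
        processor.commited.extend(pending)
    processor.last_confirmed_time = now
    return pending
```

Keeping it as a separate function called at the end of process_iter (rather than inline) makes the accuracy/latency trade-off easy to toggle; note the forced words skip the usual two-pass confirmation, so hallucinations can slip through this path.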

@J-Korn
Author

J-Korn commented Sep 17, 2024

Any help on this matter would be greatly appreciated.

@Gldkslfmsd
Collaborator

Hi, I'd like to help but I'm busy now. Small advice:
Create a "development set" -- an audio recording on which the hallucination happens and on which you can measure the ASR quality quickly -- preferably by WER against a gold transcript, or at least by counting the number of hallucinated words. Measure the quality with your change (and with various parameters) and without it. Use that to decide whether to apply the change.

Btw. -- latency should be measured as well, but it can be neglected for a start.
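The suggested WER measurement needs no external dependency; a minimal sketch using word-level Levenshtein (Wagner-Fischer) distance, assuming whitespace-tokenized transcripts, could be:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length.

    Both arguments are plain transcript strings; tokenization is a
    simple whitespace split (an assumption -- real evaluations usually
    normalize punctuation and case first).
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the current reference prefix
    # and the first j hypothesis words (single-row Wagner-Fischer).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev = top-left cell of the DP matrix
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,           # deletion
                d[j - 1] + 1,       # insertion
                prev + (r != h),    # substitution (0 if words match)
            )
    return d[-1] / max(len(ref), 1)
```

Running this on the development set with and without the AUTHORWAVE filter gives a concrete number for the decision; a library such as jiwer would also normalize text and report insertions/deletions separately.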
