
Dealing with constant hallucinations #121

Open
J-Korn opened this issue Sep 11, 2024 · 4 comments

Comments

@J-Korn

J-Korn commented Sep 11, 2024

Using the large-v3 model to transcribe Greek audio from a live stream, I am often met with continuous results reading "Υπότιτλοι AUTHORWAVE".

It seems the model is bugged in a way that it outputs that phrase when it does not understand the input.

Setting vac and vad to True does not seem to reduce that occurrence.

Is there some way I can discard this specific phrase, or similar ones, so they do not get confirmed and sent to the client?

@Gldkslfmsd
Collaborator

Gldkslfmsd commented Sep 12, 2024

Hi, you can check whether the same thing happens with the offline Whisper model with VAD on. If yes, then a better model can help; also make sure the sound quality is good enough.

Alternatively, just remove that phrase from all transcripts before searching for the longest common prefix. But beware: it then won't be output when you actually need it, and this won't help when Whisper hallucinates anything else.
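That removal could be sketched as a small filter applied to the timestamped word list before the longest-common-prefix step. This is a minimal sketch, assuming (beg, end, word) tuples in the style of ts_words(); the phrase list and function name are illustrative, not part of the library:

```python
# Known hallucinated boilerplate phrases to strip before the
# longest-common-prefix matching (contents are an assumption
# based on this thread).
HALLUCINATED_PHRASES = ("AUTHORWAVE",)

def strip_hallucinations(ts_words):
    """Drop timestamped words that contain a known hallucinated phrase.

    ts_words: list of (beg, end, word) tuples, as produced by ts_words().
    """
    return [
        (beg, end, word)
        for beg, end, word in ts_words
        if not any(p in word for p in HALLUCINATED_PHRASES)
    ]
```

Filtering both hypotheses consistently matters here: if the phrase survives in one transcript and not the other, it can never be part of the common prefix anyway, but stripping it everywhere avoids it blocking confirmation of surrounding words.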

@J-Korn
Author

J-Korn commented Sep 12, 2024

This is not an actual Greek phrase. My best guess is that the model was partially trained on Greek community-generated subtitles for TV shows and the like, and those had the creator's name as an advertisement during moments of silence, where actual captioning was not needed. "Υπότιτλοι" translates to "Subtitles", and "AUTHORWAVE" is not a Greek word, or any word that means anything for that matter.

This is using the large-v3 model, and I cannot find any model that does Greek better than this. Do note that the phrase also shows up when transcribing videos with the base Whisper.

For now I am attempting to remove it like this:


if self.contains_unwanted_word(tsw, "AUTHORWAVE"):
    logger.debug("Discarding transcription result due to unwanted word 'AUTHORWAVE'")
    return None, None, ""

Where I check the words retrieved from self.asr.ts_words(res) during process_iter and return early if this is found.
I am not sure this is the correct way to go about it though.
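For reference, the contains_unwanted_word helper used above might look like this minimal sketch (the tuple layout follows ts_words() as used in the snippet; the helper itself is not part of the library):

```python
def contains_unwanted_word(ts_words, phrase):
    """Return True if any timestamped word contains the given phrase.

    ts_words: list of (beg, end, word) tuples from self.asr.ts_words(res).
    """
    return any(phrase in word for _beg, _end, word in ts_words)
```

Note that returning (None, None, "") early, as above, discards the entire transcription pass, including any legitimate words in the same chunk; filtering only the offending words (as suggested earlier in the thread) is gentler.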

Edit:
Related to this, I am also trying to add a check: if 5 seconds have passed without a chunk being confirmed, just confirm everything in the buffer, in an attempt to improve latency at the cost of accuracy.
This is my current process_iter, but I feel this particular change needs to be done someplace else:

def process_iter(self):
        """Runs on the current audio buffer.
        Returns: a tuple (beg_timestamp, end_timestamp, "text"), or (None, None, "").
        The non-empty text is the confirmed (committed) partial transcript.
        """
        prompt, non_prompt = self.prompt()
        logger.debug(f"PROMPT: {prompt}")
        logger.debug(f"CONTEXT: {non_prompt}")
        logger.debug(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}")
        res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)

        tsw = self.asr.ts_words(res)

        # Check if 'AUTHORWAVE' is in the transcription result
        if self.contains_unwanted_word(tsw, "AUTHORWAVE"):
            logger.debug("Discarding transcription result due to unwanted word 'AUTHORWAVE'")
            return None, None, ""

        self.transcript_buffer.insert(tsw, self.buffer_time_offset)
        o = self.transcript_buffer.flush()
        if o:
            self.commited.extend(o)
            self.last_confirmed_time = time.time()
            completed = self.to_flush(o)
            logger.debug(f">>>>COMPLETE NOW: {completed}")
        else:
            completed = None

        current_time = time.time()
        if current_time - self.last_confirmed_time > self.confirmation_timeout:
            logger.debug("Timeout exceeded. Forcing confirmation of available text.")
            self.force_confirm_text()

        the_rest = self.to_flush(self.transcript_buffer.complete())
        logger.debug(f"INCOMPLETE: {the_rest}")

        # there is a newly confirmed text
        if o and self.buffer_trimming_way == "sentence":  # trim the completed sentences
            if len(self.audio_buffer)/self.SAMPLING_RATE > self.buffer_trimming_sec:  # longer than this
                self.chunk_completed_sentence()


        if self.buffer_trimming_way == "segment":
            s = self.buffer_trimming_sec  # trim the completed segments longer than s,
        else:
            s = 30 # if the audio buffer is longer than 30s, trim it

        if len(self.audio_buffer)/self.SAMPLING_RATE > s:
            self.chunk_completed_segment(res)

        logger.debug(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}")
        return self.to_flush(o)
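The timeout-based force_confirm_text referenced above could live in a small helper like the following sketch. Everything here is an assumption based on the snippet, not the library's actual interface: it presumes the processor exposes transcript_buffer.complete() returning unconfirmed (beg, end, word) tuples, plus the commited, last_confirmed_time, and confirmation_timeout attributes used in process_iter:

```python
import time

def force_confirm(processor, now=None):
    """Hypothetical helper: commit everything still held in the hypothesis
    buffer once confirmation_timeout seconds pass without a confirmed chunk.

    Returns the list of force-confirmed (beg, end, word) tuples, or []
    if the timeout has not elapsed yet.
    """
    now = time.time() if now is None else now
    if now - processor.last_confirmed_time <= processor.confirmation_timeout:
        return []  # nothing to force yet
    pending = processor.transcript_buffer.complete()
    if pending:
        processor.commited.extend(pending)
    processor.last_confirmed_time = now
    return pending
```

Keeping it as a separate function called at the end of process_iter (rather than inline) makes the accuracy/latency trade-off easy to toggle; note the forced words skip the usual two-pass confirmation, so hallucinations can slip through this path.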

@J-Korn
Author

J-Korn commented Sep 17, 2024

Any help on this matter would be greatly appreciated.

@Gldkslfmsd
Collaborator

Hi, I'd like to help but I'm busy now. Small advice:
Create a "development set" -- an audio recording on which the hallucination happens and on which you can measure the ASR quality quickly -- preferably by WER against a gold transcript, or at least by counting the number of hallucinated words. Measure the quality with your change (and with various parameters) and without it. Use that to decide whether to apply the change.

Btw. -- latency should be measured as well, but it can be neglected for a start.
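The suggested WER measurement needs no external dependency; a minimal sketch using word-level Levenshtein (Wagner-Fischer) distance, assuming whitespace-tokenized transcripts, could be:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length.

    Both arguments are plain transcript strings; tokenization is a
    simple whitespace split (an assumption -- real evaluations usually
    normalize punctuation and case first).
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the current reference prefix
    # and the first j hypothesis words (single-row Wagner-Fischer).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev = top-left cell of the DP matrix
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,           # deletion
                d[j - 1] + 1,       # insertion
                prev + (r != h),    # substitution (0 if words match)
            )
    return d[-1] / max(len(ref), 1)
```

Running this on the development set with and without the AUTHORWAVE filter gives a concrete number for the decision; a library such as jiwer would also normalize text and report insertions/deletions separately.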
