🗣️ feat: STT & TTS #1603

Draft
wants to merge 191 commits into
base: main

berry-13
Collaborator

@berry-13 berry-13 commented Jan 20, 2024

Summary

For STT, press the button or use Shift + Alt + L

image

For TTS, press the button (press and hold to download the audio file)

image
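
The STT shortcut mentioned in the summary (Shift + Alt + L) could be matched with a small predicate like the one below. This is only a sketch, not the code in this PR: `KeyLike` and `isSttShortcut` are hypothetical names, and rejecting Ctrl/Meta is an assumption. It compares `code` rather than `key` because holding Alt can change the character reported in `key` on some platforms.

```typescript
// Hypothetical helper; names are illustrative and not taken from the PR.
// KeyLike mirrors the subset of KeyboardEvent fields that the check needs,
// so the predicate can be exercised without a browser.
interface KeyLike {
  shiftKey: boolean;
  altKey: boolean;
  ctrlKey: boolean;
  metaKey: boolean;
  code: string; // physical key, e.g. "KeyL"
}

// Returns true only for Shift + Alt + L.
// Assumption: Ctrl and Meta must NOT be held, to avoid clashing with other shortcuts.
function isSttShortcut(e: KeyLike): boolean {
  return e.shiftKey && e.altKey && !e.ctrlKey && !e.metaKey && e.code === "KeyL";
}
```

In a browser, a `keydown` listener could pass the event straight to `isSttShortcut`, since `KeyboardEvent` has all of these fields.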

checklist

STT

  • Browser
  • OpenAI Whisper
  • Local Whisper (tested on LocalAI and HomeAssistant Whisper)
  • Azure Whisper (not yet tested, but it should work)
  • Any OpenAI-compatible STT

TTS

  • Browser
  • Elevenlabs
  • OpenAI TTS
  • Piper
  • Coqui
  • Any OpenAI-compatible TTS

TODO:

  • fix hark 🤔
  • improve STTBrowser error handling
  • handle audio files in the file upload and automatically transcribe them

UI

image

image

image

image

Speech Tab Explanation

NOTE: This explains how the automatic conversation works. To use it, you need to enable all of the settings in the Speech tab. The feature is still in beta and may not always work as expected. Currently, the TTS call is not yet triggered after the AI response.

graph TD;

    UserRequest((User Requests STT)) --> CheckLocalStorage{Check Local Storage for Engine};
    CheckLocalStorage -->|Engine Browser| AutomaticBrowser((Automatic Browser STT));
    CheckLocalStorage -->|Engine External| ExternalCheck{Check Transcription Status};
    ExternalCheck -->|Transcription Active| StopTranscription;
    ExternalCheck -->|Transcription Inactive| ListenAudio((Listen to User Audio));
    ListenAudio --> CheckAudio{Check Audio Level};
    CheckAudio -->|Below Threshold| SaveAudio;
    CheckAudio -->|Above Threshold| ContinueRecording;
    SaveAudio --> DataProviderRequest((Data Provider Request));
    DataProviderRequest --> APICall("/api/files/stt");
    APICall -->|Success| SetText((Set Text in Text Area));
    SetText -->|Auto Send Text Enabled| AutoSendRequest((Auto Send Text Request));
    AutoSendRequest --> APICall2("/chat/completions");
    APICall2 -->|Success| TriggerTTS((Trigger TTS));
    TriggerTTS --> TTSRequest((TTS Request));
    TTSRequest --> APICall3("/api/files/tts");
    APICall3 -->|Success| PlayAudio((Play Audio));
    PlayAudio -->|Playback Finished| WaitTwoSeconds;
    WaitTwoSeconds --> RepeatSTT((Repeat STT Trigger));

    subgraph Loop
    RepeatSTT --> ListenAudio;
    end

    StopTranscription((Stop Transcription));
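
The "Check Audio Level" branch in the graph above can be sketched as a small silence detector. This is a minimal sketch under stated assumptions, not the implementation in this PR: `SilenceDetector`, `threshold`, and `silenceMs` are hypothetical names, levels are assumed to be on a 0..1 scale, and requiring silence to last for a fixed duration before saving is an assumption added here (the graph only shows a threshold comparison).

```typescript
// Hypothetical sketch; none of these names come from the PR's code.
interface SilenceDetectorOptions {
  threshold: number; // levels below this count as silence (assumed 0..1 scale)
  silenceMs: number; // assumed: silence must last this long before we save
}

class SilenceDetector {
  private readonly opts: SilenceDetectorOptions;
  private silentSince: number | null = null;

  constructor(opts: SilenceDetectorOptions) {
    this.opts = opts;
  }

  // Feed one audio-level sample with its timestamp (ms).
  // "continue" maps to ContinueRecording, "save" maps to SaveAudio in the graph.
  push(level: number, nowMs: number): "continue" | "save" {
    if (level >= this.opts.threshold) {
      // Speech detected: reset the silence timer and keep recording.
      this.silentSince = null;
      return "continue";
    }
    if (this.silentSince === null) {
      // First silent sample: remember when the silence started.
      this.silentSince = nowMs;
    }
    return nowMs - this.silentSince >= this.opts.silenceMs ? "save" : "continue";
  }
}
```

Once `push` returns `"save"`, the recorded audio would be sent to the STT endpoint and the rest of the loop in the graph would proceed from there.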

Thank you @bsu3338 for the integrated browser STT & TTS
Thank you @szkiu for the Azure STT #2025


Change Type

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Testing

Checklist

  • My code adheres to this project's style guidelines
  • I have performed a self-review of my own code
  • I have commented in any complex areas of my code
  • I have made pertinent documentation changes
  • My changes do not introduce new warnings
  • I have written tests demonstrating that my changes are effective or that my feature works
  • Local unit tests pass with my changes
  • Any changes dependent on mine have been merged and published in downstream modules.

@berry-13
Collaborator Author

berry-13 commented Apr 27, 2024

> @berry-13 when do you think you will be completely finished with this PR so that it is ready to merge?

When I commit, it means the changes are ready for merging. However, since @danny-avila mentioned that he is going to refactor and fix some things, I'll keep working until he begins reviewing it. I'll also be working with him to make sure Conversation Mode works properly, since it is only partially functional at the moment.

@kneelesh48
Contributor

@berry-13 have you added support for Azure and GCP TTS in this PR?
Those are the OG TTS models. Also, ElevenLabs is expensive, and I don't like their subscription pricing model.

@berry-13
Collaborator Author

berry-13 commented May 8, 2024

> @berry-13 have you added support for Azure and GCP TTS in this PR? Those are the OG TTS models. Also, ElevenLabs is expensive, and I don't like their subscription pricing model.

I personally use ElevenLabs. It has WebSocket support and one of the best TTS models out there. I can't add Azure TTS because I don't have a key and can't get one. Google TTS is planned, and I'm working on adding support for multiple providers. I'll also add other providers in the future.

@kneelesh48
Contributor

@berry-13 I can provide you with an Azure key.

@kneelesh48 kneelesh48 mentioned this pull request May 14, 2024
@bnord01

bnord01 commented May 16, 2024

FYI: The current implementation crashes the whole application on login in Firefox.

Unexpected Application Error!
SpeechRecognition is not a constructor

initializeSpeechRecognition@http://localhost:3090/src/hooks/Input/useSpeechToTextBrowser.ts:2127:25
useSpeechToTextBrowser/<@http://localhost:3090/src/hooks/Input/useSpeechToTextBrowser.ts:2150:25

@berry-13
Collaborator Author

berry-13 commented May 16, 2024

> FYI: The current implementation crashes the whole application on login in Firefox.
>
> Unexpected Application Error!
> SpeechRecognition is not a constructor
>
> initializeSpeechRecognition@http://localhost:3090/src/hooks/Input/useSpeechToTextBrowser.ts:2127:25
> useSpeechToTextBrowser/<@http://localhost:3090/src/hooks/Input/useSpeechToTextBrowser.ts:2150:25

oh, thank you for reporting this!
I'm going to fix this now
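
One possible way to avoid a "SpeechRecognition is not a constructor" crash is to feature-detect the constructor before calling it. The sketch below is an assumption about how such a guard could look, not the fix that was applied in this PR: `SpeechGlobals`, `resolveSpeechRecognition`, and `createRecognizer` are hypothetical names. It checks both `SpeechRecognition` and the prefixed `webkitSpeechRecognition` and returns `null` when neither is a constructor, so the caller can disable browser STT instead of throwing.

```typescript
// Hypothetical guard; not the PR's actual fix.
type SpeechRecognitionCtor = new () => object;

// The subset of the global object that the guard inspects.
interface SpeechGlobals {
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
}

// Returns the constructor if the browser exposes one, otherwise null.
function resolveSpeechRecognition(globals: SpeechGlobals): SpeechRecognitionCtor | null {
  const candidate = globals.SpeechRecognition ?? globals.webkitSpeechRecognition;
  return typeof candidate === "function" ? (candidate as SpeechRecognitionCtor) : null;
}

// Creates a recognizer, or returns null when browser STT is unavailable.
function createRecognizer(globals: SpeechGlobals): object | null {
  const Ctor = resolveSpeechRecognition(globals);
  return Ctor ? new Ctor() : null;
}

// In a browser this could be called as:
//   const recognizer = createRecognizer(window as unknown as SpeechGlobals);
//   if (recognizer === null) { /* hide the browser STT option */ }
```

Taking the globals as a parameter keeps the check testable outside a browser, and a `null` result leaves it to the UI to decide how to report that browser STT is unsupported.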


Successfully merging this pull request may close these issues.

  • Enhancement: Integrate Whisper to achieve voice recognition
  • Enhancement: Incorporate Text to Speech Models