
Voice Assistant

English | 中文

A voice assistant that captures application audio, performs Voice Activity Detection (VAD) and speech recognition, and supports real-time translation. It runs on both Windows and Linux.

Features

  • Cross-platform support (Windows and Linux)
  • Application-specific audio capture (PulseAudio on Linux, Windows Audio Session API on Windows)
  • Real-time Voice Activity Detection (VAD)
  • Speech recognition using sherpa-onnx
  • Automatic language detection
  • Real-time translation support
  • 16 kHz audio sampling, matching the input rate expected by the recognition models
  • Automatic audio format conversion (stereo to mono, resampling)

Dependencies

Linux

  • PulseAudio development library (libpulse-dev)
  • sherpa-onnx
  • C++17 compiler
  • CMake 3.10 or higher

Windows

  • Windows 10 or higher
  • Visual Studio 2019 or higher with C++17 support
  • CMake 3.10 or higher
  • sherpa-onnx

Building

Linux

mkdir -p build
cd build
cmake ..
make

Windows

mkdir build
cd build
cmake -G "Visual Studio 16 2019" -A x64 ..
cmake --build . --config Release

Model Download

Before running the voice assistant, you need to download the required models:

# Create model directory
mkdir -p models/whisper

# Download the SenseVoice model (choose one of the following)
# 1. Standard model (more accurate, larger file)
wget https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/model.onnx -O models/model.onnx

# 2. Quantized model (faster, smaller file)
wget https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/model.int8.onnx -O models/model.int8.onnx

# Download the SenseVoice tokens file
wget https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/tokens.txt -O models/tokens.txt

# Download Whisper model (choose one of the following)
# 1. Tiny model (faster, moderate accuracy)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.tar.bz2
tar xvf sherpa-onnx-whisper-tiny.tar.bz2 -C models/whisper/
rm sherpa-onnx-whisper-tiny.tar.bz2

# 2. Base model (more accurate, slower)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-base.tar.bz2
tar xvf sherpa-onnx-whisper-base.tar.bz2 -C models/whisper/
rm sherpa-onnx-whisper-base.tar.bz2

# Download VAD model
wget https://github.com/snakers4/silero-vad/raw/master/src/silero_vad/data/silero_vad.onnx -O models/silero_vad.onnx

Note: After downloading, update the model paths in config.yaml to match the files you chose.

Usage

List Available Audio Sources

./voice_assistant -l

This will display a list of applications currently playing audio, along with their sink input indices.

Start Voice Assistant

./voice_assistant -s <sink-input-index> -m config.yaml

Parameters:

  • -s, --source <index>: Specify the sink input index to capture
  • -m, --model <path>: Path to the model configuration file
  • -l, --list: List available audio sources

Configuration

Model Configuration (config.yaml)

provider: "cpu"
num_threads: 4
debug: false

model:
  type: "sense_voice"
  sense_voice:
    model_path: "models/model.onnx"
    tokens_path: "models/tokens.txt"
    language: "auto"  # Supports "zh", "en", "ja", "ko", "yue", etc.
    decoding_method: "greedy_search"
    use_itn: true

vad:
  model_path: "models/silero_vad.onnx"
  threshold: 0.5  # Voice detection threshold (0.0-1.0)
  min_silence_duration: 0.5  # Minimum silence duration (seconds)
  min_speech_duration: 0.25  # Minimum speech segment duration (seconds)
  max_speech_duration: 30.0  # Maximum speech segment duration (seconds)
  window_size: 512  # Processing window size (samples)
  sample_rate: 16000  # Sample rate (Hz)
  num_threads: 1  # VAD processing threads
  debug: false  # Enable debug output

translator:
  type: "deeplx"  # Translation service type
  target_language: "zh"  # Target language for translation

Output Format

The voice assistant outputs recognition results in the following format:

[Recognition Result]
Time: <start time>s -- <end time>s
Text: <recognized text>
Language Code: <detected language>
Target Language: <translation target>
Translated Text: <translated text> (if language differs from target)
--------------------------------------------------
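
For example, an English segment with Chinese as the configured target language might print as follows (all values illustrative):

[Recognition Result]
Time: 1.234s -- 2.345s
Text: hello world
Language Code: en
Target Language: zh
Translated Text: 你好，世界
--------------------------------------------------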

Implementation Details

Audio Capture

  • Linux: Uses PulseAudio's threaded mainloop for efficient audio capture
  • Windows: Uses Windows Audio Session API (WASAPI) for low-latency audio capture
  • Automatic format conversion (see the sketch after this list):
    • Stereo to mono conversion
    • Sample rate conversion to 16kHz
    • 16-bit PCM format
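
The conversion step above can be pictured with a short, self-contained C++ sketch. It is a minimal illustration, not the project's actual code: the function names are ours, and the naive linear-interpolation resampler stands in for whatever resampler the capture path really uses.

#include <cstdint>
#include <cstddef>
#include <vector>

// Average interleaved stereo frames (L R L R ...) into mono.
std::vector<int16_t> StereoToMono(const std::vector<int16_t>& stereo) {
    std::vector<int16_t> mono(stereo.size() / 2);
    for (std::size_t i = 0; i < mono.size(); ++i) {
        int32_t sum = stereo[2 * i] + stereo[2 * i + 1];
        mono[i] = static_cast<int16_t>(sum / 2);
    }
    return mono;
}

// Naive linear-interpolation resampler (e.g. 48000 Hz -> 16000 Hz).
// A production capture path would normally use a windowed-sinc resampler.
std::vector<int16_t> Resample(const std::vector<int16_t>& in,
                              int in_rate, int out_rate) {
    if (in.empty() || in_rate == out_rate) return in;
    std::size_t out_len =
        static_cast<std::size_t>(in.size()) * out_rate / in_rate;
    std::vector<int16_t> out(out_len);
    double step = static_cast<double>(in_rate) / out_rate;
    for (std::size_t i = 0; i < out_len; ++i) {
        double pos = i * step;                       // position in input
        std::size_t idx = static_cast<std::size_t>(pos);
        double frac = pos - idx;
        int16_t a = in[idx];
        int16_t b = (idx + 1 < in.size()) ? in[idx + 1] : a;
        out[i] = static_cast<int16_t>(a + (b - a) * frac);
    }
    return out;
}

Averaging the two channels through an int32_t intermediate keeps the downmix in range without overflow or clipping.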

Voice Activity Detection

  • Uses the Silero VAD model through sherpa-onnx
  • Configurable silence/speech duration parameters
  • Window-based real-time processing (see the sketch after this list)
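
A simplified sketch of the window-based loop, assuming a hypothetical VadModel stand-in (the project actually runs the Silero VAD network via sherpa-onnx, whose API differs):

#include <cstddef>
#include <functional>
#include <vector>

constexpr std::size_t kWindowSize = 512;  // matches vad.window_size in config.yaml

// Placeholder standing in for the Silero VAD wrapper. The real model runs a
// neural network; this trivial energy gate exists only to keep the sketch
// self-contained.
struct VadModel {
    float threshold = 0.5f;  // cf. vad.threshold
    bool IsSpeech(const float* w, std::size_t n) const {
        double energy = 0.0;
        for (std::size_t i = 0; i < n; ++i) energy += w[i] * w[i];
        return energy / n > threshold * 1e-3;  // arbitrary scale for the demo
    }
};

// Buffer incoming audio and classify it in fixed 512-sample windows,
// carrying any remainder over to the next call.
void FeedAudio(VadModel& vad, std::vector<float>& pending,
               const float* samples, std::size_t n,
               const std::function<void(bool)>& on_window) {
    pending.insert(pending.end(), samples, samples + n);
    std::size_t offset = 0;
    while (pending.size() - offset >= kWindowSize) {
        on_window(vad.IsSpeech(pending.data() + offset, kWindowSize));
        offset += kWindowSize;
    }
    pending.erase(pending.begin(),
                  pending.begin() + static_cast<std::ptrdiff_t>(offset));
}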

Speech Recognition

  • Supports multiple model types through sherpa-onnx
  • Real-time processing of detected speech segments
  • Language detection and recognition timestamps (an illustrative result type follows)
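
The fields of each recognized segment map directly onto the Output Format section above. An illustrative result type, with assumed field names (not taken from the project's sources):

#include <string>

// One recognized speech segment; field names are ours, chosen to mirror
// the Output Format section.
struct RecognitionResult {
    double start_seconds;     // segment start within the capture stream
    double end_seconds;       // segment end
    std::string text;         // recognized text
    std::string language;     // detected language code, e.g. "en"
    std::string translation;  // filled only when language != target language
};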

Translation

  • Supports real-time translation through DeepLX (see the request sketch after this list)
  • Automatic language detection and translation
  • Configurable target language
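
DeepLX is typically self-hosted and exposes an HTTP endpoint. Below is a minimal libcurl sketch of one translation request; the URL, port, and JSON field names follow DeepLX's commonly documented /translate API but should be treated as assumptions and adjusted to your deployment:

#include <curl/curl.h>
#include <iostream>
#include <string>

// Collect the HTTP response body into a std::string.
static size_t WriteCb(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    // Assumed local DeepLX endpoint; adjust host and port to your setup.
    const std::string url = "http://127.0.0.1:1188/translate";
    const std::string body =
        R"({"text":"hello world","source_lang":"EN","target_lang":"ZH"})";
    std::string response;

    struct curl_slist* headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    if (curl_easy_perform(curl) == CURLE_OK)
        std::cout << response << std::endl;  // JSON containing the translation

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}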

Notes

  • Speech recognition requires appropriate model files and configuration
  • Use Ctrl+C to stop the application
  • Ensure the target application is playing audio before starting capture
  • Use the -l option to check available audio sources before starting
  • Translation is performed only when detected language differs from target language

Acknowledgments

This project is built upon and inspired by the following excellent projects:

  • sherpa-onnx: A powerful toolkit that provides the high-quality speech recognition and VAD capabilities used in this project
  • obs-studio: The audio capture implementation references OBS Studio's PulseAudio capture module

We are grateful to the developers of these projects for their outstanding work.