Run Qwen3-TTS text-to-speech AI locally on your MacBook with Apple Silicon (M1, M2, M3, M4). No cloud, no API keys, completely offline.
Keywords: Qwen TTS Mac, Qwen3 TTS Apple Silicon, MLX text to speech, local TTS Mac, voice cloning Mac, AI voice generator MacBook
- Voice Cloning - Clone any voice from a 5-second audio sample
- Voice Design - Create new voices by describing them ("deep narrator", "excited child")
- Custom Voices - 9 built-in voices with emotion and speed control
- 100% Local - Runs entirely on your Mac, no internet required
- Optimized for M-Series - Uses Apple's MLX framework for fast GPU inference
MLX models are specifically optimized for Apple Silicon. Compared to running standard PyTorch models:
| Metric | Standard Model | MLX Model |
|---|---|---|
| RAM Usage | 10+ GB | 2-3 GB |
| CPU Temperature | 80-90°C | 40-50°C |
Tested on M4 MacBook Air (fanless) with 1.7B models
MLX runs natively on the Apple Silicon CPU and GPU (via Metal) and uses unified memory, meaning better performance with less heat and battery drain.
git clone https://github.com/kapi2800/qwen3-tts-apple-silicon.git
cd qwen3-tts-apple-silicon
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
brew install ffmpeg

Pick the models you need from the table below. Click the link, then click "Download" on HuggingFace.
Pro Models (1.7B) - Best Quality
| Model | Use Case | Download |
|---|---|---|
| CustomVoice | Preset voices + emotion control | Download |
| VoiceDesign | Create voices from text description | Download |
| Base | Voice cloning from audio | Download |
Lite Models (0.6B) - Faster, Less RAM
| Model | Use Case | Download |
|---|---|---|
| CustomVoice | Preset voices + emotion control | Download |
| VoiceDesign | Create voices from text description | Download |
| Base | Voice cloning from audio | Download |
Put downloaded folders in models/:
models/
├── Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit/
├── Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit/
└── Qwen3-TTS-12Hz-1.7B-Base-8bit/
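
Since a misnamed folder is the most likely setup mistake, a quick check like the following can confirm the layout before launching. This is a minimal sketch, not part of the project: the helper name `missing_model_dirs` is hypothetical, and the folder names are simply copied from the tree above.

```python
from pathlib import Path

# Folder names copied from the layout above (Pro 1.7B, 8-bit).
EXPECTED_MODEL_DIRS = [
    "Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit",
    "Qwen3-TTS-12Hz-1.7B-Base-8bit",
]


def missing_model_dirs(models_root="models", expected=EXPECTED_MODEL_DIRS):
    """Return the expected folder names that are not directories under models_root."""
    root = Path(models_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing model folders:", ", ".join(missing))
    else:
        print("All expected model folders are present.")
```

Run it from the repository root. If you only downloaded some of the models, trim the list to match.
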
source .venv/bin/activate
python main.py

========================================
Qwen3-TTS Manager
========================================
Pro Models (1.7B - Best Quality)
---------------------------------
1. Custom Voice
2. Voice Design
3. Voice Cloning
Lite Models (0.6B - Faster)
---------------------------
4. Custom Voice
5. Voice Design
6. Voice Cloning
q. Exit
Select:
- Custom Voice: Pick from preset speakers, set emotion and speed
- Voice Design: Describe a voice (e.g., "calm British narrator")
- Voice Cloning: Provide a reference audio clip to clone
- Drag `.txt` files directly into the terminal for long text
- Voice cloning works best with clean 5-10 second audio clips
- Speed options: Normal (1.0x), Fast (1.3x), Slow (0.8x)
- Type `q` or `exit` anytime to go back
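
When you drag a file into the macOS Terminal, the pasted path typically arrives with backslash-escaped spaces or surrounding quotes. The sketch below shows one way such input could be normalized with the standard library. It is an illustration only: `clean_dragged_path` is a hypothetical helper, and how `main.py` actually handles dragged paths is not shown here.

```python
import shlex


def clean_dragged_path(raw):
    """Turn a dragged-in path (escaped or quoted) into a plain path string."""
    # shlex.split applies POSIX shell rules: it drops quotes and
    # resolves backslash escapes such as "\ " into a literal space.
    parts = shlex.split(raw.strip())
    if len(parts) != 1:
        raise ValueError("expected exactly one path, got: %r" % (parts,))
    return parts[0]


print(clean_dragged_path(r"/Users/me/My\ Notes/story.txt "))
print(clean_dragged_path("'/Users/me/My Notes/story.txt'"))
```

Both calls print the same unescaped path, `/Users/me/My Notes/story.txt`.
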
Inference optimizations are applied by monkey-patching in main.py, so nothing inside .venv is modified.
Benchmarks (MacBook Pro M4 Max 128GB, 1.7B 8-bit model):
| Optimization | Steady-state tok/s | RTF | Method |
|---|---|---|---|
| Baseline | ~25 | 2.0x | — |
| + Fused Metal kernels | ~50 | 4.0x | mx.fast.rms_norm + mx.fast.rope |
| + Grouped codebook (alpha, default) | ~90 | 4-5x | 3 serial + 4 parallel groups |
| + Grouped codebook (beta) | ~100 | 5-6x | 3 serial + 2 parallel groups |
RTF = audio duration / processing time. RTF 5x means 1 second of compute produces 5 seconds of audio. First token has ~200ms warmup; steady-state numbers shown above.
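
To make the RTF definition concrete, here is a small worked example. The function name is hypothetical and the second timing is made up for illustration; only the formula (audio duration divided by processing time) comes from the text above.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """RTF as defined above: audio duration divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# 1 second of compute producing 5 seconds of audio gives RTF 5x.
print(real_time_factor(5.0, 1.0))   # 5.0
# Hypothetical timing: a 12-second clip generated in 3 seconds.
print(real_time_factor(12.0, 3.0))  # 4.0
```

An RTF above 1.0 means audio is generated faster than it plays back.
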
Qwen3-TTS encodes audio with 16 RVQ (Residual Vector Quantization) codebooks. The Code Predictor originally predicts 15 codebooks sequentially, each step depending on the previous sample. Grouped prediction parallelizes later codebooks — sharing a single transformer forward pass and applying separate lm_head outputs.
The first 3 codebooks must remain serial: they carry the most critical coarse acoustic information (pitch, energy, fundamental timbre), and the c0→c1→c2 conditional dependency cannot be removed. Experiments confirmed that reducing to 2 serial codebooks causes fully corrupt audio. Later codebooks carry diminishing acoustic detail (resonance → high-frequency → subtle texture) and can be parallelized with minimal quality impact.
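
The effect of grouping on the number of sequential Code Predictor steps can be sketched as below. This only models the schedule, not the actual inference code. It assumes the parallel codebooks are split evenly across groups, which the text above does not specify, and the indices 0 to 14 simply number the 15 codebooks the Code Predictor handles.

```python
def build_schedule(n_codebooks=15, n_serial=3, n_groups=4):
    """Return the prediction steps; each step lists codebooks predicted together."""
    if not 0 <= n_serial <= n_codebooks:
        raise ValueError("n_serial must be between 0 and n_codebooks")
    # Serial part: one codebook per step, each conditioned on the previous one.
    steps = [[i] for i in range(n_serial)]
    remaining = list(range(n_serial, n_codebooks))
    if remaining:
        if n_groups < 1:
            raise ValueError("n_groups must be at least 1")
        # Parallel part: split the rest as evenly as possible (an assumption).
        size, extra = divmod(len(remaining), n_groups)
        start = 0
        for g in range(n_groups):
            end = start + size + (1 if g < extra else 0)
            if end > start:
                steps.append(remaining[start:end])
            start = end
    return steps


alpha = build_schedule(n_serial=3, n_groups=4)
beta = build_schedule(n_serial=3, n_groups=2)
off = build_schedule(n_serial=15)

print(len(off), len(alpha), len(beta))  # 15 7 5
print(alpha)
```

Under these assumptions, alpha needs 7 sequential steps and beta 5, compared with 15 for the original scheme. Each parallel step shares one transformer forward pass across its group.
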
| Variable | Values | Description |
|---|---|---|
| `GROUPED_CODEPRED` | `alpha` (default), `beta`, `off` | Codebook grouping scheme. Alpha is quality-safe; beta trades slight artifacts for more speed; off reverts to original 15-step sequential. |
| `TTS_PROFILE` | `1` | Enable per-step profiling (talker/sample/codepred/embed breakdown). |
| `ABLATION` | `rmsnorm`, `rope`, `baseline` | Revert specific optimizations for A/B testing. |
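
The sketch below shows how environment variables like these could be read and validated in Python. It is not the code in `main.py`: `read_settings` and the returned keys are hypothetical, and it assumes `ABLATION` takes a single value from the list in the table.

```python
import os

GROUPED_CHOICES = ("alpha", "beta", "off")
ABLATION_CHOICES = ("rmsnorm", "rope", "baseline")


def read_settings(env=None):
    """Read the three variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    grouped = env.get("GROUPED_CODEPRED", "alpha")  # alpha is the documented default
    if grouped not in GROUPED_CHOICES:
        raise ValueError("GROUPED_CODEPRED must be one of %s" % (GROUPED_CHOICES,))

    ablation = env.get("ABLATION")  # unset means no optimization is reverted
    if ablation is not None and ablation not in ABLATION_CHOICES:
        raise ValueError("ABLATION must be one of %s" % (ABLATION_CHOICES,))

    return {
        "grouped_codepred": grouped,
        "profile": env.get("TTS_PROFILE") == "1",
        "ablation": ablation,
    }


# Example with an explicit mapping instead of the real environment.
print(read_settings({"GROUPED_CODEPRED": "beta", "TTS_PROFILE": "1"}))
```

Passing a plain dict keeps the function easy to test. Calling `read_settings()` with no argument reads the real environment.
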
- Speculative decoding (Lite 0.6B as draft for Pro 1.7B): 2.7% acceptance rate. Both models share the same Code Predictor architecture, so the draft is only 1.58x faster, which is not enough for speculative decoding to be viable.
- Gamma scheme (2 serial + 13 parallel codebooks): Audio fully corrupt. Confirms 3 serial codebooks as the quality floor.
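
To see why the speculative decoding result rules the approach out, the numbers can be plugged into the usual expected-speedup estimate for speculative decoding. This is a rough sketch under stated assumptions: it treats 2.7% as an independent per-token acceptance rate, takes the draft cost as 1/1.58 of a target step, and ignores all other overhead. The function name is hypothetical.

```python
def expected_speedup(acceptance, draft_len, cost_ratio):
    """Estimated speedup over plain decoding.

    acceptance: per-token probability the target accepts a draft token (< 1)
    draft_len:  number of draft tokens proposed per round
    cost_ratio: draft step time divided by target step time
    """
    # Expected tokens produced per round (accepted drafts plus one target token).
    expected_tokens = (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)
    # Cost of one round, measured in target steps.
    round_cost = draft_len * cost_ratio + 1
    return expected_tokens / round_cost


for draft_len in (1, 2, 4, 8):
    speedup = expected_speedup(0.027, draft_len, 1 / 1.58)
    print(draft_len, round(speedup, 2))
```

Under these assumptions every draft length gives a value below 1.0, so generation would be slower than running the Pro model alone.
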
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- RAM: ~3GB for Lite models, ~6GB for Pro models
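
The first two requirements can be checked from Python before installing anything. This is a minimal sketch with a hypothetical helper name. It relies on `platform.machine()` reporting `arm64` for a native Apple Silicon Python, so a Python running under Rosetta would be flagged.

```python
import platform
import sys


def requirement_problems(system, machine, version):
    """Return a list of human-readable problems; empty means requirements are met."""
    problems = []
    if system != "Darwin":
        problems.append("macOS is required (found %s)" % system)
    if machine != "arm64":
        problems.append("Apple Silicon (arm64) is required (found %s)" % machine)
    if tuple(version[:2]) < (3, 10):
        problems.append("Python 3.10+ is required (found %d.%d)" % tuple(version[:2]))
    return problems


if __name__ == "__main__":
    found = requirement_problems(platform.system(), platform.machine(), sys.version_info)
    print("OK" if not found else "\n".join(found))
```

RAM is not checked here, since the standard library has no portable call for it.
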
| Issue | Fix |
|---|---|
| `mlx_audio not found` | Run `source .venv/bin/activate` first |
| `Model not found` | Check model folder names match exactly |
| Audio won't play | Check macOS sound output settings |
- Qwen3-TTS - Original Qwen3-TTS by Alibaba
- MLX Audio - MLX framework for audio models
- MLX Community - Pre-converted MLX models
If this project helped you, please give it a ⭐ star!