FastKoko

Dockerized FastAPI wrapper for the Kokoro-82M text-to-speech model

Multi-language support (English, Japanese, Korean, Chinese, Vietnamese)
OpenAI-compatible Speech endpoint, with NVIDIA GPU-accelerated or CPU inference via PyTorch
ONNX support coming soon; in the interim, see v0.1.5 and earlier for legacy ONNX support
Debug endpoints for monitoring threads, storage, and session pools
Integrated web UI on localhost:8880/web
Phoneme-based audio generation, phoneme generation
(new) Per-word timestamped caption generation
(new) Voice mixing with weighted combinations

Get Started

Quickest Start (docker run)

Pre-built images are available, with arm/multi-arch support and baked-in models. Refer to the core/config.py file for a full list of variables that can be managed via the environment.

docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.1.4  # CPU, or:
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:v0.1.4  # NVIDIA GPU

Quick Start (docker compose)

Install the prerequisites and start the service using Docker Compose (full setup, including the UI):

Install Docker

Clone the repository:

git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI

cd docker/gpu  # For GPU support
# or cd docker/cpu  # For CPU support
docker compose up --build

# Models will auto-download, but if needed you can manually download:
python docker/scripts/download_model.py --output api/src/models/v1_0

# Or run directly via UV:
./start-gpu.sh  # For GPU support
./start-cpu.sh  # For CPU support

Direct Run (via uv)

Install prerequisites:

Install astral-uv

Clone the repository:

git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI

# if you are missing any models, run:
# python ../scripts/download_model.py --type pth  # for GPU
# python ../scripts/download_model.py --type onnx # for CPU

Start directly via UV (with hot-reload):

./start-cpu.sh  # or:
./start-gpu.sh

Up and Running?

Run locally as an OpenAI-Compatible Speech Endpoint

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed"
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_sky+af_bella",  # single voice or a combination of voicepacks
    input="Hello world!",
) as response:
    response.stream_to_file("output.mp3")

The API will be available at http://localhost:8880
API Documentation: http://localhost:8880/docs
Web Interface: http://localhost:8880/web
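
The service may take a moment to answer after the container starts, so a client script may want to wait before sending its first request. The following is a minimal sketch that polls the documentation URL listed above using only the standard library. The helper names `docs_probe` and `wait_until_ready` are illustrative and not part of this project, and the timeout values are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def docs_probe(url: str = "http://localhost:8880/docs") -> bool:
    """Return True if the docs page answers with HTTP 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0
) -> bool:
    """Call probe() until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready(docs_probe)` before the first request returns `True` once the server responds, or `False` if it does not come up within the timeout.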

Features

OpenAI-Compatible Speech Endpoint

# Using OpenAI's Python library
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="kokoro",  
    voice="af_bella+af_sky", # see /api/src/core/openai_mappings.json to customize
    input="Hello world!",
    response_format="mp3"
)

response.stream_to_file("output.mp3")

Or Via Requests:

import requests

# List available voices
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Generate audio
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",  # Supported: mp3, wav, opus, flac, aac, pcm
        "speed": 1.0
    }
)

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)

Quick tests (run from another terminal):

python examples/assorted_checks/test_openai/test_openai_tts.py # Test OpenAI Compatibility
python examples/assorted_checks/test_voices/test_all_voices.py # Test all available voices

Voice Combination

Weighted voice combinations using ratios (e.g., "af_bella(2)+af_heart(1)" for 67%/33% mix)
Ratios are automatically normalized to sum to 100%
Available through any endpoint by adding weights in parentheses
Saves generated voicepacks for future use
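
The normalization described above is plain arithmetic: each weight is divided by the sum of all weights. The sketch below illustrates it on the client side. It is not the server's parser; `parse_voice_mix` is a hypothetical helper, and treating a voice without parentheses as weight 1 is an assumption based on the equal-weight example `af_bella+af_sky`.

```python
import re
from typing import Dict

# One term looks like "af_bella" or "af_bella(2)"; the weight is optional.
_TERM = re.compile(r"\s*([A-Za-z0-9_]+)\s*(?:\(\s*([0-9]*\.?[0-9]+)\s*\))?\s*")


def parse_voice_mix(spec: str) -> Dict[str, float]:
    """Parse 'name(weight)+name(weight)' and return weights that sum to 1.0."""
    weights: Dict[str, float] = {}
    for part in spec.split("+"):
        match = _TERM.fullmatch(part)
        if match is None:
            raise ValueError(f"Cannot parse voice term: {part!r}")
        name, raw_weight = match.group(1), match.group(2)
        # Assumption: a missing weight counts as 1.
        weight = float(raw_weight) if raw_weight else 1.0
        weights[name] = weights.get(name, 0.0) + weight
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("Weights must sum to a positive value")
    return {name: weight / total for name, weight in weights.items()}
```

For `"af_bella(2)+af_heart(1)"` this yields two thirds for `af_bella` and one third for `af_heart`, which matches the 67%/33% mix quoted above.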

Combine voices and generate audio:

import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Example 1: Simple voice combination (50%/50% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella+af_sky",  # Equal weights
        "response_format": "mp3"
    }
)

# Example 2: Weighted voice combination (67%/33% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella(2)+af_sky(1)",  # 2:1 ratio = 67%/33%
        "response_format": "mp3"
    }
)

# Example 3: Download combined voice as .pt file
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json="af_bella(2)+af_sky(1)"  # 2:1 ratio = 67%/33%
)

# Save the .pt file
with open("combined_voice.pt", "wb") as f:
    f.write(response.content)

# Use the downloaded voice file
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "combined_voice",  # Use the saved voice file
        "response_format": "mp3"
    }
)

Multiple Output Audio Formats

mp3
wav
opus
flac
aac
pcm
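
A client can reject an unsupported format before sending a request. The sketch below builds a request body using the field names from the request examples above and validates `response_format` against the list of formats. `build_speech_payload` and `output_filename` are hypothetical helpers, and using the format name as the file extension is an assumption for illustration.

```python
from typing import Any, Dict

# Formats taken from the list above.
SUPPORTED_FORMATS = ("mp3", "wav", "opus", "flac", "aac", "pcm")


def build_speech_payload(
    text: str,
    voice: str = "af_bella",
    response_format: str = "mp3",
    speed: float = 1.0,
) -> Dict[str, Any]:
    """Return a JSON body for /v1/audio/speech, rejecting unknown formats."""
    fmt = response_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported format {response_format!r}; expected one of {SUPPORTED_FORMATS}"
        )
    return {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": fmt,
        "speed": speed,
    }


def output_filename(stem: str, response_format: str) -> str:
    """Name the output file after the requested format (assumed convention)."""
    return f"{stem}.{response_format.lower()}"
```

The resulting dictionary can be passed as `json=build_speech_payload("Hello world!", response_format="flac")` to `requests.post`, as in the earlier examples.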

Streaming Support

# OpenAI-compatible streaming
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed")

# Stream to file
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Hello world!"
) as response:
    response.stream_to_file("output.mp3")

# Stream to speakers (requires PyAudio)
import pyaudio
player = pyaudio.PyAudio().open(
    format=pyaudio.paInt16, 
    channels=1, 
    rate=24000, 
    output=True
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    response_format="pcm",
    input="Hello world!"
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        player.write(chunk)

Or via requests:

import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "pcm"
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks
        pass
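
One way to process the streamed chunks is to wrap the raw PCM in a WAV container as it arrives. The sketch below uses the standard `wave` module and assumes the `pcm` stream is 16-bit mono at 24000 Hz, matching the PyAudio settings in the example above. `write_pcm_chunks_to_wav` is a hypothetical helper name.

```python
import wave
from typing import Iterable

SAMPLE_RATE = 24000  # Hz, assumed from the PyAudio example (rate=24000)
SAMPLE_WIDTH = 2     # bytes per sample, assumed from paInt16
CHANNELS = 1         # mono, assumed from channels=1


def write_pcm_chunks_to_wav(chunks: Iterable[bytes], path: str) -> int:
    """Write raw PCM chunks to a WAV file and return the number of PCM bytes written."""
    total_bytes = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            if chunk:  # skip keep-alive or empty chunks
                wav_file.writeframesraw(chunk)
                total_bytes += len(chunk)
    # Closing the file patches the WAV header with the final data length.
    return total_bytes
```

With the streaming request above, this could be called as `write_pcm_chunks_to_wav(response.iter_content(chunk_size=1024), "output.wav")`.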

Key Streaming Metrics:

First token latency @ chunksize
- ~300ms (GPU) @ 400
- ~3500ms (CPU) @ 200 (older i7)
- <1s (CPU) @ 200 (M3 Pro)
Adjustable chunking settings for real-time playback

Note: Artifacts in intonation can increase with smaller chunks
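
To compare your own setup against the latency figures above, the time until the first non-empty audio chunk can be measured on the client. This is a rough sketch, not the method used to produce the numbers above. `first_chunk_latency` is a hypothetical helper that takes a callable so the timer starts before the request is issued.

```python
import time
from typing import Callable, Iterable, Optional


def first_chunk_latency(open_stream: Callable[[], Iterable[bytes]]) -> Optional[float]:
    """Return seconds from opening the stream to the first non-empty chunk.

    `open_stream` should start the request and return an iterable of byte chunks.
    Returns None if the stream ends without producing any data.
    """
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    return None
```

With `requests`, the callable could be `lambda: requests.post(url, json=payload, stream=True).iter_content(chunk_size=1024)`, where `url` and `payload` are the values from the streaming example above.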

Processing Details

Performance Benchmarks

Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on:

Windows 11 Home w/ WSL2
NVIDIA 4060Ti 16GB GPU @ CUDA 12.1
11th Gen i7-11700 @ 2.5GHz
64GB RAM
WAV native output
H.G. Wells - The Time Machine (full text)

Key Performance Metrics:

Realtime Speed: Ranges between 35x and 100x (ratio of output audio length to generation time)
Average Processing Rate: 137.67 tokens/second (cl100k_base)
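
The realtime speed figure is the ratio of output audio duration to processing time. The short sketch below works through that arithmetic; the function names are illustrative and not part of the project.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Invert the ratio: time needed to generate `audio_seconds` at a given factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# ~1.5 hours of output (5400 s) at the two ends of the reported range:
slow = estimated_processing_seconds(5400, 35)   # roughly 154 s
fast = estimated_processing_seconds(5400, 100)  # 54 s
```

At 35x, a 1.5 hour output takes roughly two and a half minutes to generate; at 100x it takes under a minute.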

GPU Vs. CPU

# GPU: Requires NVIDIA GPU with CUDA 12.1 support (~35x-100x realtime speed)
cd docker/gpu
docker compose up --build

# CPU: PyTorch CPU inference
cd docker/cpu
docker compose up --build

Note: Overall speed may have been reduced somewhat by the structural changes made to accommodate streaming. Looking into it.

Natural Boundary Detection

Automatically splits and stitches at sentence boundaries
Helps to reduce artifacts and allows long-form processing, as the base model is currently only configured for approximately 30s of output
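
The general idea of splitting at sentence boundaries and grouping sentences into model-sized chunks can be sketched as follows. This is an illustration only and not the splitter used by this project: the regular expression, the character budget `max_chars`, and both function names are assumptions.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def group_sentences(sentences: List[str], max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio concatenated in order, so that cuts fall between sentences rather than inside them.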

Timestamped Captions & Phonemes

Generate audio with word-level timestamps:

import requests
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "wav"
    }
)

# Get timestamps from header
timestamps = json.loads(response.headers['X-Word-Timestamps'])
print("Word-level timestamps:")
for ts in timestamps:
    print(f"{ts['word']}: {ts['start_time']:.3f}s - {ts['end_time']:.3f}s")

# Save audio
with open("output.wav", "wb") as f:
    f.write(response.content)
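
The word-level timestamps can be turned into a subtitle file. The sketch below converts a list of entries with the `word`, `start_time`, and `end_time` fields used above into SubRip (SRT) text, one cue per word. Both helper names are hypothetical.

```python
from typing import Dict, List, Union

Timestamp = Dict[str, Union[str, float]]


def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def timestamps_to_srt(timestamps: List[Timestamp]) -> str:
    """Build SRT text with one numbered cue per word."""
    cues = []
    for index, ts in enumerate(timestamps, start=1):
        start = format_srt_time(float(ts["start_time"]))
        end = format_srt_time(float(ts["end_time"]))
        cues.append(f"{index}\n{start} --> {end}\n{ts['word']}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

With the response above, `timestamps_to_srt(timestamps)` can be written to `output.srt` next to `output.wav`.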

Phoneme & Token Routes

Convert text to phonemes and/or generate audio directly from phonemes:

import requests

def get_phonemes(text: str, language: str = "a"):
    """Get phonemes and tokens for input text"""
    response = requests.post(
        "http://localhost:8880/dev/phonemize",
        json={"text": text, "language": language}  # "a" for American English
    )
    response.raise_for_status()
    result = response.json()
    return result["phonemes"], result["tokens"]

def generate_audio_from_phonemes(phonemes: str, voice: str = "af_bella"):
    """Generate audio from phonemes"""
    response = requests.post(
        "http://localhost:8880/dev/generate_from_phonemes",
        json={"phonemes": phonemes, "voice": voice},
        headers={"Accept": "audio/wav"}
    )
    if response.status_code != 200:
        print(f"Error: {response.text}")
        return None
    return response.content

# Example usage
text = "Hello world!"
try:
    # Convert text to phonemes
    phonemes, tokens = get_phonemes(text)
    print(f"Phonemes: {phonemes}")  # e.g. ðɪs ɪz ˈoʊnli ɐ tˈɛst
    print(f"Tokens: {tokens}")      # Token IDs including start/end tokens

    # Generate and save audio
    if audio_bytes := generate_audio_from_phonemes(phonemes):
        with open("speech.wav", "wb") as f:
            f.write(audio_bytes)
        print(f"Generated {len(audio_bytes)} bytes of audio")
except Exception as e:
    print(f"Error: {e}")

See examples/phoneme_examples/generate_phonemes.py for a sample script.

Debug Endpoints

Monitor system state and resource usage with these endpoints:

/debug/threads - Get thread information and stack traces
/debug/storage - Monitor temp file and output directory usage
/debug/system - Get system information (CPU, memory, GPU)
/debug/session_pools - View ONNX session and CUDA stream status

Useful for debugging resource exhaustion or performance issues.
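
A small script can collect all four endpoints in one pass. The sketch below uses only the standard library and returns the raw response bodies without assuming their structure. It assumes the endpoints answer plain GET requests on the same host and port as the API, which the text above does not state explicitly; `debug_urls` and `fetch_debug_snapshot` are hypothetical helpers.

```python
import urllib.request
from typing import Dict, List

# Paths taken from the list above.
DEBUG_PATHS = (
    "/debug/threads",
    "/debug/storage",
    "/debug/system",
    "/debug/session_pools",
)


def debug_urls(base_url: str = "http://localhost:8880") -> List[str]:
    """Return the full URL for each debug endpoint."""
    return [base_url.rstrip("/") + path for path in DEBUG_PATHS]


def fetch_debug_snapshot(
    base_url: str = "http://localhost:8880", timeout: float = 5.0
) -> Dict[str, str]:
    """GET each debug endpoint (assumed method) and return the raw body text per URL."""
    snapshot: Dict[str, str] = {}
    for url in debug_urls(base_url):
        with urllib.request.urlopen(url, timeout=timeout) as response:
            snapshot[url] = response.read().decode("utf-8", errors="replace")
    return snapshot
```

Printing the result of `fetch_debug_snapshot()` at intervals gives a simple before/after comparison when chasing a resource leak.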

Known Issues

Versioning & Development

I'm doing what I can to keep things stable, but we are on an early and rapid set of build cycles here. If you run into trouble, you may need to roll back to an earlier release tag, or build from source and/or troubleshoot and submit a PR. I will leave the branch up here for the last known stable points:

v0.0.5post1

Free and open source is a community effort, and I love working on this project, though there are only so many hours in a day. If you'd like to support the work, feel free to open a PR, buy me a coffee, or report any bugs, feature requests, or other issues you find during use.

Linux GPU Permissions

Some Linux users may encounter GPU permission issues when running as non-root. I can't guarantee anything, but here are some common solutions. Consider your security requirements carefully.

Option 1: Container Groups (Likely the best option)

services:
  kokoro-tts:
    # ... existing config ...
    group_add:
      - "video"
      - "render"

Option 2: Host System Groups

services:
  kokoro-tts:
    # ... existing config ...
    user: "${UID}:${GID}"
    group_add:
      - "video"

Note: May require adding the host user to the groups (sudo usermod -aG docker,video $USER) and a system restart.

Option 3: Device Permissions (Use with caution)

services:
  kokoro-tts:
    # ... existing config ...
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-uvm:/dev/nvidia-uvm

⚠️ Warning: Reduces system security. Use only in development environments.

Prerequisites: NVIDIA GPU, drivers, and container toolkit must be properly configured.

Visit the NVIDIA Container Toolkit installation guide for more detailed information.

Model and License

Model

This API uses the Kokoro-82M model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.

License

This project is licensed under the Apache License 2.0 - see below for details:

The Kokoro model weights are licensed under Apache 2.0 (see model page)
The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0
