SumGuy's Ramblings

Piper vs Coqui: Text-to-Speech on Your Own Hardware (Because AWS Polly Charges Per Character Like It's 1999 SMS)

Your House Shouldn’t Have to Phone Home to Talk Back to You

So you’ve got Home Assistant running. You’ve got automations that tell you when the laundry is done, when someone rings the doorbell, or when the server room hits 40°C and it’s time to panic. You want those automations to speak to you — an actual voice, not a push notification you’ll ignore for six hours.

You google “text-to-speech for Home Assistant” and end up staring at a list of cloud options: Google Cloud TTS, Amazon Polly, Azure Cognitive Services. They all work great. They also all want your words, your API key, your credit card, and eventually your firstborn.

Amazon Polly, bless its heart, charges you per character. Per character. Like we’re back in the era of SMS overage fees and Snake on a Nokia. You want your house to say “The washing machine is done” and suddenly you’re doing math on whether full stops cost extra.

Here’s the good news: you don’t need any of that. Piper and Coqui are two excellent local TTS engines you can run entirely on your own hardware, completely offline, with Docker, and without a billing dashboard giving you anxiety.

Let’s break them both down.


Piper TTS: The Featherweight Champion

Piper is a fast, lightweight, offline neural TTS system developed by the folks at Nabu Casa (the people behind Home Assistant’s cloud service, ironically). It’s written in C++, uses ONNX models, and runs startlingly well on low-power hardware.

Why Piper is great:

  - Fast: real-time or better, even on a Raspberry Pi 4
  - Tiny: voice models are 40–130MB and it idles in a few hundred megabytes of RAM
  - Fully offline: no API keys, no accounts, no network calls
  - Speaks the Wyoming protocol natively, so Home Assistant picks it up with zero glue code
  - Hundreds of voices to choose from

What Piper is not:

Piper isn’t trying to fool anyone into thinking they’re talking to a human. The voices are clean and natural, but they’re neural TTS voices — not voice clones of your nan. It’s expressive enough for home automation announcements, audiobook reading, or accessibility tools. It’s not trying to win an Oscar.

Running Piper with Docker

The official Piper Docker image is tiny and painless:

docker run --rm \
  -p 10200:10200 \
  -v /path/to/piper-data:/data \
  rhasspy/wyoming-piper \
  --voice en_US-lessac-medium

That’s it. That’s the whole thing. It exposes a Wyoming protocol endpoint that Home Assistant can talk to natively.

If you want to use Piper directly from the command line, use the standalone piper binary (rather than the Wyoming container, which runs as a server):

echo 'The washing machine is done.' | \
  piper --model en_US-lessac-medium.onnx \
  --output-raw | \
  aplay -r 22050 -f S16_LE -c 1 -t raw -

Pipe in text, get audio out. Absurdly simple.

Docker Compose for Piper

services:
  piper:
    image: rhasspy/wyoming-piper
    restart: unless-stopped
    ports:
      - "10200:10200"
    volumes:
      - ./piper-data:/data
    command: --voice en_US-lessac-medium --uri tcp://0.0.0.0:10200

Home Assistant Integration

In Home Assistant, go to Settings > Devices & Services > Add Integration and search for Wyoming Protocol. Point it at your Piper container’s IP and port 10200. Done. You now have local TTS that works when your internet is down, when AWS is having a bad day, or when you just don’t want Amazon knowing your washing machine schedule.
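Once the integration is added, Piper shows up as an ordinary tts entity that any automation can call. A minimal sketch of the laundry announcement from earlier — the entity IDs here (tts.piper, binary_sensor.washing_machine_running, media_player.kitchen_speaker) are placeholders; yours will differ:

```yaml
# Example automation: announce when the washing machine finishes.
automation:
  - alias: "Announce laundry done"
    trigger:
      - platform: state
        entity_id: binary_sensor.washing_machine_running
        to: "off"
    action:
      - service: tts.speak
        target:
          entity_id: tts.piper
        data:
          media_player_entity_id: media_player.kitchen_speaker
          message: "The washing machine is done."
```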


Coqui TTS: The Expressive One With More Opinions

Coqui TTS is a different beast. It’s Python-based, heavier, slower, and significantly more capable in terms of voice quality and flexibility. Coqui supports multiple model architectures including VITS (the engine behind the multi-speaker VCTK model), YourTTS, and the impressive XTTS v2, which can clone a voice from a short audio clip.

Why Coqui is great:

  - Noticeably more natural, expressive voices
  - Voice cloning from a short audio clip (XTTS v2)
  - Multiple model architectures to pick from
  - Ships with a web UI and a REST API out of the box
  - The multi-speaker VCTK model alone gives you 100+ voices

What Coqui is not:

Fast. Or light. Running XTTS v2 without a GPU is a “go make a coffee” situation. On CPU, you’re looking at 10–30 seconds for a paragraph. On a Raspberry Pi, you’re measuring in geologic time. Coqui is best suited for a beefy NUC, a homelab server, or anything with a GPU gathering dust.

Also worth noting: Coqui TTS (the company) shut down in 2023. The original coqui-ai/TTS repo is no longer actively maintained; the open-source project continues in community forks such as idiap/coqui-ai-TTS, but the hosted SaaS is gone. The software lives on; just be aware of where you’re pulling from.

Running Coqui with Docker

docker run --rm -it \
  -p 5002:5002 \
  --entrypoint python3 \
  ghcr.io/coqui-ai/tts-cpu \
  TTS/server/server.py \
  --model_name tts_models/en/vctk/vits

This starts a web server with a built-in UI and REST API at http://localhost:5002.

Docker Compose for Coqui

services:
  coqui-tts:
    image: ghcr.io/coqui-ai/tts-cpu
    restart: unless-stopped
    ports:
      - "5002:5002"
    volumes:
      - ./tts-models:/root/.local/share/tts
    entrypoint: python3
    command: >
      TTS/server/server.py
      --model_name tts_models/en/vctk/vits

Using the API

Once it’s running, you can hit the API directly:

curl -G "http://localhost:5002/api/tts" \
  --data-urlencode "text=Hello from my homelab" \
  --data-urlencode "speaker_id=p230" \
  -o output.wav

The VCTK model has 100+ speaker IDs. Some are better than others. p230 is solid. p267 is good for narration. You’ll spend 20 minutes listening to them all — this is normal and expected.
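When curl gets tedious, the same endpoint is easy to script. A quick Python sketch against the server started above — the base URL and speaker IDs are carried over from the examples, so adjust to taste:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5002"  # the Coqui server from the compose file above

def tts_url(text, speaker_id="p230"):
    """Build the GET URL for the server's /api/tts endpoint."""
    return f"{BASE_URL}/api/tts?" + urlencode({"text": text, "speaker_id": speaker_id})

def synthesize(text, speaker_id="p230", out_path="output.wav"):
    """Fetch synthesized speech and write it out as a WAV file."""
    with urlopen(tts_url(text, speaker_id)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path

# Usage: synthesize("Hello from my homelab", speaker_id="p267")
```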


OpenedAI-Speech: The Drop-In OpenAI API Wrapper

If you’re using tools or apps that already know how to talk to OpenAI’s TTS API, there’s a project called openedai-speech that wraps either Piper or Coqui in an OpenAI-compatible /v1/audio/speech endpoint. Drop it in front of your local TTS and suddenly any OpenAI client just works, pointed at your server instead of theirs.

services:
  openedai-speech:
    image: ghcr.io/matatonic/openedai-speech
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - ./voices:/app/voices
      - ./config:/app/config
    environment:
      - TTS_HOME=/app/voices

This is particularly useful if you’re running local LLMs and want the full stack local — chat, reasoning, and speech output — without a single request leaving your network.
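To show what “OpenAI-compatible” buys you, here’s a sketch that builds a request in OpenAI’s /v1/audio/speech shape and points it at the local container. The voice name "alloy" is OpenAI’s, which openedai-speech maps to a local voice via its config; the port matches the compose file above:

```python
import json
from urllib.request import Request, urlopen

def speech_request(text, base_url="http://localhost:8000",
                   model="tts-1", voice="alloy"):
    """Build a POST request in OpenAI's /v1/audio/speech format."""
    payload = json.dumps({"model": model, "input": text, "voice": voice})
    return Request(
        f"{base_url}/v1/audio/speech",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def save_speech(text, path="announcement.mp3"):
    """POST the text and write the returned audio to a file."""
    with urlopen(speech_request(text)) as resp:
        with open(path, "wb") as f:
            f.write(resp.read())

# Usage: save_speech("Motion detected in the back garden.")
```

Because the endpoint matches OpenAI’s schema, the official openai Python client should also work if you point its base_url at your server instead of api.openai.com.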


Comparison Table

| Feature           | Piper                         | Coqui TTS                         |
|-------------------|-------------------------------|-----------------------------------|
| Language          | C++                           | Python                            |
| Speed (CPU)       | Very fast (real-time on Pi 4) | Slow to moderate                  |
| RAM Usage         | ~100–300MB                    | 1–4GB+                            |
| GPU Required      | No                            | No, but recommended for XTTS      |
| Voice Quality     | Good (clean neural TTS)       | Better (more natural)             |
| Voice Cloning     | No                            | Yes (XTTS v2)                     |
| Home Assistant    | Native Wyoming support        | Via API or custom component       |
| Model Size        | 40–130MB per voice            | 1–6GB per model                   |
| Docker Image Size | Small (~200MB)                | Large (~3–5GB)                    |
| Maintenance       | Active (Nabu Casa)            | Community fork                    |
| Best For          | Pi, HA, low-power devices     | Homelab server, expressive voices |

Use Cases: Which One Should You Actually Use?

Use Piper if:

  - You’re running Home Assistant and want native Wyoming integration
  - You’re on a Raspberry Pi or other low-power hardware
  - You need real-time announcements without a GPU

Use Coqui if:

  - Voice quality matters more than latency
  - You want voice cloning with XTTS v2
  - You have a homelab server (ideally with a GPU) to throw at it

Use both if:

You’re the kind of person reading a self-hosting blog on a Wednesday evening and you think “why not” is a complete sentence. Run Piper for fast real-time stuff, Coqui for anything where quality beats speed. Use openedai-speech to abstract the difference away. Call it an architecture.


Privacy: The Actual Point

Let’s circle back to why this matters beyond the nerd satisfaction of running your own stack.

When you use cloud TTS, every string of text you convert to audio goes to a third-party server. Your home automation logs, your accessibility reads, your drafts, your messages — all of it. Most providers say they don’t store it. Most providers also have terms of service that are 47 pages long and were last written by a lawyer whose primary goal was not your comfort.

Local TTS means your text stays on your hardware. Full stop. When your house says “Motion detected in the back garden at 2am,” that sentence doesn’t travel to a data center in Virginia first.

For most people, this probably doesn’t matter most of the time. But it’s your data, it’s your house, and running local TTS is not actually hard anymore. So why wouldn’t you?


Voice Model Tips

For Piper, the model naming convention tells you everything you need: en_US-lessac-medium breaks down as language (en_US), speaker name (lessac), and quality tier (x_low, low, medium, or high). Higher quality tiers sound better but use more CPU. Start with medium — it’s the sweet spot for most hardware. The full model list lives at rhasspy/piper-voices on GitHub and there are hundreds of them.
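That naming convention is regular enough to parse mechanically, which is handy if you’re scripting voice downloads. A small sketch:

```python
def parse_voice_name(name):
    """Split a Piper voice name like 'en_US-lessac-medium' into its parts.

    The language code comes first and the quality tier comes last, so we
    split from both ends; whatever remains in the middle is the speaker.
    """
    language, rest = name.split("-", 1)
    speaker, quality = rest.rsplit("-", 1)
    return {"language": language, "speaker": speaker, "quality": quality}
```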

For Coqui, the first time you specify a model name in the API or CLI, it downloads it automatically to ~/.local/share/tts. This is convenient but can catch you off-guard on a server with limited disk space. XTTS v2 in particular is around 2GB. Set a volume mount in your Docker compose so those downloads survive container restarts and don’t vanish when you pull an updated image.

One practical tip: both engines can output to a file rather than playing audio directly. That means you can pre-generate common phrases — “Good morning”, “Front door opened”, “Server room temperature critical” — and serve them as static audio files. Zero inference latency for your most common announcements, and your Raspberry Pi barely notices.
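Here’s what that pre-generation step might look like, assuming the standalone piper binary and a downloaded voice model are on the box — the phrase list and file-naming helper are just one way to do it:

```python
import subprocess
from pathlib import Path

PHRASES = [
    "Good morning",
    "Front door opened",
    "Server room temperature critical",
]

def slug(text):
    """Turn a phrase into a safe file name: 'Front door opened' -> 'front_door_opened'."""
    return "".join(c if c.isalnum() else "_" for c in text.lower()).strip("_")

def pregenerate(phrases, model="en_US-lessac-medium.onnx", out_dir="phrases"):
    """Render each phrase to a WAV exactly once, using the piper CLI on PATH."""
    Path(out_dir).mkdir(exist_ok=True)
    for text in phrases:
        out_file = Path(out_dir) / f"{slug(text)}.wav"
        subprocess.run(
            ["piper", "--model", model, "--output_file", str(out_file)],
            input=text.encode("utf-8"),
            check=True,
        )

# Usage: pregenerate(PHRASES)  # run once, then serve the WAVs as static files
```

Point your announcements at the resulting WAV files and the TTS engine only ever runs once per phrase.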


Getting Started Checklist

  1. Decide: low-power/HA (Piper) or high-quality/server (Coqui)
  2. Pull the relevant Docker image
  3. Download a voice model (Piper model library is at rhasspy/piper-voices, Coqui downloads on first run)
  4. Test with a simple curl or CLI command
  5. If using Home Assistant, add Wyoming integration (Piper) or configure a custom TTS component (Coqui)
  6. If you want OpenAI API compatibility, drop openedai-speech in front of it
  7. Never pay per character again

That’s genuinely it. Both projects have decent documentation, active communities, and work well in Docker. The barrier here is lower than you think.

Your house can talk. It doesn’t need to call home to do it.


Filed under: self-hosting, home assistant, tts, docker, linux, ai

