SumGuy's Ramblings

Piper vs Coqui: Text-to-Speech on Your Own Hardware (Because AWS Polly Charges Per Character Like It's 1999 SMS)

Your House Shouldn’t Have to Phone Home to Talk Back to You

So you’ve got Home Assistant running. You’ve got automations that tell you when the laundry is done, when someone rings the doorbell, or when the server room hits 40°C and it’s time to panic. You want those automations to speak to you — an actual voice, not a push notification you’ll ignore for six hours.

You google “text-to-speech for Home Assistant” and end up staring at a list of cloud options: Google Cloud TTS, Amazon Polly, Azure Cognitive Services. They all work great. They also all want your words, your API key, your credit card, and eventually your firstborn.

Amazon Polly, bless its heart, charges you per character. Per character. Like we’re back in the era of SMS overage fees and Snake on a Nokia. You want your house to say “The washing machine is done” and suddenly you’re doing math on whether full stops cost extra.

Here’s the good news: you don’t need any of that. Piper and Coqui are two excellent local TTS engines you can run entirely on your own hardware, completely offline, with Docker, and without a billing dashboard giving you anxiety.

Let’s break them both down.


Piper TTS: The Featherweight Champion

Piper is a fast, lightweight, offline neural TTS system developed by the folks at Nabu Casa (the people behind Home Assistant’s cloud service, ironically). It’s written in C++, uses ONNX models, and runs startlingly well on low-power hardware.

Why Piper is great:

  - Fast: real-time or better, even on a Raspberry Pi 4
  - Tiny: voice models are 40–130MB and it idles in a few hundred megabytes of RAM
  - Fully offline: no API keys, no accounts, no network calls
  - Speaks the Wyoming protocol natively, so Home Assistant picks it up with zero glue code
  - Hundreds of voices to choose from

What Piper is not:

Piper isn’t trying to fool anyone into thinking they’re talking to a human. The voices are clean and natural, but they’re neural TTS voices — not voice clones of your nan. It’s expressive enough for home automation announcements, audiobook reading, or accessibility tools. It’s not trying to win an Oscar.

Running Piper with Docker

The official Piper Docker image is tiny and painless:

docker run --rm \
  -p 10200:10200 \
  -v /path/to/piper-data:/data \
  rhasspy/wyoming-piper \
  --voice en_US-lessac-medium

That’s it. That’s the whole thing. It exposes a Wyoming protocol endpoint that Home Assistant can talk to natively.

If you want to use Piper directly from the command line, use the standalone piper binary (rather than the Wyoming container, which runs as a server):

echo 'The washing machine is done.' | \
  piper --model en_US-lessac-medium.onnx \
  --output-raw | \
  aplay -r 22050 -f S16_LE -c 1 -t raw -

Pipe in text, get audio out. Absurdly simple.

Docker Compose for Piper

services:
  piper:
    image: rhasspy/wyoming-piper
    restart: unless-stopped
    ports:
      - "10200:10200"
    volumes:
      - ./piper-data:/data
    command: --voice en_US-lessac-medium --uri tcp://0.0.0.0:10200

Home Assistant Integration

In Home Assistant, go to Settings > Devices & Services > Add Integration and search for Wyoming Protocol. Point it at your Piper container’s IP and port 10200. Done. You now have local TTS that works when your internet is down, when AWS is having a bad day, or when you just don’t want Amazon knowing your washing machine schedule.
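Once the integration is added, Piper shows up as an ordinary tts entity that any automation can call. A minimal sketch of the laundry announcement from earlier — the entity IDs here (tts.piper, binary_sensor.washing_machine_running, media_player.kitchen_speaker) are placeholders; yours will differ:

```yaml
# Example automation: announce when the washing machine finishes.
automation:
  - alias: "Announce laundry done"
    trigger:
      - platform: state
        entity_id: binary_sensor.washing_machine_running
        to: "off"
    action:
      - service: tts.speak
        target:
          entity_id: tts.piper
        data:
          media_player_entity_id: media_player.kitchen_speaker
          message: "The washing machine is done."
```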


Coqui TTS: The Expressive One With More Opinions

Coqui TTS is a different beast. It’s Python-based, heavier, slower, and significantly more capable in terms of voice quality and flexibility. Coqui supports multiple model architectures including VITS (the engine behind the multi-speaker VCTK model), YourTTS, and the impressive XTTS v2, which can clone a voice from a short audio clip.

Why Coqui is great:

  - Noticeably more natural, expressive voices
  - Voice cloning from a short audio clip (XTTS v2)
  - Multiple model architectures to pick from
  - Ships with a web UI and a REST API out of the box
  - The multi-speaker VCTK model alone gives you 100+ voices

What Coqui is not:

Fast. Or light. Running XTTS v2 without a GPU is a “go make a coffee” situation. On CPU, you’re looking at 10–30 seconds for a paragraph. On a Raspberry Pi, you’re measuring in geologic time. Coqui is best suited for a beefy NUC, a homelab server, or anything with a GPU gathering dust.

Also worth noting: Coqui TTS (the company) shut down in 2023. The original coqui-ai/TTS repo is no longer actively maintained; the open-source project continues in community forks such as idiap/coqui-ai-TTS, but the hosted SaaS is gone. The software lives on; just be aware of where you’re pulling from.

Running Coqui with Docker

docker run --rm -it \
  -p 5002:5002 \
  --entrypoint python3 \
  ghcr.io/coqui-ai/tts-cpu \
  TTS/server/server.py \
  --model_name tts_models/en/vctk/vits

This starts a web server with a built-in UI and REST API at http://localhost:5002.

Docker Compose for Coqui

services:
  coqui-tts:
    image: ghcr.io/coqui-ai/tts-cpu
    restart: unless-stopped
    ports:
      - "5002:5002"
    volumes:
      - ./tts-models:/root/.local/share/tts
    entrypoint: python3
    command: >
      TTS/server/server.py
      --model_name tts_models/en/vctk/vits

Using the API

Once it’s running, you can hit the API directly:

curl -G "http://localhost:5002/api/tts" \
  --data-urlencode "text=Hello from my homelab" \
  --data-urlencode "speaker_id=p230" \
  -o output.wav

The VCTK model has 100+ speaker IDs. Some are better than others. p230 is solid. p267 is good for narration. You’ll spend 20 minutes listening to them all — this is normal and expected.
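When curl gets tedious, the same endpoint is easy to script. A quick Python sketch against the server started above — the base URL and speaker IDs are carried over from the examples, so adjust to taste:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5002"  # the Coqui server from the compose file above

def tts_url(text, speaker_id="p230"):
    """Build the GET URL for the server's /api/tts endpoint."""
    return f"{BASE_URL}/api/tts?" + urlencode({"text": text, "speaker_id": speaker_id})

def synthesize(text, speaker_id="p230", out_path="output.wav"):
    """Fetch synthesized speech and write it out as a WAV file."""
    with urlopen(tts_url(text, speaker_id)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path

# Usage: synthesize("Hello from my homelab", speaker_id="p267")
```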


OpenedAI-Speech: The Drop-In OpenAI API Wrapper

If you’re using tools or apps that already know how to talk to OpenAI’s TTS API, there’s a project called openedai-speech that wraps either Piper or Coqui in an OpenAI-compatible /v1/audio/speech endpoint. Drop it in front of your local TTS and suddenly any OpenAI client just works, pointed at your server instead of theirs.

services:
  openedai-speech:
    image: ghcr.io/matatonic/openedai-speech
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - ./voices:/app/voices
      - ./config:/app/config
    environment:
      - TTS_HOME=/app/voices

This is particularly useful if you’re running local LLMs and want the full stack local — chat, reasoning, and speech output — without a single request leaving your network.
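To show what “OpenAI-compatible” buys you, here’s a sketch that builds a request in OpenAI’s /v1/audio/speech shape and points it at the local container. The voice name "alloy" is OpenAI’s, which openedai-speech maps to a local voice via its config; the port matches the compose file above:

```python
import json
from urllib.request import Request, urlopen

def speech_request(text, base_url="http://localhost:8000",
                   model="tts-1", voice="alloy"):
    """Build a POST request in OpenAI's /v1/audio/speech format."""
    payload = json.dumps({"model": model, "input": text, "voice": voice})
    return Request(
        f"{base_url}/v1/audio/speech",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def save_speech(text, path="announcement.mp3"):
    """POST the text and write the returned audio to a file."""
    with urlopen(speech_request(text)) as resp:
        with open(path, "wb") as f:
            f.write(resp.read())

# Usage: save_speech("Motion detected in the back garden.")
```

Because the endpoint matches OpenAI’s schema, the official openai Python client should also work if you point its base_url at your server instead of api.openai.com.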


Comparison Table

| Feature           | Piper                         | Coqui TTS                         |
|-------------------|-------------------------------|-----------------------------------|
| Language          | C++                           | Python                            |
| Speed (CPU)       | Very fast (real-time on Pi 4) | Slow to moderate                  |
| RAM Usage         | ~100–300MB                    | 1–4GB+                            |
| GPU Required      | No                            | No, but recommended for XTTS      |
| Voice Quality     | Good (clean neural TTS)       | Better (more natural)             |
| Voice Cloning     | No                            | Yes (XTTS v2)                     |
| Home Assistant    | Native Wyoming support        | Via API or custom component       |
| Model Size        | 40–130MB per voice            | 1–6GB per model                   |
| Docker Image Size | Small (~200MB)                | Large (~3–5GB)                    |
| Maintenance       | Active (Nabu Casa)            | Community fork                    |
| Best For          | Pi, HA, low-power devices     | Homelab server, expressive voices |

Use Cases: Which One Should You Actually Use?

Use Piper if:

  - You’re running Home Assistant and want native Wyoming integration
  - You’re on a Raspberry Pi or other low-power hardware
  - You need real-time announcements without a GPU

Use Coqui if:

  - Voice quality matters more than latency
  - You want voice cloning with XTTS v2
  - You have a homelab server (ideally with a GPU) to throw at it

Use both if:

You’re the kind of person reading a self-hosting blog on a Wednesday evening and you think “why not” is a complete sentence. Run Piper for fast real-time stuff, Coqui for anything where quality beats speed. Use openedai-speech to abstract the difference away. Call it an architecture.


Privacy: The Actual Point

Let’s circle back to why this matters beyond the nerd satisfaction of running your own stack.

When you use cloud TTS, every string of text you convert to audio goes to a third-party server. Your home automation logs, your accessibility reads, your drafts, your messages — all of it. Most providers say they don’t store it. Most providers also have terms of service that are 47 pages long and were last written by a lawyer whose primary goal was not your comfort.

Local TTS means your text stays on your hardware. Full stop. When your house says “Motion detected in the back garden at 2am,” that sentence doesn’t travel to a data center in Virginia first.

For most people, this probably doesn’t matter most of the time. But it’s your data, it’s your house, and running local TTS is not actually hard anymore. So why wouldn’t you?


Voice Model Tips

For Piper, the model naming convention tells you everything you need: en_US-lessac-medium breaks down as language (en_US), speaker name (lessac), and quality tier (x_low, low, medium, or high). Higher quality tiers sound better but use more CPU. Start with medium — it’s the sweet spot for most hardware. The full model list lives at rhasspy/piper-voices on GitHub and there are hundreds of them.
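That naming convention is regular enough to parse mechanically, which is handy if you’re scripting voice downloads. A small sketch:

```python
def parse_voice_name(name):
    """Split a Piper voice name like 'en_US-lessac-medium' into its parts.

    The language code comes first and the quality tier comes last, so we
    split from both ends; whatever remains in the middle is the speaker.
    """
    language, rest = name.split("-", 1)
    speaker, quality = rest.rsplit("-", 1)
    return {"language": language, "speaker": speaker, "quality": quality}
```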

For Coqui, the first time you specify a model name in the API or CLI, it downloads it automatically to ~/.local/share/tts. This is convenient but can catch you off-guard on a server with limited disk space. XTTS v2 in particular is around 2GB. Set a volume mount in your Docker compose so those downloads survive container restarts and don’t vanish when you pull an updated image.

One practical tip: both engines can output to a file rather than playing audio directly. That means you can pre-generate common phrases — “Good morning”, “Front door opened”, “Server room temperature critical” — and serve them as static audio files. Zero inference latency for your most common announcements, and your Raspberry Pi barely notices.
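Here’s what that pre-generation step might look like, assuming the standalone piper binary and a downloaded voice model are on the box — the phrase list and file-naming helper are just one way to do it:

```python
import subprocess
from pathlib import Path

PHRASES = [
    "Good morning",
    "Front door opened",
    "Server room temperature critical",
]

def slug(text):
    """Turn a phrase into a safe file name: 'Front door opened' -> 'front_door_opened'."""
    return "".join(c if c.isalnum() else "_" for c in text.lower()).strip("_")

def pregenerate(phrases, model="en_US-lessac-medium.onnx", out_dir="phrases"):
    """Render each phrase to a WAV exactly once, using the piper CLI on PATH."""
    Path(out_dir).mkdir(exist_ok=True)
    for text in phrases:
        out_file = Path(out_dir) / f"{slug(text)}.wav"
        subprocess.run(
            ["piper", "--model", model, "--output_file", str(out_file)],
            input=text.encode("utf-8"),
            check=True,
        )

# Usage: pregenerate(PHRASES)  # run once, then serve the WAVs as static files
```

Point your announcements at the resulting WAV files and the TTS engine only ever runs once per phrase.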


Getting Started Checklist

  1. Decide: low-power/HA (Piper) or high-quality/server (Coqui)
  2. Pull the relevant Docker image
  3. Download a voice model (Piper model library is at rhasspy/piper-voices, Coqui downloads on first run)
  4. Test with a simple curl or CLI command
  5. If using Home Assistant, add Wyoming integration (Piper) or configure a custom TTS component (Coqui)
  6. If you want OpenAI API compatibility, drop openedai-speech in front of it
  7. Never pay per character again

That’s genuinely it. Both projects have decent documentation, active communities, and work well in Docker. The barrier here is lower than you think.

Your house can talk. It doesn’t need to call home to do it.


Filed under: self-hosting, home assistant, tts, docker, linux, ai

