Your Voice, Your Server, Nobody Else’s Business
Let’s be real — most of us have talked at our devices enough that some corporation probably has a pretty solid voice impression of us by now. Every “Hey Siri”, every dictated message, every Zoom auto-transcript has gone somewhere, processed by someone else’s GPU, logged in someone else’s database.
And if you’re running a home lab, building automations, transcribing meetings, or just have strong opinions about where your data goes, that’s annoying.
The good news: OpenAI dropped Whisper as open source back in 2022 and it is genuinely excellent. The better news: a project called Faster-Whisper took it, ran it through a quantization blender, and made it roughly 4x faster while cutting VRAM usage significantly. You can now run high-quality speech-to-text locally, in Docker, on hardware you already own, without sending a single audio byte to the cloud.
Let’s dig in.
What Even Is Whisper?
Whisper is a speech recognition model from OpenAI. It was trained on 680,000 hours of multilingual audio scraped from the internet, which is a frankly absurd amount of talking. The result is a model that handles accents, background noise, technical jargon, and multiple languages surprisingly well — often better than commercial APIs that cost money per minute.
OpenAI released it under the MIT license, which means you can run it, modify it, embed it, ship it, whatever.
Model Sizes: Pick Your Poison
Whisper comes in several sizes. Each one is a trade-off between accuracy, speed, and how much RAM/VRAM you’re willing to throw at it.
| Model | Parameters | VRAM (approx) | Speed (relative) | Accuracy |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x realtime | Decent for simple audio |
| base | 74M | ~1 GB | ~16x realtime | Better, still fast |
| small | 244M | ~2 GB | ~6x realtime | Good general use |
| medium | 769M | ~5 GB | ~2x realtime | Very good |
| large-v2 | 1.5B | ~10 GB | ~1x realtime | Excellent |
| large-v3 | 1.5B | ~10 GB | ~1x realtime | Best accuracy |
“Realtime” here means relative to audio duration. 32x realtime = 1 minute of audio transcribed in about 2 seconds. On a GPU. On CPU it’s a lot slower — which is exactly where Faster-Whisper saves your sanity.
For most use cases, small or medium hit the sweet spot. tiny and base are great if you’re running on a Raspberry Pi or genuinely don’t care about the occasional “I heard purple monkey dishwasher” moments. large-v3 is for when accuracy matters more than speed and you have a decent GPU.
Enter Faster-Whisper: Same Whisper, Less Waiting
Faster-Whisper is a reimplementation of Whisper using CTranslate2, a fast inference engine for transformer models. It also supports INT8 quantization, which stores the model weights as 8-bit integers instead of 16- or 32-bit floats, shrinking the model without destroying accuracy too badly.
The practical result:
- ~4x faster than original Whisper on GPU
- ~2x faster on CPU
- Significantly less VRAM — run large-v3 on a 6 GB GPU without crying
- Lower memory footprint overall
It’s not magic — you do sacrifice a small amount of accuracy with INT8, but in most real-world audio it’s negligible. For anything short of courtroom transcription, you probably won’t notice.
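If you want to measure that trade-off on your own hardware, a quick sketch like this does the job (the model size, audio file, and device are placeholders; on a GPU you’d use device="cuda" with compute_type="float16" or "int8_float16"):

import time
from faster_whisper import WhisperModel

# Compare full precision vs INT8 on the same file (CPU example).
# "small" and sample.mp3 are placeholders -- swap in your own model and audio.
for compute_type in ("float32", "int8"):
    model = WhisperModel("small", device="cpu", compute_type=compute_type)
    start = time.perf_counter()
    segments, _ = model.transcribe("sample.mp3")
    text = " ".join(s.text for s in segments)  # consume the generator
    elapsed = time.perf_counter() - start
    print(f"{compute_type}: {elapsed:.1f}s, {len(text)} characters")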
Docker Setup: Getting Both Running
Original Whisper
There’s no single “official” Whisper Docker image, but this works well:
# docker-compose.yml for vanilla Whisper
version: "3.8"
services:
  whisper:
    # use the :latest-gpu tag if you want CUDA support
    image: onerahmet/openai-whisper-asr-webservice:latest
    ports:
      - "9000:9000"
    environment:
      - ASR_MODEL=small
      - ASR_ENGINE=openai_whisper
    volumes:
      - ./models:/root/.cache/whisper
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
docker compose up -d
This spins up a REST API on port 9000. First run downloads the model (a few hundred MB to a few GB depending on size), subsequent runs load from the volume cache.
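Quick smoke test from Python (a sketch: the /asr endpoint and the audio_file form field are what the webservice exposes as far as I recall, but if your image version differs, the service publishes its own Swagger UI at http://localhost:9000/docs):

import requests

# Hypothetical smoke test against the ASR webservice started above.
# Endpoint and field names (/asr, audio_file) may vary by image version;
# confirm them in the Swagger UI at /docs if this request 404s.
with open("meeting_recording.mp3", "rb") as f:
    response = requests.post(
        "http://localhost:9000/asr",
        params={"task": "transcribe", "output": "json"},
        files={"audio_file": f},
    )
response.raise_for_status()
print(response.json())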
Faster-Whisper
# docker-compose.yml for Faster-Whisper
version: "3.8"
services:
  faster-whisper:
    image: federicotorrielli/faster-whisper:latest
    ports:
      - "9001:8000"
    environment:
      - MODEL=medium
      - QUANTIZATION=int8
      - DEVICE=cpu  # or "cuda" for GPU
    volumes:
      - ./fw-models:/models
    restart: unless-stopped
Or if you want to build your own minimal container:
FROM python:3.11-slim
RUN pip install faster-whisper
COPY transcribe.py /app/transcribe.py
WORKDIR /app
ENTRYPOINT ["python", "transcribe.py"]
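The transcribe.py it copies in can be a handful of lines. A minimal sketch (the model size and int8-on-CPU settings are just example choices, and the model downloads inside the container on first run):

# transcribe.py -- transcribe the audio file passed as the first argument
import sys
from faster_whisper import WhisperModel

# int8 on CPU keeps the container usable without a GPU
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe(sys.argv[1])
print(f"Detected language: {info.language}")
for segment in segments:
    print(segment.text.strip())

Then something like docker run --rm -v $(pwd):/audio your-image /audio/recording.mp3 gets you a transcript on stdout.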
Python Usage: Actually Using the Thing
Vanilla Whisper
import whisper
# Load the model (downloads on first run)
model = whisper.load_model("small")
# Transcribe a file
result = model.transcribe("meeting_recording.mp3")
print(result["text"])
# With language hint (speeds things up if you know the language)
result = model.transcribe("podcast.mp3", language="en")
# Get word-level timestamps
result = model.transcribe("interview.wav", word_timestamps=True)
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s] {segment['text']}")
Faster-Whisper
from faster_whisper import WhisperModel
# Load with INT8 quantization on CPU
model = WhisperModel("medium", device="cpu", compute_type="int8")
# Or on GPU with float16
# model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("meeting_recording.mp3", beam_size=5)
print(f"Detected language: {info.language} ({info.language_probability:.0%} confidence)")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
The Faster-Whisper API returns a generator, so it streams segments as they’re processed rather than waiting for the whole file. Useful for longer recordings.
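That also makes it easy to show live progress on long files. A minimal sketch, assuming the model from the snippet above is already loaded:

# Segments arrive as they're decoded, so progress can be printed immediately.
# long_recording.mp3 is a placeholder file name.
segments, info = model.transcribe("long_recording.mp3")
for segment in segments:
    percent_done = segment.end / info.duration * 100
    print(f"{percent_done:5.1f}%  {segment.text.strip()}")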
Batch Transcription
Got a folder full of recordings? Here’s a quick batch script:
from faster_whisper import WhisperModel
from pathlib import Path
import json
model = WhisperModel("small", device="cpu", compute_type="int8")
audio_dir = Path("./recordings")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)
# pathlib's glob() doesn't do brace expansion, so filter by suffix instead
audio_extensions = {".mp3", ".wav", ".m4a", ".ogg", ".flac"}

for audio_file in sorted(audio_dir.iterdir()):
    if audio_file.suffix.lower() not in audio_extensions:
        continue
    print(f"Transcribing: {audio_file.name}")
    segments, info = model.transcribe(str(audio_file))
    transcript = {
        "file": audio_file.name,
        "language": info.language,
        "segments": [
            {
                "start": s.start,
                "end": s.end,
                "text": s.text.strip()
            }
            for s in segments
        ]
    }
    output_file = output_dir / f"{audio_file.stem}.json"
    output_file.write_text(json.dumps(transcript, indent=2))
    print(f"  Saved to {output_file}")
Run it overnight on your podcast archive. Wake up to a searchable text corpus. Feel smug.
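And since the point of that corpus is searching it, here’s a sketch of loading those JSON files into a SQLite full-text index (table and column names are made up for the example; it assumes your Python’s SQLite has the FTS5 extension, which stock builds generally do):

import json
import sqlite3
from pathlib import Path

db = sqlite3.connect("transcripts.db")
db.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS segments "
    "USING fts5(file, text, start UNINDEXED)"
)

# Load every transcript produced by the batch script above
for transcript_file in Path("./transcripts").glob("*.json"):
    transcript = json.loads(transcript_file.read_text())
    db.executemany(
        "INSERT INTO segments (file, text, start) VALUES (?, ?, ?)",
        [
            (transcript["file"], seg["text"], seg["start"])
            for seg in transcript["segments"]
        ],
    )
db.commit()

# Full-text search across everything, with the timestamp to jump to
for file, text, start in db.execute(
    "SELECT file, text, start FROM segments WHERE segments MATCH ? LIMIT 10",
    ("kubernetes",),
):
    print(f"{file} @ {float(start):.0f}s: {text}")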
OpenedAI-Speech: Drop-In OpenAI API Compatibility
Here’s a fun one. OpenedAI-Speech is a project that wraps Whisper (and Faster-Whisper) behind an OpenAI-compatible REST API. That means any tool that already speaks to OpenAI’s transcription endpoint can be pointed at your local server instead. Zero code changes on the client side.
# docker-compose.yml
version: "3.8"
services:
  openedai-speech:
    image: ghcr.io/matatonic/openedai-speech:latest
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    environment:
      - WHISPER_MODEL=faster-whisper/medium
    restart: unless-stopped
Then in any code that uses the OpenAI Python SDK:
from openai import OpenAI
# Point it at your local server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-but-required-by-sdk"
)

with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)
This is extremely useful if you’re integrating with tools like n8n, home automation workflows, or any app that supports OpenAI audio transcription. Point the URL somewhere else, keep everything else identical.
Speed & Accuracy: The Actual Comparison
Here’s a practical comparison transcribing a 10-minute audio file on a mix of hardware, mostly with the medium model:
| Setup | Time | VRAM Used | CPU RAM | Accuracy (WER) |
|---|---|---|---|---|
| Whisper medium, GPU (RTX 3060) | ~45s | ~5.5 GB | ~2 GB | ~8% WER |
| Whisper medium, CPU only | ~12 min | — | ~4 GB | ~8% WER |
| Faster-Whisper medium, GPU (fp16) | ~12s | ~3 GB | ~1.5 GB | ~8.5% WER |
| Faster-Whisper medium, CPU (int8) | ~5 min | — | ~2 GB | ~9% WER |
| Faster-Whisper large-v3, GPU (fp16) | ~30s | ~5.5 GB | ~2 GB | ~6% WER |
WER = Word Error Rate. Lower is better. The int8 quantization adds roughly 0.5-1% WER in practice — completely imperceptible unless you’re doing precision transcription work.
The headline: Faster-Whisper medium on GPU is ~4x faster than vanilla Whisper and uses significantly less VRAM. On CPU, it’s roughly 2-2.5x faster. For CPU-only setups, this is a big deal — it’s the difference between “this is actually usable” and “I’ll go make a sandwich and check back in 20 minutes.”
What Would You Actually Use This For?
Glad you asked. Here’s where self-hosted STT earns its keep:
Meeting transcription — Record your Zoom/Teams calls locally, run them through Faster-Whisper overnight, get searchable text. No cloud, no subscription, no “your free tier ran out” emails.
Podcast indexing — Got a collection of podcasts you want to search? Batch transcribe the whole thing, throw the output into a SQLite database, search with full text. This took me an afternoon to set up and it’s genuinely one of the more useful things in my homelab.
Home automation — Pair with a local wake word detector and you’ve got a fully offline voice assistant pipeline. Faster-Whisper on a capable mini PC can handle short commands in 1-2 seconds even on CPU.
Content creation — Transcribe your own recordings, videos, voice memos. Export to whatever format you need. No manual transcription service, no subscription.
Subtitles — Faster-Whisper’s segment timestamps are good enough to generate SRT files directly. There are wrapper tools that do exactly this.
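If you’d rather roll your own, the mapping from segments to SRT is only a few lines. A sketch, with placeholder file names:

from faster_whisper import WhisperModel

def srt_timestamp(seconds: float) -> str:
    # SRT wants HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = model.transcribe("talk.mp3")

with open("talk.srt", "w", encoding="utf-8") as srt:
    for index, segment in enumerate(segments, start=1):
        srt.write(f"{index}\n")
        srt.write(f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n")
        srt.write(f"{segment.text.strip()}\n\n")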
Picking Your Setup
- Just want to try it fast: Use the onerahmet/openai-whisper-asr-webservice Docker image with the small model. Up in 5 minutes.
- CPU-only server: Faster-Whisper with int8 quantization and the medium model. Best speed/accuracy trade-off without a GPU.
- GPU available (6GB+ VRAM): Faster-Whisper large-v3 with float16. Best accuracy you can get locally.
- Need OpenAI API compatibility: OpenedAI-Speech with a Faster-Whisper backend. Drop-in for anything already using OpenAI’s API.
- Tiny/embedded device: tiny or base model, expect some errors, manage expectations accordingly.
Speech-to-text used to be one of those capabilities where you grudgingly accepted cloud dependence because the local alternatives were garbage. Whisper changed that. Faster-Whisper made it genuinely practical even on modest hardware.
Your audio, your server, your transcripts. That’s the whole pitch.
Now go transcribe something. You’ve been meaning to index those old meeting recordings for six months anyway.