SumGuy's Ramblings

Whisper & Faster-Whisper: Self-Hosted Speech-to-Text That Actually Works

Your Voice, Your Server, Nobody Else’s Business

Let’s be real — most of us have talked at our devices enough that some corporation probably has a pretty solid voice impression of us by now. Every “Hey Siri”, every dictated message, every Zoom auto-transcript has gone somewhere, processed by someone else’s GPU, logged in someone else’s database.

And if you’re running a home lab, building automations, transcribing meetings, or just have strong opinions about where your data goes, that’s annoying.

The good news: OpenAI dropped Whisper as open source back in 2022 and it is genuinely excellent. The better news: a project called Faster-Whisper took it, ran it through a quantization blender, and made it roughly 4x faster while cutting VRAM usage significantly. You can now run high-quality speech-to-text locally, in Docker, on hardware you already own, without sending a single audio byte to the cloud.

Let’s dig in.


What Even Is Whisper?

Whisper is a speech recognition model from OpenAI. It was trained on 680,000 hours of multilingual audio scraped from the internet, which is a frankly absurd amount of talking. The result is a model that handles accents, background noise, technical jargon, and multiple languages surprisingly well — often better than commercial APIs that cost money per minute.

OpenAI released it under the MIT license, which means you can run it, modify it, embed it, ship it, whatever.

Model Sizes: Pick Your Poison

Whisper comes in several sizes. Each one is a trade-off between accuracy, speed, and how much RAM/VRAM you’re willing to throw at it.

Model | Parameters | VRAM (approx) | Speed (relative) | Accuracy
--- | --- | --- | --- | ---
tiny | 39M | ~1 GB | ~32x realtime | Decent for simple audio
base | 74M | ~1 GB | ~16x realtime | Better, still fast
small | 244M | ~2 GB | ~6x realtime | Good general use
medium | 769M | ~5 GB | ~2x realtime | Very good
large-v2 | 1.5B | ~10 GB | ~1x realtime | Excellent
large-v3 | 1.5B | ~10 GB | ~1x realtime | Best accuracy

“Realtime” here means relative to audio duration. 32x realtime = 1 minute of audio transcribed in about 2 seconds. On a GPU. On CPU it’s a lot slower — which is exactly where Faster-Whisper saves your sanity.
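That arithmetic is worth making concrete. A trivial helper (the function name is mine, not from any library):

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimated processing time: audio duration divided by the realtime factor."""
    return audio_seconds / realtime_factor

# 1 minute of audio at ~32x realtime: just under 2 seconds
print(transcription_seconds(60, 32))  # 1.875

# the same minute at ~1x realtime (the large models): about a minute
print(transcription_seconds(60, 1))  # 60.0
```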

For most use cases, small or medium hit the sweet spot. tiny and base are great if you’re running on a Raspberry Pi or genuinely don’t care about the occasional “I heard purple monkey dishwasher” moments. large-v3 is for when accuracy matters more than speed and you have a decent GPU.


Enter Faster-Whisper: Same Whisper, Less Waiting

Faster-Whisper is a reimplementation of Whisper built on CTranslate2, a fast inference engine for Transformer models. It also supports INT8 quantization, which stores model weights as 8-bit integers instead of 16- or 32-bit floats, shrinking memory use without hurting accuracy much.

The practical result: roughly 4x the transcription speed of vanilla Whisper for the same model size, significantly less VRAM and RAM, and the same model lineup with the same output quality at float16.

It’s not magic — you do sacrifice a small amount of accuracy with INT8, but in most real-world audio it’s negligible. For anything short of courtroom transcription, you probably won’t notice.


Docker Setup: Getting Both Running

Original Whisper

There’s no single “official” Whisper Docker image, but this works well:

# docker-compose.yml for vanilla Whisper
version: "3.8"
services:
  whisper:
    image: onerahmet/openai-whisper-asr-webservice:latest
    ports:
      - "9000:9000"
    environment:
      - ASR_MODEL=small
      - ASR_ENGINE=openai_whisper
    volumes:
      - ./models:/root/.cache/whisper
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Start it with:

docker compose up -d

This spins up a REST API on port 9000. First run downloads the model (a few hundred MB to a few GB depending on size), subsequent runs load from the volume cache.
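Once it's up, anything that can send a multipart request can use it. A sketch in Python, assuming the webservice's /asr endpoint with its audio_file form field and task/language/output query parameters (check the image's docs if your version differs):

```python
import requests

ASR_URL = "http://localhost:9000/asr"  # assumed endpoint of the webservice above

def asr_params(task="transcribe", language=None, output="json"):
    """Build the query parameters the webservice expects; language is optional."""
    params = {"task": task, "output": output}
    if language:
        params["language"] = language
    return params

def transcribe_via_api(path, **kwargs):
    """POST an audio file to the webservice and return the raw response body."""
    with open(path, "rb") as f:
        resp = requests.post(ASR_URL, params=asr_params(**kwargs), files={"audio_file": f})
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print(transcribe_via_api("meeting_recording.mp3", language="en"))
```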

Faster-Whisper

# docker-compose.yml for Faster-Whisper
version: "3.8"
services:
  faster-whisper:
    image: federicotorrielli/faster-whisper:latest
    ports:
      - "9001:8000"
    environment:
      - MODEL=medium
      - QUANTIZATION=int8
      - DEVICE=cpu  # or "cuda" for GPU
    volumes:
      - ./fw-models:/models
    restart: unless-stopped

Or if you want to build your own minimal container:

FROM python:3.11-slim

RUN pip install faster-whisper

COPY transcribe.py /app/transcribe.py
WORKDIR /app

ENTRYPOINT ["python", "transcribe.py"]
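That Dockerfile expects a transcribe.py next to it. A minimal sketch of what that entrypoint could look like (this file is my own, not something shipped with faster-whisper):

```python
# transcribe.py -- minimal CLI entrypoint for the container above
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Transcribe audio with faster-whisper")
    parser.add_argument("audio", help="path to the audio file")
    parser.add_argument("--model", default="small", help="Whisper model size")
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    parser.add_argument("--compute-type", default="int8", help="e.g. int8, float16")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    # imported here so --help works even where the package isn't installed
    from faster_whisper import WhisperModel
    model = WhisperModel(args.model, device=args.device, compute_type=args.compute_type)
    segments, info = model.transcribe(args.audio)
    for segment in segments:
        print(segment.text.strip())

if __name__ == "__main__":
    main()
```

Mount your audio directory into the container and pass the file path as the argument.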

Python Usage: Actually Using the Thing

Vanilla Whisper

import whisper

# Load the model (downloads on first run)
model = whisper.load_model("small")

# Transcribe a file
result = model.transcribe("meeting_recording.mp3")
print(result["text"])

# With language hint (speeds things up if you know the language)
result = model.transcribe("podcast.mp3", language="en")

# Get word-level timestamps
result = model.transcribe("interview.wav", word_timestamps=True)
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s] {segment['text']}")

Faster-Whisper

from faster_whisper import WhisperModel

# Load with INT8 quantization on CPU
model = WhisperModel("medium", device="cpu", compute_type="int8")

# Or on GPU with float16
# model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("meeting_recording.mp3", beam_size=5)

print(f"Detected language: {info.language} ({info.language_probability:.0%} confidence)")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

The Faster-Whisper API returns a generator, so it streams segments as they’re processed rather than waiting for the whole file. Useful for longer recordings.

Batch Transcription

Got a folder full of recordings? Here’s a quick batch script:

from faster_whisper import WhisperModel
from pathlib import Path
import json

model = WhisperModel("small", device="cpu", compute_type="int8")

audio_dir = Path("./recordings")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

audio_exts = {".mp3", ".wav", ".m4a", ".ogg", ".flac"}
# pathlib's glob doesn't support brace patterns like *.{mp3,wav}, so filter by suffix
for audio_file in sorted(p for p in audio_dir.iterdir() if p.suffix.lower() in audio_exts):
    print(f"Transcribing: {audio_file.name}")
    segments, info = model.transcribe(str(audio_file))
    
    transcript = {
        "file": audio_file.name,
        "language": info.language,
        "segments": [
            {
                "start": s.start,
                "end": s.end,
                "text": s.text.strip()
            }
            for s in segments
        ]
    }
    
    output_file = output_dir / f"{audio_file.stem}.json"
    output_file.write_text(json.dumps(transcript, indent=2))
    print(f"  Saved to {output_file}")

Run it overnight on your podcast archive. Wake up to a searchable text corpus. Feel smug.
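Once the transcripts are JSON, "searchable" is one short script away. A sketch, assuming the JSON layout the batch script above produces (the function names are mine):

```python
import json
from pathlib import Path

def search_segments(transcript: dict, query: str):
    """Return (file, start, text) for every segment containing the query, case-insensitively."""
    q = query.lower()
    return [
        (transcript["file"], seg["start"], seg["text"])
        for seg in transcript["segments"]
        if q in seg["text"].lower()
    ]

def search_directory(transcript_dir: str, query: str):
    """Search every .json transcript in a directory."""
    hits = []
    for path in sorted(Path(transcript_dir).glob("*.json")):
        hits.extend(search_segments(json.loads(path.read_text()), query))
    return hits
```

A linear scan like this is fine for hundreds of files; swap it for SQLite full-text search when the archive gets big.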


OpenedAI-Whisper: Drop-In OpenAI API Compatibility

Here's a fun one. OpenedAI-Whisper, a sibling of the matatonic OpenedAI-Speech project (which covers the text-to-speech direction), wraps Whisper behind an OpenAI-compatible REST API. That means any tool that already speaks to OpenAI's transcription endpoint can be pointed at your local server instead. Zero code changes on the client side.

# docker-compose.yml
version: "3.8"
services:
  openedai-whisper:
    image: ghcr.io/matatonic/openedai-whisper:latest
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    restart: unless-stopped

Then in any code that uses the OpenAI Python SDK:

from openai import OpenAI

# Point it at your local server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-but-required-by-sdk"
)

with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)

This is extremely useful if you’re integrating with tools like n8n, home automation workflows, or any app that supports OpenAI audio transcription. Point the URL somewhere else, keep everything else identical.


Speed & Accuracy: The Actual Comparison

Here’s a practical comparison running medium model on a mix of hardware, transcribing a 10-minute audio file:

Setup | Time | VRAM Used | CPU RAM | Accuracy (WER)
--- | --- | --- | --- | ---
Whisper medium, GPU (RTX 3060) | ~45s | ~5.5 GB | ~2 GB | ~8%
Whisper medium, CPU only | ~12 min | n/a | ~4 GB | ~8%
Faster-Whisper medium, GPU (fp16) | ~12s | ~3 GB | ~1.5 GB | ~8.5%
Faster-Whisper medium, CPU (int8) | ~5 min | n/a | ~2 GB | ~9%
Faster-Whisper large-v3, GPU (fp16) | ~30s | ~5.5 GB | ~2 GB | ~6%

WER = Word Error Rate. Lower is better. The int8 quantization adds roughly 0.5-1% WER in practice — completely imperceptible unless you’re doing precision transcription work.
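For reference, WER is just word-level edit distance divided by the reference length. A minimal sketch if you want to measure it on your own audio (standard dynamic-programming Levenshtein, nothing Whisper-specific):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Real evaluations normalize casing and punctuation before scoring; this sketch skips that step.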

The headline: Faster-Whisper medium on GPU is ~4x faster than vanilla Whisper and uses significantly less VRAM. On CPU, it’s roughly 2-2.5x faster. For CPU-only setups, this is a big deal — it’s the difference between “this is actually usable” and “I’ll go make a sandwich and check back in 20 minutes.”


What Would You Actually Use This For?

Glad you asked. Here’s where self-hosted STT earns its keep:

Meeting transcription — Record your Zoom/Teams calls locally, run them through Faster-Whisper overnight, get searchable text. No cloud, no subscription, no “your free tier ran out” emails.

Podcast indexing — Got a collection of podcasts you want to search? Batch transcribe the whole thing, throw the output into a SQLite database, search with full text. This took me an afternoon to set up and it’s genuinely one of the more useful things in my homelab.

Home automation — Pair with a local wake word detector and you’ve got a fully offline voice assistant pipeline. Faster-Whisper on a capable mini PC can handle short commands in 1-2 seconds even on CPU.

Content creation — Transcribe your own recordings, videos, voice memos. Export to whatever format you need. No manual transcription service, no subscription.

Subtitles — Faster-Whisper’s segment timestamps are good enough to generate SRT files directly. There are wrapper tools that do exactly this.
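Generating SRT from those timestamps is mostly string formatting. A sketch, assuming (start, end, text) tuples like the segments faster-whisper yields (pass [(s.start, s.end, s.text) for s in segments]):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render an iterable of (start, end, text) as the body of an .srt file."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n")
    return "\n".join(blocks)
```

Write the returned string to recording.srt and most video players will pick it up.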


Picking Your Setup

A quick decision guide, pulling together the numbers above:

Raspberry Pi or low-power box: Faster-Whisper tiny or base with int8. Good enough for short commands and simple audio.

CPU-only server or mini PC: Faster-Whisper small or medium with int8. The sweet spot for most transcription jobs.

GPU with 4-6 GB of VRAM: Faster-Whisper medium with float16. Fast and accurate for almost everything.

GPU with 8+ GB of VRAM: Faster-Whisper large-v3 with float16, for when accuracy matters most.

Speech-to-text used to be one of those capabilities where you grudgingly accepted cloud dependence because the local alternatives were garbage. Whisper changed that. Faster-Whisper made it genuinely practical even on modest hardware.

Your audio, your server, your transcripts. That’s the whole pitch.

Now go transcribe something. You’ve been meaning to index those old meeting recordings for six months anyway.

