The Myth of Q4_K_M
Everyone says Q4_K_M is the sweet spot: “4-bit, works great, barely any quality loss.” It’s the default recommendation, the safe choice. But here’s the truth: it’s optimal for most people’s hardware and most use cases. Not all.
If you’re always running Q4_K_M without testing alternatives, you’re leaving performance or quality on the table.
Understanding Quantization Notation
Quantization compresses a model by reducing the precision of its weights.
- Q3_K_M = 3-bit quantization, medium variant
- Q4_K_M = 4-bit quantization, medium variant (most common)
- Q5_K_M = 5-bit quantization, medium variant
- Q6_K = 6-bit quantization
- Q8 = 8-bit quantization (nearly full precision)
- F16 = 16-bit floating point (full precision)

The difference is file size and VRAM usage:
Llama 2 7B model sizes:

- F16 (full): ~13 GB
- Q8: ~7 GB
- Q6_K: ~6 GB
- Q5_K_M: ~5 GB
- Q4_K_M: ~4 GB
- Q3_K_M: ~3 GB

Lower bit depth = smaller model = lower VRAM = faster inference (usually). But quality degrades.
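Those sizes follow from simple arithmetic: file size scales roughly linearly with bits per weight. Here's a back-of-envelope sketch; the effective bits-per-weight values are rough assumptions (K-quants mix bit widths across tensors, so real files vary):

```python
# Back-of-envelope GGUF size from parameter count and effective bits per
# weight. The bpw values below are rough assumptions, not format specs.
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for name, bpw in [("F16", 16.0), ("Q8", 8.5), ("Q6_K", 6.6),
                  ("Q5_K_M", 5.7), ("Q4_K_M", 4.9), ("Q3_K_M", 3.9)]:
    print(f"{name}: ~{approx_size_gb(6.7e9, bpw):.1f} GB")
```

Plugging in Llama 2 7B's ~6.7B parameters reproduces the table above to within a few hundred megabytes.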
When Q3 Actually Makes Sense
Q3 is extreme compression, and it shows. Hallucinations increase, reasoning gets flaky. Avoid it for tasks requiring nuance.
But if you’re running a small model for classification or simple extraction, Q3 is legitimately useful:
```python
# Example: Q3 works fine for intent detection
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'neural-chat:latest',
    'prompt': 'Classify as "support", "sales", or "billing": "My invoice is wrong"',
    'stream': False,  # single JSON response instead of a token stream
    'options': {'num_predict': 20},  # sampler params go in "options"
})
print(response.json()['response'])

# Output: "billing" ✓
# Q3 handles this well. No need for Q5.
```

When Q3 is reasonable:
- Intent classification (support, sales, billing, etc.)
- Named entity extraction from structured text
- Simple category prediction
- High-volume inference on limited hardware
When Q3 fails:
- Long-form reasoning
- Code generation
- Creative writing
- Anything requiring coherent thought over 100+ tokens
When Q5 or Q6 Actually Wins
Q5 sits between Q4 and Q6. It’s a good middle ground if you have the VRAM.
```
# VRAM usage comparison
# Running mistral 7B:
#   Q4_K_M: ~6.5 GB
#   Q5_K_M: ~7.5 GB
#   Q6_K:   ~8.5 GB
#
# On a 16GB GPU, Q5_K_M + some headroom = solid
```

Q5 is worth it if:
- You have the VRAM (12GB+)
- You’re doing code generation (quality matters more)
- You need long-context accuracy (hallucinations compound)
- You want one model for everything (jack of all trades)
Q6 is overkill for most self-hosted scenarios. You’re paying heavily in VRAM for barely-noticeable quality gains over Q5.
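One way to make "you have the VRAM" concrete is a rough fit check: model weights plus KV cache plus runtime overhead must stay under your card's total. The constants below are illustrative assumptions (≈0.5 GB of fp16 KV cache per 1K tokens for a 7B Llama-style model, 1 GB overhead), not measurements:

```python
# Rough VRAM fit check. kv_gb_per_1k and overhead_gb are assumed values
# for a 7B Llama-style model with an fp16 KV cache; adjust for your setup.
def fits_in_vram(vram_gb, model_gb, ctx_tokens=4096,
                 kv_gb_per_1k=0.5, overhead_gb=1.0):
    needed = model_gb + (ctx_tokens / 1000) * kv_gb_per_1k + overhead_gb
    return needed <= vram_gb

print(fits_in_vram(16, 7.5))  # Q5_K_M-sized model on a 16 GB card -> True
print(fits_in_vram(8, 7.5))   # same model on an 8 GB card -> False
```

Note the context-length term: doubling the context eats real VRAM, which is why long-context work pushes you toward a smaller quant or a bigger card.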
Testing on Your Hardware
Don’t guess. Benchmark.
```bash
#!/bin/bash
# Test Q4 vs Q5 on your hardware

models=("mistral:7b-instruct-q4_k_m" "mistral:7b-instruct-q5_k_m")
prompt="Explain how OAuth 2.0 works in 100 tokens."

for model in "${models[@]}"; do
  echo "Testing $model..."

  # First request (includes model load); stream=false returns one JSON object
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false, \"options\": {\"num_predict\": 100}}" | \
    python3 -c "import sys, json; data=json.load(sys.stdin); print(f'First load: {data.get(\"load_duration\", 0)/1e9:.2f}s')"

  # Subsequent requests (model stays loaded)
  for i in {1..3}; do
    curl -s http://localhost:11434/api/generate \
      -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false, \"options\": {\"num_predict\": 100}}" | \
      python3 -c "import sys, json; data=json.load(sys.stdin); print(f'Request $i: {data.get(\"eval_duration\", 0)/1e9:.2f}s')"
  done
  echo ""
done
```

Watch for:
- Load time (first request; includes loading the model into VRAM)
- Eval time (how long to generate tokens)
- Quality (does output make sense?)
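Those same timing fields give you a throughput number directly. Ollama's non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds), so tokens/sec is one division:

```python
# Tokens/sec from Ollama's response timing fields (durations are in ns).
def tokens_per_sec(response: dict) -> float:
    return response['eval_count'] / (response['eval_duration'] / 1e9)

# Illustrative numbers, not a real measurement:
sample = {'eval_count': 100, 'eval_duration': 2_000_000_000}  # 100 tokens in 2 s
print(tokens_per_sec(sample))  # 50.0
```

Compare that figure across quantizations rather than raw request latency; it normalizes away differences in how many tokens each run actually produced.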
The Real Trade-off: Batch Size vs. Quantization
Here’s what nobody mentions: you can often get better throughput by staying at Q4 and increasing batch size, rather than dropping to Q3.
Scenario: you're serving API requests to 50 concurrent users.

Option A: Q3, batch size 1
- VRAM: 3 GB
- Throughput: 50 sequential requests (slow)

Option B: Q4_K_M, batch size 8
- VRAM: 5 GB (model) + 2 GB (batch) = 7 GB
- Throughput: 8 requests in parallel (much faster)
- Quality: noticeably better

If you have 8GB VRAM or more, batch inference at Q4 beats solo Q3 in almost every metric.
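The arithmetic behind that comparison is easy to sketch. The latencies below are made-up illustrative numbers (plug in your own benchmark results); the point is that batching amortizes compute, so a slightly slower batch can still crush sequential serving:

```python
import math

# Toy throughput model: wall-clock time to serve n_requests when you can
# process batch_size of them per pass. Latencies are assumed, not measured.
def total_time(n_requests: int, batch_size: int, time_per_batch: float) -> float:
    return math.ceil(n_requests / batch_size) * time_per_batch

q3_sequential = total_time(50, batch_size=1, time_per_batch=2.0)  # Q3, one at a time
q4_batched = total_time(50, batch_size=8, time_per_batch=2.5)     # Q4, 8 at a time
print(q3_sequential, q4_batched)  # 100.0 17.5
```

Even granting Q3 a faster per-request time, the batched Q4 configuration finishes the queue several times sooner.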
The Honest Recommendation
- Limited VRAM (4GB)? Q3 for simple tasks, Q4 for general use.
- Comfortable VRAM (8GB)? Q4_K_M, test Q5 if you care about quality.
- Plenty of VRAM (16GB+)? Q5_K_M is your default. Q6 only if you’re hitting quality issues.
- Running inference at scale? Don’t optimize quantization alone—optimize batch size + quantization together.
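The rules of thumb above collapse into a tiny helper. The thresholds are this article's heuristics, not hard limits, and the hypothetical `recommend_quant` function is just a starting point before you benchmark:

```python
# Heuristic starting-point picker; thresholds mirror the rules of thumb
# above and are not hard limits. Always benchmark on your own workload.
def recommend_quant(vram_gb: float, quality_critical: bool = False) -> str:
    if vram_gb < 8:
        return "Q3_K_M for simple tasks, Q4_K_M for general use"
    if vram_gb < 16:
        return "Q5_K_M" if quality_critical else "Q4_K_M"
    return "Q6_K" if quality_critical else "Q5_K_M"

print(recommend_quant(8))    # Q4_K_M
print(recommend_quant(24))   # Q5_K_M
```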
Test on your hardware with your workload. The benchmark that matters is yours, not someone’s blog post.
How to Check a Model’s Current Quantization
```bash
# Ollama: check model details
ollama show mistral:latest --modelfile | grep FROM

# llama.cpp: check file info
./llama-cli --info --model ./mistral-7b-q4_k_m.gguf

# Just from the filename — Q4_K_M.gguf means Q4_K_M quantization
ls -lah *.gguf
```

The GGUF filename usually tells you everything you need: mistral-7b-instruct-v0.2.Q4_K_M.gguf is a 7B Mistral model, instruction-tuned, Q4_K_M quantization.
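If you want that filename check programmatically, a small regex handles the common naming convention (convention only; nothing in the GGUF format guarantees a file is named this way):

```python
import re

# Extract the quantization tag from a conventionally named GGUF file.
# Returns None when the filename doesn't follow the usual pattern.
def quant_from_filename(name: str):
    m = re.search(r'[.\-_](Q\d(?:_[0-9K](?:_[SML])?)?|F16|F32)\.gguf$',
                  name, re.IGNORECASE)
    return m.group(1).upper() if m else None

print(quant_from_filename("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))  # Q4_K_M
print(quant_from_filename("llama-2-7b.Q8_0.gguf"))                  # Q8_0
```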