
LLM Quantization: Q4_K_M Isn't Always the Best Choice

By SumGuy · 4 min read

The Myth of Q4_K_M

Everyone says Q4_K_M is the sweet spot: “4-bit, works great, barely any quality loss.” It’s the default recommendation, the safe choice. But here’s the truth: it’s optimal for most people’s hardware and most use cases. Not all.

If you’re always running Q4_K_M without testing alternatives, you’re leaving performance or quality on the table.

Understanding Quantization Notation

Quantization compresses a model by reducing the precision of its weights.

Q3_K_M = 3-bit quantization, medium variant
Q4_K_M = 4-bit quantization, medium variant (most common)
Q5_K_M = 5-bit quantization, medium variant
Q6_K = 6-bit quantization
Q8 = 8-bit quantization (nearly full precision)
F16 = 16-bit floating point (full precision)

The difference is file size and VRAM usage:

Llama 2 7B model sizes:
F16 (full): ~13 GB
Q8: ~7 GB
Q6_K: ~6 GB
Q5_K_M: ~5 GB
Q4_K_M: ~4 GB
Q3_K_M: ~3 GB
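These sizes follow from simple arithmetic: file size ≈ parameter count × bits per weight / 8, plus a little overhead for block scales. A back-of-the-envelope sketch (`estimate_size_gb` is a hypothetical helper, and ~4.5 effective bits per weight for Q4_K_M is an approximation):

```python
# Rough GGUF size estimate: params * bits / 8. K-quants also store per-block
# scales, so real files run slightly larger than this.
def estimate_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(estimate_size_gb(7, 4.5), 1))   # Q4_K_M averages ~4.5 bits/weight -> 3.9
print(round(estimate_size_gb(7, 16.0), 1))  # F16 -> 14.0, close to the ~13 GB above
```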

Lower bit depth = smaller model = lower VRAM = faster inference (usually). But quality degrades.
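You can see the quality cliff in miniature with a toy block quantizer. This sketch (my own illustration, not GGUF's actual K-quant math) rounds weights to a signed n-bit grid with one scale per block and measures the round-trip error:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=256).astype(np.float32)  # one toy block of weights

def quantize_roundtrip(w, bits):
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 levels each side for 4-bit
    scale = np.abs(w).max() / qmax          # one scale per block
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(np.float32)   # dequantize

for bits in (3, 4, 8):
    err = float(np.abs(w - quantize_roundtrip(w, bits)).mean())
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

Each bit you drop roughly doubles the rounding error; at 3 bits the grid is coarse enough that model behavior visibly changes.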

When Q3 Actually Makes Sense

Q3 is extreme compression, and it shows. Hallucinations increase, reasoning gets flaky. Avoid it for tasks requiring nuance.

But if you’re running a small model for classification or simple extraction, Q3 is legitimately useful:

# Example: Q3 works fine for intent detection
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'neural-chat:latest',
    'prompt': 'Classify as "support", "sales", or "billing": "My invoice is wrong"',
    'stream': False,
    'num_predict': 20
})
print(response.json()['response'])
# Output: "billing" ✓
# Q3 handles this well. No need for Q5.

When Q3 is reasonable:

- Classification and intent detection with a fixed label set
- Simple extraction where the answer is short and easy to verify
- Severely VRAM-constrained hardware where nothing larger fits

When Q3 fails:

- Multi-step reasoning and math
- Long-form writing that needs nuance
- Code generation, where small errors compound
When Q5 or Q6 Actually Wins

Q5 sits between Q4 and Q6. It’s a good middle ground if you have the VRAM.

# VRAM usage comparison
# Running mistral 7B:
# Q4_K_M: ~6.5 GB
# Q5_K_M: ~7.5 GB
# Q6_K: ~8.5 GB
# On a 16GB GPU, Q5_K_M + some headroom = solid

Q5 is worth it if:

- Your GPU has a couple of GB to spare after the model loads
- The task is quality-sensitive: summarization, writing, RAG answers
- You aren't chasing maximum concurrent throughput

Q6 is overkill for most self-hosted scenarios. You’re paying heavily in VRAM for barely-noticeable quality gains over Q5.
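One way to make the call concretely: check free VRAM and take the largest quant that fits with headroom. A sketch assuming an NVIDIA GPU with `nvidia-smi` on the PATH; `pick_quant` and its size table (the 7B numbers from the comparison above) are illustrative, not authoritative:

```python
import subprocess

def free_vram_gb() -> float:
    # nvidia-smi reports free memory in MiB with these flags
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0]) / 1024

def pick_quant(free_gb: float) -> str:
    # leave ~2 GB headroom for KV cache and batch buffers
    for name, needs_gb in [("q6_k", 8.5), ("q5_k_m", 7.5),
                           ("q4_k_m", 6.5), ("q3_k_m", 5.0)]:
        if free_gb - 2 >= needs_gb:
            return name
    return "q3_k_m"  # last resort: smallest quant, hope it fits

# On a GPU box: print(pick_quant(free_vram_gb()))
print(pick_quant(16.0))  # 16 GB free -> "q6_k"
```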

Testing on Your Hardware

Don’t guess. Benchmark.

#!/bin/bash
# Test Q4 vs Q5 on your hardware
models=("mistral:7b-instruct-q4_k_m" "mistral:7b-instruct-q5_k_m")
prompt="Explain how OAuth 2.0 works in 100 tokens."

for model in "${models[@]}"; do
  echo "Testing $model..."
  echo "Loading model..."
  # First request (includes model load)
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"num_predict\": 100}" | \
    python3 -c "import sys, json; data=json.load(sys.stdin); print(f'First load: {data.get(\"load_duration\", 0)/1e9:.2f}s')"
  # Subsequent requests (model stays loaded)
  for i in {1..3}; do
    curl -s http://localhost:11434/api/generate \
      -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"num_predict\": 100}" | \
      python3 -c "import sys, json; data=json.load(sys.stdin); print(f'Request $i: {data.get(\"eval_duration\", 0)/1e9:.2f}s')"
  done
  echo ""
done

Watch for:

- Tokens per second, not just total wall-clock time
- The gap between first load and warm requests
- VRAM spillover: if a quant doesn't fit, layers get offloaded to CPU and speed craters
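Ollama's /api/generate response carries eval_count and eval_duration (in nanoseconds), so tokens per second is one division away. A minimal sketch; the sample numbers are fabricated:

```python
def tokens_per_second(resp: dict) -> float:
    # eval_duration is nanoseconds of generation time for eval_count tokens
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

sample = {"eval_count": 100, "eval_duration": 2_500_000_000}  # fabricated
print(tokens_per_second(sample))  # 100 tokens in 2.5 s -> 40.0
```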

The Real Trade-off: Batch Size vs. Quantization

Here’s what nobody mentions: you can often get better throughput by staying at Q4 and increasing batch size, rather than dropping to Q3.

Scenario: You're serving API requests to 50 concurrent users
Option A: Q3, batch size 1
- VRAM: 3GB
- Throughput: 50 sequential requests (slow)
Option B: Q4_K_M, batch size 8
- VRAM: 5GB (model) + 2GB (batch) = 7GB
- Throughput: 8 requests in parallel (much faster)
- Quality: Noticeably better

If you have 8GB VRAM or more, batch inference at Q4 beats solo Q3 in almost every metric.
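The effect shows up even with a stub standing in for real inference. In this sketch, generate fakes a model call with a sleep; against a real server you'd POST to Ollama instead, and raise the OLLAMA_NUM_PARALLEL environment variable so the server actually serves requests in parallel:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    time.sleep(0.05)              # stand-in for inference latency
    return f"reply to: {prompt}"

prompts = [f"request {i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    replies = list(pool.map(generate, prompts))
parallel_s = time.perf_counter() - start

start = time.perf_counter()
for p in prompts:
    generate(p)
serial_s = time.perf_counter() - start

print(f"serial {serial_s:.2f}s vs parallel {parallel_s:.2f}s for {len(replies)} requests")
```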

The Honest Recommendation

Test on your own hardware with your own workload. The benchmark that matters is yours, not someone else's blog post.

How to Check a Model’s Current Quantization

# Ollama: check model details
ollama show mistral:latest --modelfile | grep FROM
# llama.cpp: check file info
./llama-cli --info --model ./mistral-7b-q4_k_m.gguf
# Just from the filename — Q4_K_M.gguf means Q4_K_M quantization
ls -lah *.gguf

The GGUF filename usually tells you everything you need: mistral-7b-instruct-v0.2.Q4_K_M.gguf is a 7B Mistral model, instruction-tuned, Q4_K_M quantization.
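That convention is regular enough to parse mechanically. A small sketch; the regex covers the common suffixes (Q-levels, K-quant variants, F16/F32), not every scheme in the wild:

```python
import re

def quant_from_filename(name):
    # matches e.g. Q4_K_M, Q6_K, Q8_0, F16 when followed by a dot
    m = re.search(r"(?i)(Q\d(?:_[K0-9](?:_[SML])?)?|F16|F32)(?=\.)", name)
    return m.group(1).upper() if m else None

print(quant_from_filename("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))  # Q4_K_M
```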

