The Myth of Q4_K_M
Everyone says Q4_K_M is the sweet spot: “4-bit, works great, barely any quality loss.” It’s the default recommendation, the safe choice. But here’s the truth: it’s optimal for most people’s hardware and most use cases. Not all.
If you’re always running Q4_K_M without testing alternatives, you’re leaving performance or quality on the table.
Understanding Quantization Notation
Quantization compresses a model by reducing the precision of its weights.
- Q3_K_M = 3-bit quantization, medium variant
- Q4_K_M = 4-bit quantization, medium variant (most common)
- Q5_K_M = 5-bit quantization, medium variant
- Q6_K = 6-bit quantization
- Q8 = 8-bit quantization (nearly full precision)
- F16 = 16-bit floating point (full precision)

The difference is file size and VRAM usage:
Llama 2 7B model sizes:

- F16 (full): ~13 GB
- Q8: ~7 GB
- Q6_K: ~6 GB
- Q5_K_M: ~5 GB
- Q4_K_M: ~4 GB
- Q3_K_M: ~3 GB

Lower bit depth = smaller model = lower VRAM = faster inference (usually). But quality degrades.
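Those sizes follow from simple arithmetic: file size scales roughly linearly with bits per weight. Here's a back-of-envelope sketch; the effective bits-per-weight values are rough assumptions (K-quants mix bit widths across tensors, so real files vary):

```python
# Back-of-envelope GGUF size from parameter count and effective bits per
# weight. The bpw values below are rough assumptions, not format specs.
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for name, bpw in [("F16", 16.0), ("Q8", 8.5), ("Q6_K", 6.6),
                  ("Q5_K_M", 5.7), ("Q4_K_M", 4.9), ("Q3_K_M", 3.9)]:
    print(f"{name}: ~{approx_size_gb(6.7e9, bpw):.1f} GB")
```

Plugging in Llama 2 7B's ~6.7B parameters reproduces the table above to within a few hundred megabytes.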
When Q3 Actually Makes Sense
Q3 is extreme compression, and it shows. Hallucinations increase, reasoning gets flaky. Avoid it for tasks requiring nuance.
But if you’re running a small model for classification or simple extraction, Q3 is legitimately useful:
```python
# Example: Q3 works fine for intent detection
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'neural-chat:latest',
    'prompt': 'Classify as "support", "sales", or "billing": "My invoice is wrong"',
    'stream': False,  # single JSON response instead of a token stream
    'options': {'num_predict': 20},  # sampler params go in "options"
})
print(response.json()['response'])

# Output: "billing" ✓
# Q3 handles this well. No need for Q5.
```

When Q3 is reasonable:
- Intent classification (support, sales, billing, etc.)
- Named entity extraction from structured text
- Simple category prediction
- High-volume inference on limited hardware
When Q3 fails:
- Long-form reasoning
- Code generation
- Creative writing
- Anything requiring coherent thought over 100+ tokens
When Q5 or Q6 Actually Wins
Q5 sits between Q4 and Q6. It’s a good middle ground if you have the VRAM.
```
# VRAM usage comparison
# Running mistral 7B:
#   Q4_K_M: ~6.5 GB
#   Q5_K_M: ~7.5 GB
#   Q6_K:   ~8.5 GB
#
# On a 16GB GPU, Q5_K_M + some headroom = solid
```

Q5 is worth it if:
- You have the VRAM (12GB+)
- You’re doing code generation (quality matters more)
- You need long-context accuracy (hallucinations compound)
- You want one model for everything (jack of all trades)
Q6 is overkill for most self-hosted scenarios. You’re paying heavily in VRAM for barely-noticeable quality gains over Q5.
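One way to make "you have the VRAM" concrete is a rough fit check: model weights plus KV cache plus runtime overhead must stay under your card's total. The constants below are illustrative assumptions (≈0.5 GB of fp16 KV cache per 1K tokens for a 7B Llama-style model, 1 GB overhead), not measurements:

```python
# Rough VRAM fit check. kv_gb_per_1k and overhead_gb are assumed values
# for a 7B Llama-style model with an fp16 KV cache; adjust for your setup.
def fits_in_vram(vram_gb, model_gb, ctx_tokens=4096,
                 kv_gb_per_1k=0.5, overhead_gb=1.0):
    needed = model_gb + (ctx_tokens / 1000) * kv_gb_per_1k + overhead_gb
    return needed <= vram_gb

print(fits_in_vram(16, 7.5))  # Q5_K_M-sized model on a 16 GB card -> True
print(fits_in_vram(8, 7.5))   # same model on an 8 GB card -> False
```

Note the context-length term: doubling the context eats real VRAM, which is why long-context work pushes you toward a smaller quant or a bigger card.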
Testing on Your Hardware
Don’t guess. Benchmark.
```bash
#!/bin/bash
# Test Q4 vs Q5 on your hardware

models=("mistral:7b-instruct-q4_k_m" "mistral:7b-instruct-q5_k_m")
prompt="Explain how OAuth 2.0 works in 100 tokens."

for model in "${models[@]}"; do
  echo "Testing $model..."

  # First request (includes model load); stream=false returns one JSON object
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false, \"options\": {\"num_predict\": 100}}" | \
    python3 -c "import sys, json; data=json.load(sys.stdin); print(f'First load: {data.get(\"load_duration\", 0)/1e9:.2f}s')"

  # Subsequent requests (model stays loaded)
  for i in {1..3}; do
    curl -s http://localhost:11434/api/generate \
      -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false, \"options\": {\"num_predict\": 100}}" | \
      python3 -c "import sys, json; data=json.load(sys.stdin); print(f'Request $i: {data.get(\"eval_duration\", 0)/1e9:.2f}s')"
  done
  echo ""
done
```

Watch for:
- Load time (first request; includes loading the model into VRAM)
- Eval time (how long to generate tokens)
- Quality (does output make sense?)
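Those same timing fields give you a throughput number directly. Ollama's non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds), so tokens/sec is one division:

```python
# Tokens/sec from Ollama's response timing fields (durations are in ns).
def tokens_per_sec(response: dict) -> float:
    return response['eval_count'] / (response['eval_duration'] / 1e9)

# Illustrative numbers, not a real measurement:
sample = {'eval_count': 100, 'eval_duration': 2_000_000_000}  # 100 tokens in 2 s
print(tokens_per_sec(sample))  # 50.0
```

Compare that figure across quantizations rather than raw request latency; it normalizes away differences in how many tokens each run actually produced.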
The Real Trade-off: Batch Size vs. Quantization
Here’s what nobody mentions: you can often get better throughput by staying at Q4 and increasing batch size, rather than dropping to Q3.
Scenario: you're serving API requests to 50 concurrent users.

Option A: Q3, batch size 1
- VRAM: 3 GB
- Throughput: 50 sequential requests (slow)

Option B: Q4_K_M, batch size 8
- VRAM: 5 GB (model) + 2 GB (batch) = 7 GB
- Throughput: 8 requests in parallel (much faster)
- Quality: noticeably better

If you have 8GB VRAM or more, batch inference at Q4 beats solo Q3 in almost every metric.
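The arithmetic behind that comparison is easy to sketch. The latencies below are made-up illustrative numbers (plug in your own benchmark results); the point is that batching amortizes compute, so a slightly slower batch can still crush sequential serving:

```python
import math

# Toy throughput model: wall-clock time to serve n_requests when you can
# process batch_size of them per pass. Latencies are assumed, not measured.
def total_time(n_requests: int, batch_size: int, time_per_batch: float) -> float:
    return math.ceil(n_requests / batch_size) * time_per_batch

q3_sequential = total_time(50, batch_size=1, time_per_batch=2.0)  # Q3, one at a time
q4_batched = total_time(50, batch_size=8, time_per_batch=2.5)     # Q4, 8 at a time
print(q3_sequential, q4_batched)  # 100.0 17.5
```

Even granting Q3 a faster per-request time, the batched Q4 configuration finishes the queue several times sooner.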
The Honest Recommendation
- Limited VRAM (4GB)? Q3 for simple tasks, Q4 for general use.
- Comfortable VRAM (8GB)? Q4_K_M, test Q5 if you care about quality.
- Plenty of VRAM (16GB+)? Q5_K_M is your default. Q6 only if you’re hitting quality issues.
- Running inference at scale? Don’t optimize quantization alone—optimize batch size + quantization together.
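The rules of thumb above collapse into a tiny helper. The thresholds are this article's heuristics, not hard limits, and the hypothetical `recommend_quant` function is just a starting point before you benchmark:

```python
# Heuristic starting-point picker; thresholds mirror the rules of thumb
# above and are not hard limits. Always benchmark on your own workload.
def recommend_quant(vram_gb: float, quality_critical: bool = False) -> str:
    if vram_gb < 8:
        return "Q3_K_M for simple tasks, Q4_K_M for general use"
    if vram_gb < 16:
        return "Q5_K_M" if quality_critical else "Q4_K_M"
    return "Q6_K" if quality_critical else "Q5_K_M"

print(recommend_quant(8))    # Q4_K_M
print(recommend_quant(24))   # Q5_K_M
```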
Test on your hardware with your workload. The benchmark that matters is yours, not someone’s blog post.
How to Check a Model’s Current Quantization
```bash
# Ollama: check model details
ollama show mistral:latest --modelfile | grep FROM

# llama.cpp: check file info
./llama-cli --info --model ./mistral-7b-q4_k_m.gguf

# Just from the filename — Q4_K_M.gguf means Q4_K_M quantization
ls -lah *.gguf
```

The GGUF filename usually tells you everything you need: mistral-7b-instruct-v0.2.Q4_K_M.gguf is a 7B Mistral model, instruction-tuned, Q4_K_M quantization.
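If you want that filename check programmatically, a small regex handles the common naming convention (convention only; nothing in the GGUF format guarantees a file is named this way):

```python
import re

# Extract the quantization tag from a conventionally named GGUF file.
# Returns None when the filename doesn't follow the usual pattern.
def quant_from_filename(name: str):
    m = re.search(r'[.\-_](Q\d(?:_[0-9K](?:_[SML])?)?|F16|F32)\.gguf$',
                  name, re.IGNORECASE)
    return m.group(1).upper() if m else None

print(quant_from_filename("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))  # Q4_K_M
print(quant_from_filename("llama-2-7b.Q8_0.gguf"))                  # Q8_0
```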