
GPU Memory Math: Will This Model Actually Fit?

By SumGuy 6 min read

The Question Everyone Asks

“I have a 24GB GPU. Can I run Llama 2 70B?”

The answer depends on quantization, context window, and batch size. And people always guess wrong.

The Basic Formula

Here’s what you actually need:

VRAM needed = (Model parameters × bytes per parameter) + KV cache + activation overhead

The first part is straightforward. A model’s weights are fixed—they’re just numbers stored on disk. But how many bytes depends on precision:

Example: Llama 2 7B, different quantizations

F16 (full precision):
7B × 2 bytes = 14 GB
Q8 (8-bit):
7B × 1 byte = 7 GB
Q4_K_M (4-bit):
7B × 0.5 bytes ≈ 3.5 GB

This is model weights alone. The rest is where people stumble.

Quantization Formats: The Trade-offs

If you’re using GGUF format (llama.cpp, Ollama, LM Studio), you have quantization options:

Format    Bytes/param  Quality loss               Speed penalty  When to use
────────  ───────────  ─────────────────────────  ─────────────  ─────────────────────────────
F16       2.0          None                       Baseline       RTX 4090 with headroom
Q8        1.0          Negligible                 Minimal        16GB+ GPUs, quality-first
Q5_K_M    0.625        Slight                     None           Sweet spot for most folks
Q4_K_M    0.5          Noticeable but acceptable  None           Budget GPUs, still solid
Q3_K_M    0.375        More noticeable            None           Squeeze large models onto 8GB

Real talk: Q4_K_M is where 90% of people land. The weights shrink to a quarter of their F16 size, inference is still fast, and quality is fine for chat.

The Overhead Gotcha

When you load a model for inference, the GPU needs extra memory for:

1. KV cache (conversation history cached in VRAM)

Every token in your context history gets cached as key-value pairs. The formula:

KV cache = 2 × batch_size × context_length × num_layers × hidden_dim × precision_bytes

For Llama 2 7B (32 layers, 4096 hidden dim) with 4K context, batch size 1, F16 cache: roughly 2 GB. For Llama 2 70B with 8K context: grouped-query attention shrinks the cached dimension to 1024, so roughly 2.5–3 GB rather than the ~21 GB full multi-head attention would need.
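The formula drops straight into code. The layer and dimension counts below are Llama 2's published configs; the function name is my own:

```python
def kv_cache_gb(batch_size, context_length, num_layers, kv_dim, precision_bytes=2):
    """KV cache: 2 tensors (K and V) per layer, per token, per sequence."""
    return 2 * batch_size * context_length * num_layers * kv_dim * precision_bytes / 1e9

# Llama 2 7B: 32 layers, 4096 hidden dim, F16 cache, 4K context
print(round(kv_cache_gb(1, 4096, 32, 4096), 2))  # 2.15

# Llama 2 70B: 80 layers, but GQA shrinks the cached dim to 1024
print(round(kv_cache_gb(1, 8192, 80, 1024), 2))  # 2.68
```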

2. Activation buffers (intermediate computations during inference)

Running forward passes generates temporary tensors. Budget 10–20% of model weight size.

3. Framework overhead (llama.cpp, vLLM, Ollama reserve buffers)

Another 5–10%.

Real-world total overhead: 15–30% extra on top of model weights. Always add it.

Llama 2 7B Q4_K_M example:
Model weights: 3.5 GB
KV cache (4K, F16): 2.1 GB
Activations: 0.5 GB (15% of model)
Framework: 0.4 GB
─────────────────────────
Total: 6.5 GB

Context Window = Cache Explosion

This is the killer variable. Larger context windows = massive KV cache.

Llama 2 7B Q4_K_M, batch size 1, F16 cache:
4K context: 3.5 GB model + 2.1 GB cache = 5.6 GB
8K context: 3.5 GB model + 4.3 GB cache = 7.8 GB
16K context: 3.5 GB model + 8.6 GB cache = 12.1 GB
32K context: 3.5 GB model + 17.2 GB cache = 20.7 GB

(llama.cpp can also quantize the cache itself via --cache-type-k / --cache-type-v, which claws some of this back at a quality cost.)

If you’re tight on VRAM, limit context per request:

Terminal window
# Ollama with 2K context instead of 4K (num_ctx goes under "options")
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "What is Docker?",
  "options": { "num_ctx": 2048 },
  "stream": false
}'
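Going the other way: given a VRAM budget for the cache, how much context fits? A sketch (the helper name and the Llama 2 7B defaults are my assumptions):

```python
def max_context(cache_budget_gb, num_layers=32, kv_dim=4096,
                precision_bytes=2, batch_size=1):
    """Largest context whose F16 KV cache fits the budget (defaults: Llama 2 7B)."""
    per_token_bytes = 2 * batch_size * num_layers * kv_dim * precision_bytes
    return int(cache_budget_gb * 1e9 // per_token_bytes)

print(max_context(1.0))  # 1907 tokens fit in 1 GB of cache
print(max_context(0.5))  # 953
```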

Reference Table: Common Models + Quantizations

Model            F16      Q8      Q5_K_M   Q4_K_M   Notes
──────────────── ───────  ──────  ───────  ───────  ───────────────────────────────
Mistral 7B       14 GB    7 GB    4.5 GB   3.5 GB   Fastest 7B
Llama 2 7B       14 GB    7 GB    4.5 GB   3.5 GB   Baseline
Llama 3 8B       16 GB    8 GB    5 GB     4 GB     Better than Llama 2
Llama 2 13B      26 GB    13 GB   8 GB     6.5 GB   Still single-GPU
Llama 3 70B      140 GB   70 GB   45 GB    35 GB    Needs A100/RTX 6000
Mixtral 8x7B     ~93 GB   ~47 GB  ~29 GB   ~23 GB   MoE: sparse compute, but all experts sit in VRAM

These are weights only. Add the KV cache on top: roughly 2 GB for a 7B model at 4K context with an F16 cache, and more with longer context or batching.

Batch Size Multiplier

Processing multiple requests in parallel multiplies KV cache:

Llama 2 7B Q4_K_M, 4K context, F16 cache:
Batch 1: 2.1 GB cache
Batch 4: 8.6 GB cache
Batch 8: 17.2 GB cache

Single user? Batch 1 is fine. Running an inference API with 10 concurrent users? Budget for batch 4–8.
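The linear scaling is easy to see by computing the per-token cost once:

```python
# Llama 2 7B: K&V × 32 layers × 4096 dim × 2 bytes (F16) per token
per_token_bytes = 2 * 32 * 4096 * 2  # 512 KiB/token

for batch in (1, 4, 8):
    cache_gb = batch * 4096 * per_token_bytes / 1e9
    print(f"batch {batch}: {cache_gb:.1f} GB")  # 2.1, 8.6, 17.2
```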

Quick Calculator

def vram_needed(model_params_b, quantization, context_window, batch_size=1):
    """Estimate VRAM in GB (rough heuristic)."""
    bits = {'f16': 16, 'q8': 8, 'q5': 5, 'q4': 4}
    # Model weights: params × bits per param, in bytes
    model_gb = (model_params_b * 1e9 * bits[quantization]) / (8 * 1e9)
    # KV cache at F16: ~75 KB per token per billion params
    # (overestimates GQA models like Llama 2 70B, which cache a smaller dim)
    cache_gb = (context_window * batch_size * model_params_b * 75_000) / 1e9
    # Activation buffers + framework overhead (~20% of weights)
    overhead_gb = model_gb * 0.20
    total = model_gb + cache_gb + overhead_gb
    return round(total, 2)

# Examples
print(vram_needed(7, 'q4', 4096, batch_size=1))   # 6.35 GB
print(vram_needed(70, 'q4', 4096, batch_size=1))  # 63.5 GB
print(vram_needed(13, 'q5', 8192, batch_size=4))  # 41.7 GB

Check Your Available VRAM

Before you start:

Terminal window
# NVIDIA GPUs
nvidia-smi --query-gpu=memory.total,memory.free --format=csv,noheader
# AMD GPUs (ROCm)
rocm-smi --showproductname --showmeminfo vram

Fallback: CPU Offloading

Don’t have enough VRAM? Offload layers to CPU (slower, but works):

Terminal window
# Ollama: set the num_gpu parameter (GPU layer count) inside a session
ollama run mistral
/set parameter num_gpu 10
# llama.cpp: use -ngl (number of GPU layers)
./main -m model.gguf -ngl 20 -p "Your prompt"

GPU layers run fast. CPU layers run ~10x slower. Mixed is a compromise.

The Decision Tree

A rough guide, reading the reference table against common cards (4K context, typical overhead):

8 GB card: 7B at Q4_K_M (13B only at Q3_K_M with short context)
12–16 GB: 7B at Q8, or 13B at Q4_K_M/Q5_K_M
24 GB: 13B at Q8, or 7B at F16
48 GB+: 70B at Q4_K_M

The Hard Truth

People underestimate overhead by 20–30%.

If you calculate “your model needs 4.5 GB,” don’t load it on a 6GB card. When inference spikes with activations, you’ll get OOM errors mid-generation.

Run the math, subtract 1GB as a safety buffer, and you’ll actually know what fits.

