Ollama Memory Management: Why Models Keep Loading

The VRAM Mystery

You fire up Ollama, load a model, run a few queries, then load a different model. Your GPU fills up. Then you load a third one and suddenly you’re swapping to CPU. What’s happening? Ollama doesn’t unload models between requests by default—it keeps them in VRAM.

This is actually intentional. A model sitting in GPU memory is fast. Reloading it from disk is slow. But if you’re juggling multiple models on limited hardware, this behavior gets painful fast.

TL;DR: Ollama holds loaded models in VRAM for 5 minutes after the last request. Override with OLLAMA_KEEP_ALIVE=<duration> globally, or pass "keep_alive": <duration> per request. Use "keep_alive": 0 to force-unload immediately. Inspect with ollama ps (newer builds) or nvidia-smi -l 1 live.

Jump to: Keep-alive tuning · Force-unload via API · Context window cost · Memory pressure & quantization · ollama ps

Understanding Ollama’s Memory Model

Ollama holds loaded models in VRAM until they time out or you explicitly unload them. The key setting is OLLAMA_KEEP_ALIVE.

By default, a model stays loaded for 5 minutes after you stop querying it. After that, Ollama unloads it to free VRAM.

# List the models installed on disk
curl http://localhost:11434/api/tags

# The response shows every installed model, not what is loaded, and says nothing about VRAM

To see actual GPU memory consumption, use your system tools:

# On NVIDIA
nvidia-smi -l 1    # Refresh every 1 second

# On AMD
watch -n 1 rocm-smi --showmeminfo vram

Watch nvidia-smi while you query a model. You’ll see the model’s weights loaded into VRAM, then sit there for 5 minutes.

Tuning Keep-Alive

Change how long Ollama keeps a model loaded with environment variables:

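# Start the server with a 30-second keep-alive for every model
OLLAMA_KEEP_ALIVE=30s ollama serve

# Or set it permanently for your login shell (Linux)
echo 'export OLLAMA_KEEP_ALIVE=2m' >> ~/.profile
source ~/.profile
# Restart ollama serve afterwards so it picks up the new value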

Keep-alive values:

30s: Aggressive unloading, good when you switch models often on a small GPU
2m: Reasonable middle ground, handles back-and-forth conversations
0: Unload immediately after each request
-1: Never unload (any negative value keeps models loaded until you unload them or restart)
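To make the value semantics concrete, here is a minimal Python sketch of how a keep-alive value can be read: bare numbers count as seconds, duration strings such as 30s or 1h30m are summed, zero means unload immediately, and any negative value means never unload. The helper names keep_alive_seconds and describe_keep_alive are hypothetical, and this is not Ollama's own parser; it only mirrors the documented behavior.

```python
import re

# Seconds per unit for the duration suffixes handled by this sketch.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def keep_alive_seconds(value):
    """Convert a keep-alive value to seconds (negative means 'never unload')."""
    if isinstance(value, (int, float)):
        return float(value)  # bare numbers are treated as seconds
    text = value.strip()
    if re.fullmatch(r"-?\d+(\.\d+)?", text):
        return float(text)  # numeric string, e.g. "0" or "-1"
    sign = -1.0 if text.startswith("-") else 1.0
    body = text.lstrip("+-")
    total, pos = 0.0, 0
    for match in re.finditer(r"(\d+(?:\.\d+)?)(ms|s|m|h)", body):
        if match.start() != pos:
            raise ValueError(f"unrecognized keep-alive value: {value!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(body):
        raise ValueError(f"unrecognized keep-alive value: {value!r}")
    return sign * total


def describe_keep_alive(value):
    """Summarize what a keep-alive value means for unloading."""
    seconds = keep_alive_seconds(value)
    if seconds < 0:
        return "never unload"
    if seconds == 0:
        return "unload immediately"
    return f"unload after {seconds:g}s idle"


if __name__ == "__main__":
    for sample in ("30s", "2m", 0, -1, "1h30m"):
        print(f"{sample!r}: {describe_keep_alive(sample)}")
```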

# Force an immediate unload the blunt way:
# stop Ollama and start it again (drops every loaded model)
pkill ollama
sleep 2
ollama serve

Checking Loaded Models Programmatically

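The /api/tags endpoint lists installed models, not loaded ones. On builds that have /api/ps, that endpoint reports what is resident, and nvidia-smi gives the device-wide total:

import subprocess

import requests

# Models currently resident in memory (requires a build with /api/ps)
response = requests.get('http://localhost:11434/api/ps', timeout=5)
response.raise_for_status()
models = response.json().get('models', [])

# Total VRAM in use; nvidia-smi prints one line per GPU
vram_output = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,nounits,noheader']
).decode()
vram_used = sum(int(line) for line in vram_output.splitlines() if line.strip())

print(f"Models loaded: {[m['name'] for m in models]}")
print(f"VRAM used: {vram_used} MiB")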

The Hard Choice: Context Windows vs. Memory

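Larger context windows = more VRAM per model, because the KV cache grows linearly with the context length. As a rough illustration, a 7B model with 4K context might use ~6GB VRAM. The same model with 32K context? ~14GB. Exact figures depend on the architecture and quantization.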
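The context cost can be estimated with a little arithmetic. The sketch below computes only the KV cache size, assuming 16-bit cache entries and two illustrative 7B-class layouts: 32 layers with 32 KV heads of dimension 128 (Llama-2-7B style) and 32 layers with 8 KV heads (Mistral-7B style grouped-query attention). It ignores weights, compute buffers, and runtime overhead, so treat the numbers as a lower bound, not a measurement.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed for the KV cache at a given context length."""
    # Factor of 2: each layer stores one key and one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


if __name__ == "__main__":
    for n_ctx in (2048, 4096, 32768):
        full_heads = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128)
        grouped = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128)
        print(f"num_ctx={n_ctx:>6}: "
              f"{full_heads / GIB:.2f} GiB (32 KV heads), "
              f"{grouped / GIB:.2f} GiB (8 KV heads)")
```

Under these assumptions the cache scales linearly: going from 4K to 32K multiplies it by eight, which is why num_ctx is usually the first knob to turn when VRAM is tight.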

If you’re maxing out VRAM, you have three options:

Reduce context window per request — Use num_ctx parameter
Switch to smaller quantization — Q3_K instead of Q5_K
Use CPU offloading — Trade speed for space

# Reduce context window to 2K (num_ctx must be nested under "options")
curl http://localhost:11434/api/generate \
  -d '{
    "model": "mistral",
    "prompt": "Why is memory hard?",
    "stream": false,
    "options": { "num_ctx": 2048 }
  }'
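For scripts, a small Python sketch of a context-limited request follows. It uses only the standard library; the function names build_generate_payload and generate are hypothetical, and the URL is the default local address used throughout this post. The key detail is that num_ctx sits inside the options object, which is where the API reads runtime parameters.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address (assumed)


def build_generate_payload(model, prompt, num_ctx):
    """Build an /api/generate body with a per-request context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # runtime parameters live under "options"
    }


def generate(model, prompt, num_ctx=2048, base_url=OLLAMA_URL):
    """Send the request and return the generated text."""
    body = json.dumps(build_generate_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


if __name__ == "__main__":
    print(generate("mistral", "Why is memory hard?", num_ctx=2048))
```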

The Real Problem: Unplanned Persistence

Ollama’s default behavior assumes you’re running one model repeatedly. If you’re switching between models constantly — or running multiple models via a load balancer — you need to be explicit about unloading.

The clean way to force an unload is the keep_alive request parameter. Set it to 0 on any inference endpoint and the model unloads as soon as the request returns. You can also fire a no-op request with an empty prompt purely to trigger the unload:

# Force-unload a specific model right now — no waiting, no restart
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-q4_k",
  "keep_alive": 0
}'

keep_alive works on /api/generate, /api/chat, and /api/embed. It accepts the same values as OLLAMA_KEEP_ALIVE — a duration string like "5m", a negative number for “never unload”, or 0 for “unload immediately”. Per-request beats global, so this overrides whatever the server default is.
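A hedged Python sketch of the same unload trick, extended to clear everything that is resident: it asks /api/ps for the loaded models, then sends each one a prompt-less request with keep_alive set to 0. The helper names (build_unload_payload, unload_model, unload_all) are hypothetical, /api/ps is assumed to be available on your build, and "stream": false is added only so the reply arrives as a single JSON object.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address (assumed)


def build_unload_payload(model):
    """Body for a prompt-less request that unloads `model` right away."""
    return {"model": model, "keep_alive": 0, "stream": False}


def _post_json(url, payload):
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def unload_model(model, base_url=OLLAMA_URL):
    """Ask the server to drop one model from memory."""
    return _post_json(f"{base_url}/api/generate", build_unload_payload(model))


def unload_all(base_url=OLLAMA_URL):
    """Unload every model currently reported by /api/ps; return their names."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as response:
        loaded = json.loads(response.read().decode("utf-8")).get("models", [])
    names = [entry["name"] for entry in loaded]
    for name in names:
        unload_model(name, base_url)
    return names


if __name__ == "__main__":
    print("Unloaded:", unload_all() or "nothing was loaded")
```

Calling unload_all() before loading a large model is a cheap way to start from an empty GPU without restarting the server.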

Some other workarounds when you genuinely can’t change the request shape:

# Monitor and log what's loaded every minute
watch -n 60 'nvidia-smi | grep ollama'

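# If you're running multiple instances, use separate ports
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# Each instance tracks only its own models. On a shared GPU they still compete
# for the same VRAM; with several cards, pin each instance using CUDA_VISIBLE_DEVICES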

Memory Pressure: What Happens When You Exceed VRAM

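You will run out of VRAM eventually. When Ollama tries to load a model and there’s not enough free VRAM, here’s what typically happens:

It unloads idle models to make room where it can
If the model still doesn’t fit, the layers that don’t fit are offloaded to the CPU and system RAM (slow path)
If system RAM fills up too, your system swap activates if available
Or the kernel OOMs and kills processes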

Watch for this in real time:

# Terminal 1: Monitor VRAM + swap in real-time
watch -n 1 'nvidia-smi --query-gpu=memory.used,memory.free --format=csv,nounits,noheader; free -h | grep Swap'

# Terminal 2: Load your model and start querying
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:latest",
  "prompt": "Explain quantum computing in 2000 words",
  "stream": false
}'

If VRAM fills and swap starts climbing, you’re in pain. Queries can slow down by 10-100x.
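One way to catch this early is to compare each loaded model's total size with the part that actually sits in VRAM. The sketch below assumes the /api/ps response carries both a size and a size_vram field per model (in bytes); the function names and the 0.999 threshold are illustrative choices, not part of Ollama.

```python
import json
import urllib.request


def gpu_fraction(entry):
    """Share of a loaded model that is resident in VRAM (0.0 to 1.0)."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return min(entry.get("size_vram", 0) / size, 1.0)


def spilled_models(ps_response, threshold=0.999):
    """Return (name, gpu_fraction) for models not fully resident on the GPU."""
    return [
        (entry["name"], gpu_fraction(entry))
        for entry in ps_response.get("models", [])
        if gpu_fraction(entry) < threshold
    ]


def fetch_ps(base_url="http://localhost:11434"):
    """Fetch the list of loaded models from a local Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    for name, fraction in spilled_models(fetch_ps()):
        print(f"{name}: only {fraction:.0%} on GPU, the rest runs from system RAM")
```

If this prints anything, shrinking num_ctx or picking a smaller quantization will usually help more than adding swap.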

The Quantization Solution

Smaller quantization = smaller model = less VRAM. Trade quality for space (sizes are rough and relative to FP16 weights):

Q8 (8-bit): close to original quality, ~50% of FP16 size
Q6 (6-bit): very small quality loss, ~40% of FP16 size
Q4_K (4-bit): modest quality loss, ~30% of FP16 size
Q3_K (3-bit): noticeable quality loss, ~25% of FP16 size

Example: A 13B model in Q8 needs roughly 14GB for its weights. Same model in Q4_K? Roughly 8GB.
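A back-of-the-envelope way to size weights is parameters times bits per weight divided by eight. The sketch below uses approximate bits-per-weight figures for common GGUF quantizations; those figures are assumptions for illustration, and real files differ slightly because some tensors are kept at higher precision. KV cache and runtime overhead come on top.

```python
# Approximate bits per weight for common GGUF quantizations (assumed values).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
}


def weight_size_gb(params_billions, quant):
    """Estimate weight size in GB (10^9 bytes) for a parameter count and quantization."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"13B @ {quant:<7} ~{weight_size_gb(13, quant):.1f} GB")
```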

# Pull a smaller quantization (tag names differ per model; check the model's tags page for the exact name)
ollama pull mistral:7b-q4_k

# Check available models
ollama list

For home lab setups, Q4_K is the sweet spot. You get reasonable quality without the VRAM debt.

Monitoring Loaded Models Over Time

Set up basic monitoring to catch creeping memory usage:

import time
from datetime import datetime

import requests

# /api/ps lists models resident in memory; /api/tags would list everything installed
url = 'http://localhost:11434/api/ps'
while True:
    try:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        models = [m['name'] for m in resp.json().get('models', [])]
        print(f"[{datetime.now().isoformat()}] Loaded: {', '.join(models) or 'none'}")
    except Exception as e:
        print(f"Error: {e}")

    # Sleep 30 seconds before checking again
    time.sleep(30)

Run this in a screen/tmux session and check periodically. Unload models hogging space with a keep_alive: 0 request, or restart Ollama as a last resort.

Checking Currently Loaded Models (Newer Ollama)

Newer Ollama releases (the ollama ps command landed around 0.1.38) expose a /api/ps endpoint that shows exactly what’s loaded in memory right now:

curl http://localhost:11434/api/ps | jq '.models[] | {name, size_vram}'

Output:

{
  "name": "mistral:7b-q4_k",
  "size_vram": 4073415168
}

That size_vram is bytes. Divide by 1073741824 for GiB. If it is smaller than the entry’s size field, part of the model is running from system RAM. Much better than guessing from nvidia-smi.

Running Ollama as a Systemd Service

If you’re running Ollama on a server, configure OLLAMA_KEEP_ALIVE in the systemd unit:

sudo systemctl edit ollama

[Service]
Environment="OLLAMA_KEEP_ALIVE=1m"
Environment="OLLAMA_MAX_LOADED_MODELS=2"

sudo systemctl daemon-reload
sudo systemctl restart ollama

OLLAMA_MAX_LOADED_MODELS caps how many models stay resident simultaneously. Set it to match your available VRAM divided by your typical model size. Ollama has reshuffled environment-variable names between releases, so if the variable is rejected on your version, run ollama help serve and check the current list — but the per-request keep_alive: 0 trick above works everywhere.
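The division is simple enough to script. This is a minimal sketch; the function name and the reserve_gb headroom parameter (space left for the KV cache, the desktop, and other GPU users) are assumptions for illustration, not Ollama settings.

```python
def max_loaded_models(total_vram_gb, typical_model_gb, reserve_gb=1.0):
    """Suggest a value for OLLAMA_MAX_LOADED_MODELS from VRAM and model size."""
    if typical_model_gb <= 0:
        raise ValueError("typical_model_gb must be positive")
    usable = total_vram_gb - reserve_gb
    # Never suggest fewer than one; a single oversized model simply spills to CPU.
    return max(1, int(usable // typical_model_gb))


if __name__ == "__main__":
    for vram, model in ((8, 5), (12, 5), (24, 5)):
        print(f"{vram} GB card, {model} GB models: "
              f"OLLAMA_MAX_LOADED_MODELS={max_loaded_models(vram, model)}")
```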

Related Reading

GPU Memory Math: How Much VRAM Do You Actually Need? — sizing models before you buy a card
Running Gemma 4 Locally with Ollama — concrete walkthrough on a single GPU
Ollama Multi-Model Setups — when you genuinely do want several models resident at once
Ollama Advanced Model Management — Modelfiles, tagging, and pruning

Bottom Line

Ollama’s memory persistence is a feature, not a bug. It’s optimized for the happy path: single model, repeated queries. If you’re running multiple models, either embrace the 5-minute window, tune OLLAMA_KEEP_ALIVE to your workflow, or split models across separate Ollama instances. Use /api/ps to see exactly what’s loaded. Understanding quantization lets you fit more models in less space. Your 2 AM self will thank you for understanding this before production breaks.
