Skip to content
Go back

Ollama Memory Management: Why Models Keep Loading

By SumGuy 7 min read
Ollama Memory Management: Why Models Keep Loading

The VRAM Mystery

You fire up Ollama, load a model, run a few queries, then load a different model. Your GPU fills up. Then you load a third one and suddenly you’re swapping to CPU. What’s happening? Ollama doesn’t unload models between requests by default—it keeps them in VRAM.

This is actually intentional. A model sitting in GPU memory is fast. Reloading it from disk is slow. But if you’re juggling multiple models on limited hardware, this behavior gets painful fast.

TL;DR: Ollama holds loaded models in VRAM for 5 minutes after the last request. Override with OLLAMA_KEEP_ALIVE=<duration> globally, or pass "keep_alive": <duration> per request. Use "keep_alive": 0 to force-unload immediately. Inspect with ollama ps (newer builds) or nvidia-smi -l 1 live.

Jump to: Keep-alive tuning · Force-unload via API · Context window cost · Memory pressure & quantization · ollama ps

Understanding Ollama’s Memory Model

Ollama holds loaded models in VRAM until they time out or you explicitly unload them. The key setting is OLLAMA_KEEP_ALIVE.

By default, a model stays loaded for 5 minutes after you stop querying it. After that, Ollama unloads it to free VRAM.

Terminal window
# Check what's currently loaded
curl http://localhost:11434/api/tags
# Response shows all models, but doesn't tell you VRAM usage directly

To see actual GPU memory consumption, use your system tools:

Terminal window
# On NVIDIA
nvidia-smi -l 1 # Refresh every 1 second
# On AMD
rocm-smi -i 0 --json | jq '..*[] | select(.mem_used) | .mem_used'

Watch nvidia-smi while you query a model. You’ll see the model’s weights loaded into VRAM, then sit there for 5 minutes.

Tuning Keep-Alive

Change how long Ollama keeps a model loaded with environment variables:

Terminal window
# Load model, keep it in VRAM for 30 seconds
OLLAMA_KEEP_ALIVE=30s ollama serve
# Or set it permanently (Linux)
echo 'OLLAMA_KEEP_ALIVE=2m' >> ~/.profile
source ~/.profile

Keep-alive values:

Terminal window
# Force immediate unload (pragmatically)
# Stop Ollama, restart it (hard unload)
pkill ollama
sleep 2
ollama serve

Checking Loaded Models Programmatically

The API doesn’t expose VRAM usage, but you can infer it:

import requests
import subprocess
import re
# Get list of loaded models
response = requests.get('http://localhost:11434/api/tags')
models = response.json()['models']
# Get NVIDIA VRAM usage
vram_output = subprocess.check_output(['nvidia-smi', '--query-gpu=memory.used', '--format=csv,nounits,noheader']).decode()
vram_used = int(vram_output.strip())
print(f"Models loaded: {[m['name'] for m in models]}")
print(f"VRAM used: {vram_used} MB")

The Hard Choice: Context Windows vs. Memory

Larger context windows = more VRAM per model. A 7B model with 4K context uses ~6GB VRAM. The same model with 32K context? ~14GB.

If you’re maxing out VRAM, you have three options:

  1. Reduce context window per request — Use num_ctx parameter
  2. Switch to smaller quantization — Q3_K instead of Q5_K
  3. Use CPU offloading — Trade speed for space
Terminal window
# Reduce context window to 2K
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Why is memory hard?",
"stream": false,
"num_ctx": 2048
}'

The Real Problem: Unplanned Persistence

Ollama’s default behavior assumes you’re running one model repeatedly. If you’re switching between models constantly — or running multiple models via a load balancer — you need to be explicit about unloading.

The clean way to force an unload is the keep_alive request parameter. Set it to 0 on any inference endpoint and the model unloads as soon as the request returns. You can also fire a no-op request with an empty prompt purely to trigger the unload:

Terminal window
# Force-unload a specific model right now — no waiting, no restart
curl http://localhost:11434/api/generate -d '{
"model": "mistral:7b-q4_k",
"keep_alive": 0
}'

keep_alive works on /api/generate, /api/chat, and /api/embed. It accepts the same values as OLLAMA_KEEP_ALIVE — a duration string like "5m", a negative number for “never unload”, or 0 for “unload immediately”. Per-request beats global, so this overrides whatever the server default is.

Some other workarounds when you genuinely can’t change the request shape:

Terminal window
# Monitor and log what's loaded every minute
watch -n 60 'nvidia-smi | grep ollama'
# If you're running multiple instances, use separate ports
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
# Each instance manages its own VRAM independently

Memory Pressure: What Happens When You Exceed VRAM

You will run out of VRAM eventually. When Ollama tries to load a model and there’s not enough space, here’s what happens:

  1. It writes the overflow to disk (slow path)
  2. Your system swap activates if available
  3. Or the kernel OOMs and kills processes

Watch for this in real time:

Terminal window
# Terminal 1: Monitor VRAM + swap in real-time
watch -n 1 'nvidia-smi --query-gpu=memory.used,memory.free --format=csv,nounits,noheader; free -h | grep Swap'
# Terminal 2: Load your model and start querying
curl http://localhost:11434/api/generate -d '{
"model": "mistral:latest",
"prompt": "Explain quantum computing in 2000 words",
"stream": false
}'

If VRAM fills and swap starts climbing, you’re in pain. Queries slow down 10-100x.

The Quantization Solution

Smaller quantization = smaller model = less VRAM. Trade quality for space:

Example: A 13B model in Q8 uses ~8GB. Same model in Q4_K? ~2GB.

Terminal window
# Pull a smaller quantization
ollama pull mistral:7b-q4_k
# Check available models
ollama list

For home lab setups, Q4_K is the sweet spot. You get reasonable quality without the VRAM debt.

Monitoring Loaded Models Over Time

Set up basic monitoring to catch creeping memory usage:

import requests
import json
from datetime import datetime
url = 'http://localhost:11434/api/tags'
while True:
try:
resp = requests.get(url)
models = [m['name'] for m in resp.json().get('models', [])]
print(f"[{datetime.now().isoformat()}] Loaded: {', '.join(models) or 'none'}")
except Exception as e:
print(f"Error: {e}")
# Sleep 30 seconds before checking again
import time
time.sleep(30)

Run this in a screen/tmux session and check periodically. Unload models hogging space by restarting Ollama if necessary.

Checking Currently Loaded Models (Newer Ollama)

Ollama 0.1.24+ added a /api/ps endpoint that shows exactly what’s loaded in memory right now:

Terminal window
curl http://localhost:11434/api/ps | jq '.models[] | {name, size_vram}'

Output:

{
"name": "mistral:7b-q4_k",
"size_vram": 4073415168
}

That size_vram is bytes. Divide by 1073741824 for GB. Much better than guessing from nvidia-smi.

Running Ollama as a Systemd Service

If you’re running Ollama on a server, configure OLLAMA_KEEP_ALIVE in the systemd unit:

Terminal window
sudo systemctl edit ollama
/etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=1m"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Terminal window
sudo systemctl daemon-reload
sudo systemctl restart ollama

OLLAMA_MAX_LOADED_MODELS caps how many models stay resident simultaneously. Set it to match your available VRAM divided by your typical model size. Ollama has reshuffled environment-variable names between releases, so if the variable is rejected on your version, run ollama help serve and check the current list — but the per-request keep_alive: 0 trick above works everywhere.

Bottom Line

Ollama’s memory persistence is a feature, not a bug. It’s optimized for the happy path: single model, repeated queries. If you’re running multiple models, either embrace the 5-minute window, tune OLLAMA_KEEP_ALIVE to your workflow, or split models across separate Ollama instances. Use /api/ps to see exactly what’s loaded. Understanding quantization lets you fit more models in less space. Your 2 AM self will thank you for understanding this before production breaks.


Share this post on:

Send a Webmention

Written about this post on your own site? Send a webmention and it'll show up above once verified.


Previous Post
Compression in 2026: zstd Changed the Game
Next Post
RAG on a Budget: Building a Knowledge Base with Ollama & ChromaDB

Discussion

Powered by Garrul . Sign in with GitHub or Google, or post anonymously.

Related Posts