Ollama Memory Management: Why Models Keep Loading

By SumGuy 6 min read

The VRAM Mystery

You fire up Ollama, load a model, run a few queries, then load a different model. Your GPU fills up. Then you load a third one and suddenly you’re swapping to CPU. What’s happening? Ollama doesn’t unload models between requests by default—it keeps them in VRAM.

This is actually intentional. A model sitting in GPU memory is fast. Reloading it from disk is slow. But if you’re juggling multiple models on limited hardware, this behavior gets painful fast.

Understanding Ollama’s Memory Model

Ollama holds loaded models in VRAM until they time out or you explicitly unload them. The key setting is OLLAMA_KEEP_ALIVE.

By default, a model stays loaded for 5 minutes after you stop querying it. After that, Ollama unloads it to free VRAM.
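The keep-alive window can also be overridden per request rather than server-wide: recent Ollama versions accept a `keep_alive` field in the `/api/generate` body. A minimal sketch (the `mistral` model name is just a placeholder for whatever you have pulled):

```python
def keep_alive_payload(model, prompt, keep_alive="10m"):
    """Build a /api/generate body with a per-request keep_alive.

    keep_alive accepts Go-style durations ("30s", "10m", "24h"),
    0 to unload the model right after the response, or -1 to keep
    it in VRAM indefinitely.
    """
    return {"model": model, "prompt": prompt,
            "stream": False, "keep_alive": keep_alive}

def generate(payload, base_url="http://localhost:11434"):
    import requests  # third-party; already used elsewhere in this post
    resp = requests.post(f"{base_url}/api/generate", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]

# Usage (needs a running Ollama server):
# print(generate(keep_alive_payload("mistral", "Say hi.")))
```

The per-request value wins over the server default, so a batch job can pin its model for an hour without changing how interactive queries behave.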

Terminal window
# List your local models (note: /api/tags shows what's pulled, not what's loaded)
curl http://localhost:11434/api/tags
# Response lists all pulled models, but doesn't tell you VRAM usage

To see actual GPU memory consumption, use your system tools:

Terminal window
# On NVIDIA
nvidia-smi -l 1 # Refresh every 1 second
# On AMD
rocm-smi --showmeminfo vram

Watch nvidia-smi while you query a model. You’ll see the model’s weights loaded into VRAM, then sit there for 5 minutes.

Tuning Keep-Alive

Change how long Ollama keeps a model loaded with environment variables:

Terminal window
# Load model, keep it in VRAM for 30 seconds
OLLAMA_KEEP_ALIVE=30s ollama serve
# Or set it permanently (Linux; note this only affects ollama serve
# started from your shell, not the systemd service)
echo 'export OLLAMA_KEEP_ALIVE=2m' >> ~/.profile
source ~/.profile

Keep-alive accepts Go-style durations ("30s", "10m", "24h"), a plain number of seconds, 0 (unload immediately after each response), or a negative value (keep the model loaded indefinitely).

Terminal window
# Blunt fallback: restart Ollama to force a hard unload of everything
pkill ollama
sleep 2
ollama serve

Checking Loaded Models Programmatically

The API doesn’t expose VRAM usage, but you can infer it:

import requests
import subprocess

# List models available locally (/api/tags shows what's pulled,
# not what's actually resident in VRAM)
response = requests.get('http://localhost:11434/api/tags')
models = response.json()['models']

# Get NVIDIA VRAM usage (one line per GPU; sum them)
vram_output = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=memory.used',
     '--format=csv,nounits,noheader']).decode()
vram_used = sum(int(line) for line in vram_output.split())

print(f"Models available: {[m['name'] for m in models]}")
print(f"VRAM used: {vram_used} MB")

The Hard Choice: Context Windows vs. Memory

Larger context windows = more VRAM per model. A 7B model with 4K context uses ~6GB VRAM. The same model with 32K context? ~14GB.
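Most of that growth is the KV cache, which scales linearly with context length. A back-of-envelope estimator, assuming Mistral-7B-like dimensions (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache); the real figure varies by architecture and cache quantization:

```python
def kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, one
    head_dim vector per KV head per token. Defaults are assumptions
    sketching a Mistral-7B-like model, not measured values."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

for ctx in (4096, 32768):
    gib = kv_cache_bytes(ctx) / 1024**3
    print(f"{ctx:>6} tokens -> ~{gib:.1f} GiB of KV cache")
```

An 8x jump in context means an 8x jump in cache, on top of the fixed cost of the weights.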

If you’re maxing out VRAM, you have three options:

  1. Reduce context window per request — Use num_ctx parameter
  2. Switch to smaller quantization — Q3_K instead of Q5_K
  3. Use CPU offloading — Trade speed for space
Terminal window
# Reduce context window to 2K (num_ctx goes inside "options")
curl http://localhost:11434/api/generate \
  -d '{
    "model": "mistral",
    "prompt": "Why is memory hard?",
    "stream": false,
    "options": { "num_ctx": 2048 }
  }'

The Real Problem: Unplanned Persistence

Ollama’s default behavior assumes you’re running one model repeatedly. If you’re switching between models constantly—or running multiple models via a load balancer—you need to be explicit about unloading.

Here’s the thing: there’s no dedicated “unload this model” endpoint. You wait for keep-alive to expire, send a request with keep_alive set to 0 (supported on recent versions), or restart Ollama.

Some workarounds:

Terminal window
# Monitor and log what's loaded every minute
watch -n 60 'nvidia-smi | grep ollama'
# If you're running multiple instances, use separate ports
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
# Each instance manages its own VRAM independently
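On recent Ollama versions there is also a lighter-weight eviction trick: a `/api/generate` request with an empty prompt and `keep_alive` set to 0 asks the server to drop that model right after responding, no daemon restart needed. A sketch:

```python
def unload_payload(model):
    """Request body that asks Ollama to evict `model` immediately:
    no prompt means nothing new is generated, and keep_alive=0 tells
    the server to unload the model right after responding."""
    return {"model": model, "keep_alive": 0}

def unload(model, base_url="http://localhost:11434"):
    import requests  # third-party; already used elsewhere in this post
    resp = requests.post(f"{base_url}/api/generate",
                         json=unload_payload(model), timeout=30)
    resp.raise_for_status()

# Usage (needs a running Ollama server):
# unload("mistral")
```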

Memory Pressure: What Happens When You Exceed VRAM

You will run out of VRAM eventually. When Ollama tries to load a model and there’s not enough space, here’s what happens:

  1. Ollama offloads some layers to system RAM and runs them on the CPU (slow path)
  2. Your system swap activates if RAM fills up too
  3. Or the kernel OOM killer steps in and kills processes

Watch for this in real time:

Terminal window
# Terminal 1: Monitor VRAM + swap in real-time
watch -n 1 'nvidia-smi --query-gpu=memory.used,memory.free --format=csv,nounits,noheader; free -h | grep Swap'
# Terminal 2: Load your model and start querying
curl http://localhost:11434/api/generate -d '{
"model": "mistral:latest",
"prompt": "Explain quantum computing in 2000 words",
"stream": false
}'

If VRAM fills and swap starts climbing, you’re in pain. Queries slow down 10-100x.
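Eyeballing `watch` works, but the same check is easy to automate: parse the CSV that `nvidia-smi --query-gpu` emits and flag when free VRAM drops below a floor. A sketch (the 1 GiB threshold is an assumption; tune it to your setup):

```python
import subprocess

def parse_gpu_csv(csv_line):
    """Parse one 'memory.used, memory.free' line (MiB) produced by
    nvidia-smi --query-gpu=memory.used,memory.free --format=csv,nounits,noheader
    into a pair of integers."""
    used, free = (int(field.strip()) for field in csv_line.split(","))
    return used, free

def vram_pressure(csv_line, min_free_mib=1024):
    """True when free VRAM falls below the chosen floor."""
    _, free = parse_gpu_csv(csv_line)
    return free < min_free_mib

def current_pressure():
    """Query the first GPU and check it (requires an NVIDIA driver)."""
    line = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.free",
         "--format=csv,nounits,noheader"]).decode().splitlines()[0]
    return vram_pressure(line)

# if current_pressure(): print("VRAM is nearly full; expect offloading")
```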

The Quantization Solution

Smaller quantization = smaller model = less VRAM. Trade quality for space:

Example: A 13B model in Q8_0 uses ~15GB. The same model in Q4_K_M? Around 8GB.
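The arithmetic behind those figures is simple: weight memory ≈ parameter count × bits per weight ÷ 8, plus some runtime overhead. A rough sketch (the bits-per-weight rates and the ~15% overhead factor are ballpark assumptions, not exact figures for any scheme):

```python
def model_weight_gib(params_billion, bits_per_weight, overhead=1.15):
    """Approximate resident size of quantized weights in GiB.
    Ballpark rates: Q8_0 ~ 8.5 bits/weight, Q4_K_M ~ 4.8 bits/weight
    (assumptions; exact rates depend on the quantization scheme)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return total_bytes / 1024**3

print(f"13B @ Q8_0:   ~{model_weight_gib(13, 8.5):.1f} GiB")
print(f"13B @ Q4_K_M: ~{model_weight_gib(13, 4.8):.1f} GiB")
```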

Terminal window
# Pull a smaller quantization
ollama pull mistral:7b-q4_k
# Check available models
ollama list

For home lab setups, Q4_K is the sweet spot. You get reasonable quality without the VRAM debt.

Monitoring Loaded Models Over Time

Set up basic monitoring to catch creeping memory usage:

import time
import requests
from datetime import datetime

url = 'http://localhost:11434/api/tags'

while True:
    try:
        resp = requests.get(url, timeout=5)
        models = [m['name'] for m in resp.json().get('models', [])]
        # /api/tags lists pulled models; newer Ollama has /api/ps for
        # what's actually resident in memory
        print(f"[{datetime.now().isoformat()}] Models: {', '.join(models) or 'none'}")
    except Exception as e:
        print(f"Error: {e}")
    # Sleep 30 seconds before checking again
    time.sleep(30)

Run this in a screen/tmux session and check periodically. Unload models hogging space by restarting Ollama if necessary.

Checking Currently Loaded Models (Newer Ollama)

Ollama 0.1.24+ added a /api/ps endpoint that shows exactly what’s loaded in memory right now:

Terminal window
curl http://localhost:11434/api/ps | jq '.models[] | {name, size_vram}'

Output:

{
"name": "mistral:7b-q4_k",
"size_vram": 4073415168
}

That size_vram is bytes. Divide by 1024³ (1,073,741,824) for GiB. Much better than guessing from nvidia-smi.
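That conversion is easy to script. A small sketch that turns an `/api/ps` response into a readable table (field names match the output shown above):

```python
def vram_table(ps_response):
    """Format the 'models' list from /api/ps as name + GiB lines."""
    lines = []
    for m in ps_response.get("models", []):
        gib = m["size_vram"] / 1024**3
        lines.append(f"{m['name']:<30} {gib:5.2f} GiB")
    return "\n".join(lines)

# Sample response shaped like the output above
sample = {"models": [{"name": "mistral:7b-q4_k", "size_vram": 4073415168}]}
print(vram_table(sample))
```

Feed it `requests.get('http://localhost:11434/api/ps').json()` on a live server.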

Running Ollama as a Systemd Service

If you’re running Ollama on a server, configure OLLAMA_KEEP_ALIVE in the systemd unit:

Terminal window
sudo systemctl edit ollama
# Add to /etc/systemd/system/ollama.service.d/override.conf:
[Service]
Environment="OLLAMA_KEEP_ALIVE=1m"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Terminal window
sudo systemctl daemon-reload
sudo systemctl restart ollama

OLLAMA_MAX_LOADED_MODELS caps how many models stay resident simultaneously. Set it to match your available VRAM divided by your typical model size.
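That sizing rule is just division with a floor, but it's worth writing down. A sketch with an assumed 2 GiB reserve for KV caches and driver overhead:

```python
def max_loaded_models(vram_gib, typical_model_gib, reserve_gib=2):
    """How many models fit: total VRAM minus a reserve (assumed
    2 GiB for KV caches and driver overhead), divided by the
    typical resident model size, never below 1."""
    return max(1, int((vram_gib - reserve_gib) // typical_model_gib))

print(max_loaded_models(24, 5))  # 24 GiB GPU, ~5 GiB models -> 4
```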

Bottom Line

Ollama’s memory persistence is a feature, not a bug. It’s optimized for the happy path: single model, repeated queries. If you’re running multiple models, either embrace the 5-minute window, tune OLLAMA_KEEP_ALIVE to your workflow, or split models across separate Ollama instances. Use /api/ps to see exactly what’s loaded. Understanding quantization lets you fit more models in less space. Your 2 AM self will thank you for understanding this before production breaks.
