
Running Multiple Ollama Models Without Running Out of RAM

By SumGuy 5 min read

The Problem

You want to use Mistral for one task, Llama for another. Ollama loads one model at a time. Switch between them, and the first unloads (eventually). But if both are large, VRAM fills up.

On 8GB VRAM, you can’t realistically run two 7B models in parallel. You need strategies.
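The arithmetic is quick to sketch. The sizes below are ballpark assumptions (roughly 4 GB of weights per Q4 7B model, plus KV cache and runtime overhead), not measured values:

```python
# Rough VRAM budget check: model weights + KV cache + runtime overhead.
# All sizes are ballpark assumptions, not measurements.
def fits_in_vram(model_sizes_gb, vram_gb=8.0, overhead_gb=1.5):
    """Return True if the models plausibly fit in VRAM together."""
    return sum(model_sizes_gb) + overhead_gb <= vram_gb

# Two Q4 7B models at ~4 GB each: 8 + 1.5 > 8, so they don't fit.
print(fits_in_vram([4.0, 4.0]))  # False
# One model leaves comfortable headroom.
print(fits_in_vram([4.0]))       # True
```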

Strategy 1: CPU Offloading (Slow, but It Works)

Let GPU handle part of the model, CPU handles the rest.

Terminal window
# Set environment variable before starting Ollama
OLLAMA_NUM_GPU=35 ollama serve
# Or for specific models
# OLLAMA_NUM_GPU=25 ollama run mistral

OLLAMA_NUM_GPU=35 means “offload 35 of the model’s layers to the GPU and run the rest on the CPU.”

This isn’t a magic bullet—CPU inference is 10–100x slower than GPU. But it allows running larger models on small GPUs.

Terminal window
# Example: 7B model with Q4 quantization
# Fully on GPU: ~4 GB VRAM
# 50% offloaded: ~2 GB VRAM, CPU holds the rest
# Inference speed: 10–20 tokens/sec instead of 100+
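The VRAM math behind that example can be sketched with a tiny helper. The layer count and model size are assumptions (a Q4 7B model at roughly 4 GB spread over about 32 layers); your model's actual numbers will differ:

```python
# Back-of-the-envelope VRAM estimate for partial GPU offload.
# Assumes a Q4 7B model: ~4 GB of weights spread evenly over ~32 layers.
def offload_vram_gb(gpu_layers, total_layers=32, model_gb=4.0):
    """VRAM used when only `gpu_layers` of `total_layers` sit on the GPU."""
    gpu_layers = min(gpu_layers, total_layers)  # can't offload more than exist
    return model_gb * gpu_layers / total_layers

print(offload_vram_gb(32))  # 4.0 -> fully on GPU
print(offload_vram_gb(16))  # 2.0 -> the 50% offload case from the example
```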

Strategy 2: Use Smaller Quantizations

Trade a little quality for a lot of VRAM: drop to a smaller quantization of the same model instead of dropping a model entirely.

Mistral 7B Q5: 5.5 GB
Mistral 7B Q4: 3.5 GB
Mistral 7B Q3: 2.5 GB

If you need two models simultaneously, could you use the same model at different quantizations? Or one full model + one Q3 version?

Terminal window
# Load Mistral Q4 for main work
ollama pull mistral:q4_k_m
# Also pull Q3 for backup/lightweight tasks
ollama pull mistral:q3_k_m

Then, in your application code, pick the tag per task:

if task == "complex_reasoning":
    model = "mistral:q4_k_m"
else:
    model = "mistral:q3_k_m"
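That if/else can be packaged as a small router. The model tags follow this article's examples and may differ from the tags in your local registry:

```python
# Map demanding task types to the heavier quantization;
# everything else falls back to the light model.
# Tag names follow the article's examples and may differ in your setup.
MODEL_BY_TASK = {
    "complex_reasoning": "mistral:q4_k_m",
}

def pick_model(task, default="mistral:q3_k_m"):
    """Choose a quantization based on how demanding the task is."""
    return MODEL_BY_TASK.get(task, default)

print(pick_model("complex_reasoning"))  # mistral:q4_k_m
print(pick_model("quick_tag"))          # mistral:q3_k_m
```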

Strategy 3: Run Multiple Ollama Instances on Different Ports

Each instance manages its own VRAM separately.

Terminal window
# Terminal 1: Heavy tasks on GPU
OLLAMA_HOST=127.0.0.1:11434 ollama serve
# Terminal 2: Light tasks (CPU offload or small models)
OLLAMA_HOST=127.0.0.1:11435 OLLAMA_NUM_GPU=0 ollama serve
# Terminal 3: Medium tasks (partial offload)
OLLAMA_HOST=127.0.0.1:11436 OLLAMA_NUM_GPU=20 ollama serve

Now distribute work:

import requests

def call_ollama(task_type, prompt):
    # Route each task type to its own Ollama instance
    ports = {
        'heavy': 11434,   # full GPU
        'light': 11435,   # CPU only
        'medium': 11436,  # partial offload
    }
    port = ports.get(task_type, 11434)
    response = requests.post(f'http://127.0.0.1:{port}/api/generate', json={
        'model': 'mistral',
        'prompt': prompt,
        'stream': False
    })
    return response.json()['response']

# Routes requests to different instances
result = call_ollama('heavy', 'Complex reasoning task')
result = call_ollama('light', 'Simple classification')

Each instance unloads models independently. Great for balancing load.

Strategy 4: Pre-calculate and Cache

Don’t run inference on every request. Cache results.

from functools import lru_cache
import requests

@lru_cache(maxsize=1000)
def generate_embedding_cached(text):
    # Embedding models use little VRAM; the cache saves you
    # from re-computing the same vector over and over
    response = requests.post('http://127.0.0.1:11434/api/embeddings', json={
        'model': 'nomic-embed-text:v1.5',
        'prompt': text
    })
    return tuple(response.json()['embedding'])

# First call: compute
embedding1 = generate_embedding_cached("How do I use Docker?")
# Second call: cache hit (instant)
embedding2 = generate_embedding_cached("How do I use Docker?")

This works great for RAG—documents get embedded once and repeat queries are free. Keep in mind that lru_cache only lives as long as the process.
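If you want embeddings to survive restarts, a disk-backed cache is a small step up. This is a minimal sketch; the cache directory name and the fake compute function are illustrative, and in practice `compute` would be whatever function calls the embeddings API:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("embedding_cache")  # illustrative location
CACHE_DIR.mkdir(exist_ok=True)

def cached_embedding(text, compute):
    """Return a cached embedding, computing and storing it on a miss.

    `compute` is whatever function actually runs the embedding model.
    """
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():                      # cache hit: no inference needed
        return json.loads(path.read_text())
    embedding = compute(text)              # cache miss: run the model once
    path.write_text(json.dumps(embedding))
    return embedding

# Fake compute function so the sketch runs without a server.
calls = []
fake = lambda t: (calls.append(t) or [0.1, 0.2, 0.3])
print(cached_embedding("How do I use Docker?", fake))  # computes and stores
print(cached_embedding("How do I use Docker?", fake))  # served from disk
print(len(calls))  # the model ran only once
```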

Strategy 5: The Keep-Alive Dance

Tune OLLAMA_KEEP_ALIVE to unload models more aggressively.

Terminal window
# Keep models in VRAM for only 30 seconds
OLLAMA_KEEP_ALIVE=30s ollama serve

Combined with request batching, you can simulate multi-model support:

Terminal window
# Pseudo-code workflow:
# 1. Run Mistral inference (model loads)
# 2. Query again immediately (model still warm, no reload)
# 3. Request Llama (Mistral unloads 30s after its last use)
# 4. Query Llama
# 5. Mistral is now unloaded, freeing ~4 GB VRAM
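Ollama also accepts a keep_alive field on individual /api/generate requests, so you can adjust it per call without restarting the server. A small payload builder sketches the idea (the helper name is ours, not part of Ollama):

```python
def generate_payload(model, prompt, keep_alive="30s"):
    """Build an /api/generate request body with an explicit keep_alive.

    keep_alive accepts durations like "30s" or "5m"; 0 tells Ollama to
    unload the model as soon as the response finishes.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }

# One-off task: free the VRAM immediately afterwards.
payload = generate_payload("mistral", "Summarize this log", keep_alive=0)
print(payload["keep_alive"])  # 0
```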

Practical Example: API with Smart Load Balancing

import requests
import time
from collections import defaultdict

class OllamaMultiManager:
    def __init__(self, base_url='http://127.0.0.1:11434', keep_alive=60):
        self.base_url = base_url
        self.keep_alive = keep_alive           # seconds a model stays loaded
        self.model_loads = defaultdict(float)  # model -> last load timestamp
        self.current_model = None

    def should_switch(self, requested_model):
        """Decide if we should switch models."""
        if self.current_model is None:
            return True
        # If same model, keep it loaded
        if self.current_model == requested_model:
            return False
        # If time since last load > keep_alive, the old model has already
        # been unloaded, so switching costs nothing extra
        time_since_load = time.time() - self.model_loads[self.current_model]
        if time_since_load > self.keep_alive:
            return True
        # Otherwise, only evict a warm model for priority work
        return requested_model.startswith('priority_')

    def generate(self, model, prompt):
        """Generate, falling back to the warm model when switching is too costly."""
        if self.should_switch(model):
            print(f"Switching to {model}")
            self.current_model = model
            self.model_loads[model] = time.time()
        else:
            # Requesting a different model here would load it anyway and
            # defeat the bookkeeping, so stick with the loaded one
            model = self.current_model
        response = requests.post(f'{self.base_url}/api/generate', json={
            'model': model,
            'prompt': prompt,
            'stream': False
        })
        return response.json()['response']

# Usage
manager = OllamaMultiManager(keep_alive=60)
# Heavy task
result1 = manager.generate('mistral:q4_k_m', 'Explain quantum computing')
# Quick task (same model, still warm)
result2 = manager.generate('mistral:q4_k_m', 'What is 2+2?')
# Different model (switches only once 60s have passed, or for priority work)
result3 = manager.generate('neural-chat:latest', 'Classify sentiment')

The Honest Trade-offs

Strategy              | VRAM    | Speed               | Complexity | Best For
----------------------|---------|---------------------|------------|------------------------------------
CPU offload           | 2–4 GB  | Slow (10–50 tok/s)  | Low        | Single large model on tiny GPU
Smaller quants        | 5–6 GB  | Fast (100+ tok/s)   | Very Low   | One model, acceptable quality loss
Multiple instances    | 8–16 GB | Very Fast           | Medium     | Load balancing across tasks
Caching               | Varies  | Instant (cache hit) | Low        | Repeated queries (RAG)
Aggressive keep-alive | 4–6 GB  | Normal              | Low        | Sequential model switching

What Actually Works

For most people with 8–16GB VRAM:

  1. Keep one model loaded (Q4 quantization, ~4–5 GB)
  2. Use CPU offload if you need a second model (set OLLAMA_NUM_GPU=20, accept slower speeds)
  3. Cache everything (embeddings, common queries)
  4. Tune keep-alive to 30–60 seconds (balance speed vs. VRAM reuse)
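Put together, that recommendation fits in a couple of launch lines. The values are this article's suggestions, not universal defaults:

```shell
# Main instance: one Q4 model on GPU, mid-range keep-alive
OLLAMA_KEEP_ALIVE=45s ollama serve

# Optional second instance for the occasional extra model (CPU only):
# OLLAMA_HOST=127.0.0.1:11435 OLLAMA_NUM_GPU=0 ollama serve
```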

You’re not building a massive multi-model inference cluster. You’re running a chatbot on a gaming laptop. Be pragmatic.

Your 2 AM self will appreciate it when the setup is simple and just works.

