The Problem
You want to use Mistral for one task, Llama for another. Ollama loads one model at a time. Switch between them, and the first unloads (eventually). But if both are large, VRAM fills up.
On 8GB VRAM, you can’t realistically run two 7B models in parallel. You need strategies.
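A quick back-of-the-envelope calculation shows why. This is a minimal sketch, assuming roughly 4.5 bits per weight for Q4 once quantization overhead is included (the exact figure varies by quant format):

```python
def vram_gb(params_billion, bits_per_weight):
    """Rough VRAM needed for the model weights alone (no KV cache)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Two 7B models at Q4 (~4.5 bits/weight with overhead)
one_model = vram_gb(7, 4.5)  # ~3.9 GB
both = 2 * one_model         # ~7.9 GB, before KV cache and CUDA overhead
```

Two quantized 7B models already eat nearly all of an 8GB card before you account for the KV cache, so parallel loading simply doesn't fit.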
Strategy 1: CPU Offloading (Slow, but It Works)
Let the GPU handle part of the model while the CPU handles the rest.
```shell
# Set environment variable before starting Ollama
OLLAMA_NUM_GPU=35 ollama serve

# Or for specific models
# OLLAMA_NUM_GPU=25 ollama run mistral
```

OLLAMA_NUM_GPU=35 means "put 35 layers on the GPU, the rest on the CPU."
This isn't a magic bullet: CPU inference is 10–100x slower than GPU. But it lets you run larger models on a small GPU.
```
# Example: 7B model with Q4 quantization
# Full VRAM: 4 GB
# GPU offload 50%: 2 GB VRAM + CPU fills the gap
# Inference speed: 10–20 tokens/sec instead of 100+ tokens/sec
```

Strategy 2: Use Smaller Quantizations
Switch between Q3 and Q5 instead of switching between models.
- Mistral 7B Q5: 5.5 GB
- Mistral 7B Q4: 3.5 GB
- Mistral 7B Q3: 2.5 GB

If you need two models simultaneously, could you use the same model at different quantizations? Or one full model + one Q3 version?
```shell
# Load Mistral Q4 for main work
ollama pull mistral:q4_k_m

# Also pull Q3 for backup/lightweight tasks
ollama pull mistral:q3_k_m
```

```python
# In your API, switch between them based on the task
if task == "complex_reasoning":
    model = "mistral:q4_k_m"
else:
    model = "mistral:q3_k_m"
```

Strategy 3: Run Multiple Ollama Instances on Different Ports
Each instance manages its own VRAM separately.
```shell
# Terminal 1: Heavy tasks on GPU
OLLAMA_HOST=127.0.0.1:11434 ollama serve

# Terminal 2: Light tasks (CPU offload or small models)
OLLAMA_HOST=127.0.0.1:11435 OLLAMA_NUM_GPU=0 ollama serve

# Terminal 3: Medium tasks (partial offload)
OLLAMA_HOST=127.0.0.1:11436 OLLAMA_NUM_GPU=20 ollama serve
```

Now distribute work:
```python
import requests

def call_ollama(task_type, prompt):
    ports = {
        'heavy': 11434,
        'light': 11435,
        'medium': 11436,
    }
    port = ports.get(task_type, 11434)
    response = requests.post(
        f'http://127.0.0.1:{port}/api/generate',
        json={'model': 'mistral', 'prompt': prompt, 'stream': False},
    )
    return response.json()['response']

# Routes requests to different instances
result = call_ollama('heavy', 'Complex reasoning task')
result = call_ollama('light', 'Simple classification')
```

Each instance unloads models independently. Great for balancing load.
Strategy 4: Pre-calculate and Cache
Don’t run inference on every request. Cache results.
```python
import requests
from functools import lru_cache

@lru_cache(maxsize=1000)
def generate_embedding_cached(text):
    # Ollama embeddings don't use much VRAM;
    # the cache prevents re-computing them.
    response = requests.post(
        'http://127.0.0.1:11434/api/embeddings',
        json={'model': 'nomic-embed-text:v1.5', 'prompt': text},
    )
    # lru_cache needs a hashable value, so return a tuple
    return tuple(response.json()['embedding'])

# First call: compute
embedding1 = generate_embedding_cached("How do I use Docker?")

# Second call: cache hit (instant)
embedding2 = generate_embedding_cached("How do I use Docker?")
```

This works great for RAG: you embed documents once and reuse them forever.
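One limitation: lru_cache dies with the process. For RAG you usually want embeddings to survive restarts, so here's a minimal disk-backed sketch. The cache directory name and the stand-in embed function are assumptions; swap in the real Ollama embedding call:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path('embedding_cache')  # hypothetical location
CACHE_DIR.mkdir(exist_ok=True)

def cached_embedding(text, embed_fn):
    """Return embed_fn(text), caching the result as JSON on disk."""
    key = hashlib.sha256(text.encode()).hexdigest()
    path = CACHE_DIR / f'{key}.json'
    if path.exists():
        return json.loads(path.read_text())  # cache hit: no inference
    vector = embed_fn(text)                  # cache miss: compute once
    path.write_text(json.dumps(vector))
    return vector

# Stand-in for the real Ollama embedding call above
fake_embed = lambda text: [float(len(text)), 0.0, 1.0]
v1 = cached_embedding("How do I use Docker?", fake_embed)
v2 = cached_embedding("How do I use Docker?", fake_embed)  # read from disk
```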
Strategy 5: The Keep-Alive Dance
Tune OLLAMA_KEEP_ALIVE to unload models more aggressively.
```shell
# Keep models in VRAM for only 30 seconds
OLLAMA_KEEP_ALIVE=30s ollama serve
```

Combined with request batching, you can simulate multi-model support:

```
# Pseudo-code workflow:
# 1. Run Mistral inference (loaded)
# 2. Query immediately (cache hit)
# 3. Switch to Llama (Mistral unloads after 30s)
# 4. Query Llama
# 5. Mistral is now unloaded, freeing 4GB VRAM
```

Practical Example: API with Smart Load Balancing
```python
import time
from collections import defaultdict

import requests

class OllamaMultiManager:
    def __init__(self, base_url='http://127.0.0.1:11434', keep_alive=60):
        self.base_url = base_url
        self.keep_alive = keep_alive
        self.model_loads = defaultdict(float)
        self.current_model = None

    def should_switch(self, requested_model):
        """Decide if we should switch models."""
        if self.current_model is None:
            return True

        # If same model, keep it loaded
        if self.current_model == requested_model:
            return False

        # If time since last load > keep_alive, model is unloaded
        time_since_load = time.time() - self.model_loads[self.current_model]
        if time_since_load > self.keep_alive:
            return True

        # Otherwise, decide based on priority
        return requested_model.startswith('priority_')

    def generate(self, model, prompt):
        """Generate with smart model switching."""
        if self.should_switch(model):
            print(f"Switching to {model}")
            self.current_model = model

        self.model_loads[model] = time.time()

        response = requests.post(
            f'{self.base_url}/api/generate',
            json={'model': model, 'prompt': prompt, 'stream': False},
        )
        return response.json()['response']
```
```python
# Usage
manager = OllamaMultiManager(keep_alive=60)

# Heavy task
result1 = manager.generate('mistral:q4_k_m', 'Explain quantum computing')

# Quick task (same model, cached)
result2 = manager.generate('mistral:q4_k_m', 'What is 2+2?')

# Different model (waits if needed, or switches if 60s passed)
result3 = manager.generate('neural-chat:latest', 'Classify sentiment')
```

The Honest Trade-offs
| Strategy | VRAM | Speed | Complexity | Best For |
|---|---|---|---|---|
| CPU offload | 2–4 GB | Slow (10–50 tok/s) | Low | Single large model on tiny GPU |
| Smaller quants | 5–6 GB | Fast (100+ tok/s) | Very Low | One model, acceptable quality loss |
| Multiple instances | 8–16 GB | Very Fast | Medium | Load balancing across tasks |
| Caching | Varies | Instant (cache hit) | Low | Repeated queries (RAG) |
| Aggressive keep-alive | 4–6 GB | Normal | Low | Sequential model switching |
What Actually Works
For most people with 8–16GB VRAM:
- Keep one model loaded (Q4 quantization, ~4–5 GB)
- Use CPU offload if you need a second model (set OLLAMA_NUM_GPU=20, accept slower speeds)
- Cache everything (embeddings, common queries)
- Tune keep-alive to 30–60 seconds (balance speed vs. VRAM reuse)
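On that last point, keep_alive can also be set per request in the /api/generate body (a duration string like "30s", or 0 to unload the model as soon as the response finishes), so individual tasks can carry their own policy. A sketch of building such payloads:

```python
def generate_payload(model, prompt, keep_alive='30s'):
    """Request body for Ollama's /api/generate with a per-request keep_alive.

    keep_alive accepts a duration like '30s' or '5m'; 0 asks Ollama to
    unload the model right after the response.
    """
    return {'model': model, 'prompt': prompt, 'stream': False,
            'keep_alive': keep_alive}

# One-off query: free the VRAM right away
oneoff = generate_payload('neural-chat:latest', 'Classify sentiment', keep_alive=0)
# Main model: keep it warm for a minute
main = generate_payload('mistral:q4_k_m', 'Explain quantum computing', '60s')
```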
You’re not building a massive multi-model inference cluster. You’re running a chatbot on a gaming laptop. Be pragmatic.
Your 2 AM self will appreciate it when the setup is simple and just works.