The Problem
You want to use Mistral for one task, Llama for another. Ollama loads one model at a time. Switch between them, and the first unloads (eventually). But if both are large, VRAM fills up.
On 8GB VRAM, you can’t realistically run two 7B models in parallel. You need strategies.
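A quick back-of-the-envelope calculation shows why. This is a minimal sketch, assuming roughly 4.5 bits per weight for Q4 once quantization overhead is included (the exact figure varies by quant format):

```python
def vram_gb(params_billion, bits_per_weight):
    """Rough VRAM needed for the model weights alone (no KV cache)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Two 7B models at Q4 (~4.5 bits/weight with overhead)
one_model = vram_gb(7, 4.5)  # ~3.9 GB
both = 2 * one_model         # ~7.9 GB, before KV cache and CUDA overhead
```

Two quantized 7B models already eat nearly all of an 8GB card before you account for the KV cache, so parallel loading simply doesn't fit.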
Strategy 1: CPU Offloading (Slow, but It Works)
Let the GPU handle part of the model while the CPU handles the rest.
```shell
# Set environment variable before starting Ollama
OLLAMA_NUM_GPU=35 ollama serve

# Or for specific models
# OLLAMA_NUM_GPU=25 ollama run mistral
```

OLLAMA_NUM_GPU=35 means "put 35 layers on the GPU, the rest on the CPU."
This isn't a magic bullet: CPU inference is 10–100x slower than GPU. But it lets you run larger models on a small GPU.
```
# Example: 7B model with Q4 quantization
# Full VRAM: 4 GB
# GPU offload 50%: 2 GB VRAM + CPU fills the gap
# Inference speed: 10–20 tokens/sec instead of 100+ tokens/sec
```

Strategy 2: Use Smaller Quantizations
Switch between Q3 and Q5 instead of switching between models.
- Mistral 7B Q5: 5.5 GB
- Mistral 7B Q4: 3.5 GB
- Mistral 7B Q3: 2.5 GB

If you need two models simultaneously, could you use the same model at different quantizations? Or one full model + one Q3 version?
```shell
# Load Mistral Q4 for main work
ollama pull mistral:q4_k_m

# Also pull Q3 for backup/lightweight tasks
ollama pull mistral:q3_k_m
```

```python
# In your API, switch between them based on the task
if task == "complex_reasoning":
    model = "mistral:q4_k_m"
else:
    model = "mistral:q3_k_m"
```

Strategy 3: Run Multiple Ollama Instances on Different Ports
Each instance manages its own VRAM separately.
```shell
# Terminal 1: Heavy tasks on GPU
OLLAMA_HOST=127.0.0.1:11434 ollama serve

# Terminal 2: Light tasks (CPU offload or small models)
OLLAMA_HOST=127.0.0.1:11435 OLLAMA_NUM_GPU=0 ollama serve

# Terminal 3: Medium tasks (partial offload)
OLLAMA_HOST=127.0.0.1:11436 OLLAMA_NUM_GPU=20 ollama serve
```

Now distribute work:
```python
import requests

def call_ollama(task_type, prompt):
    ports = {
        'heavy': 11434,
        'light': 11435,
        'medium': 11436,
    }
    port = ports.get(task_type, 11434)
    response = requests.post(
        f'http://127.0.0.1:{port}/api/generate',
        json={'model': 'mistral', 'prompt': prompt, 'stream': False},
    )
    return response.json()['response']

# Routes requests to different instances
result = call_ollama('heavy', 'Complex reasoning task')
result = call_ollama('light', 'Simple classification')
```

Each instance unloads models independently. Great for balancing load.
Strategy 4: Pre-calculate and Cache
Don’t run inference on every request. Cache results.
```python
import requests
from functools import lru_cache

@lru_cache(maxsize=1000)
def generate_embedding_cached(text):
    # Ollama embeddings don't use much VRAM;
    # the cache prevents re-computing them.
    response = requests.post(
        'http://127.0.0.1:11434/api/embeddings',
        json={'model': 'nomic-embed-text:v1.5', 'prompt': text},
    )
    # lru_cache needs a hashable value, so return a tuple
    return tuple(response.json()['embedding'])

# First call: compute
embedding1 = generate_embedding_cached("How do I use Docker?")

# Second call: cache hit (instant)
embedding2 = generate_embedding_cached("How do I use Docker?")
```

This works great for RAG: you embed documents once and reuse them forever.
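One limitation: lru_cache dies with the process. For RAG you usually want embeddings to survive restarts, so here's a minimal disk-backed sketch. The cache directory name and the stand-in embed function are assumptions; swap in the real Ollama embedding call:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path('embedding_cache')  # hypothetical location
CACHE_DIR.mkdir(exist_ok=True)

def cached_embedding(text, embed_fn):
    """Return embed_fn(text), caching the result as JSON on disk."""
    key = hashlib.sha256(text.encode()).hexdigest()
    path = CACHE_DIR / f'{key}.json'
    if path.exists():
        return json.loads(path.read_text())  # cache hit: no inference
    vector = embed_fn(text)                  # cache miss: compute once
    path.write_text(json.dumps(vector))
    return vector

# Stand-in for the real Ollama embedding call above
fake_embed = lambda text: [float(len(text)), 0.0, 1.0]
v1 = cached_embedding("How do I use Docker?", fake_embed)
v2 = cached_embedding("How do I use Docker?", fake_embed)  # read from disk
```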
Strategy 5: The Keep-Alive Dance
Tune OLLAMA_KEEP_ALIVE to unload models more aggressively.
```shell
# Keep models in VRAM for only 30 seconds
OLLAMA_KEEP_ALIVE=30s ollama serve
```

Combined with request batching, you can simulate multi-model support:

```
# Pseudo-code workflow:
# 1. Run Mistral inference (loaded)
# 2. Query immediately (cache hit)
# 3. Switch to Llama (Mistral unloads after 30s)
# 4. Query Llama
# 5. Mistral is now unloaded, freeing 4GB VRAM
```

Practical Example: API with Smart Load Balancing
```python
import time
from collections import defaultdict

import requests

class OllamaMultiManager:
    def __init__(self, base_url='http://127.0.0.1:11434', keep_alive=60):
        self.base_url = base_url
        self.keep_alive = keep_alive
        self.model_loads = defaultdict(float)
        self.current_model = None

    def should_switch(self, requested_model):
        """Decide if we should switch models."""
        if self.current_model is None:
            return True

        # If same model, keep it loaded
        if self.current_model == requested_model:
            return False

        # If time since last load > keep_alive, model is unloaded
        time_since_load = time.time() - self.model_loads[self.current_model]
        if time_since_load > self.keep_alive:
            return True

        # Otherwise, decide based on priority
        return requested_model.startswith('priority_')

    def generate(self, model, prompt):
        """Generate with smart model switching."""
        if self.should_switch(model):
            print(f"Switching to {model}")
            self.current_model = model

        self.model_loads[model] = time.time()

        response = requests.post(
            f'{self.base_url}/api/generate',
            json={'model': model, 'prompt': prompt, 'stream': False},
        )
        return response.json()['response']
```
```python
# Usage
manager = OllamaMultiManager(keep_alive=60)

# Heavy task
result1 = manager.generate('mistral:q4_k_m', 'Explain quantum computing')

# Quick task (same model, cached)
result2 = manager.generate('mistral:q4_k_m', 'What is 2+2?')

# Different model (waits if needed, or switches if 60s passed)
result3 = manager.generate('neural-chat:latest', 'Classify sentiment')
```

The Honest Trade-offs
| Strategy | VRAM | Speed | Complexity | Best For |
|---|---|---|---|---|
| CPU offload | 2–4 GB | Slow (10–50 tok/s) | Low | Single large model on tiny GPU |
| Smaller quants | 5–6 GB | Fast (100+ tok/s) | Very Low | One model, acceptable quality loss |
| Multiple instances | 8–16 GB | Very Fast | Medium | Load balancing across tasks |
| Caching | Varies | Instant (cache hit) | Low | Repeated queries (RAG) |
| Aggressive keep-alive | 4–6 GB | Normal | Low | Sequential model switching |
What Actually Works
For most people with 8–16GB VRAM:
- Keep one model loaded (Q4 quantization, ~4–5 GB)
- Use CPU offload if you need a second model (set OLLAMA_NUM_GPU=20, accept slower speeds)
- Cache everything (embeddings, common queries)
- Tune keep-alive to 30–60 seconds (balance speed vs. VRAM reuse)
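On that last point, keep_alive can also be set per request in the /api/generate body (a duration string like "30s", or 0 to unload the model as soon as the response finishes), so individual tasks can carry their own policy. A sketch of building such payloads:

```python
def generate_payload(model, prompt, keep_alive='30s'):
    """Request body for Ollama's /api/generate with a per-request keep_alive.

    keep_alive accepts a duration like '30s' or '5m'; 0 asks Ollama to
    unload the model right after the response.
    """
    return {'model': model, 'prompt': prompt, 'stream': False,
            'keep_alive': keep_alive}

# One-off query: free the VRAM right away
oneoff = generate_payload('neural-chat:latest', 'Classify sentiment', keep_alive=0)
# Main model: keep it warm for a minute
main = generate_payload('mistral:q4_k_m', 'Explain quantum computing', '60s')
```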
You’re not building a massive multi-model inference cluster. You’re running a chatbot on a gaming laptop. Be pragmatic.
Your 2 AM self will appreciate it when the setup is simple and just works.