Ollama Model Management: Beyond ollama run

You’ve run ollama run llama3 at least once. Maybe you’ve tried Mistral or grabbed an exotic quantization variant. Now what? Most tutorials stop at the hello-world moment, pulling a model and chatting with it in the terminal. Real Ollama work happens when you stop treating it like a toy and start building around it.

Ollama is a production-grade LLM runtime that most people use like a video game. Let’s fix that.

The Basics (In 30 Seconds)

Before we go deeper, the recap: ollama pull llama3 grabs the model, ollama run llama3 starts it in interactive chat mode, and ollama list shows what you’ve got. You already know this. Moving on.

Modelfiles: Your Custom Assistant

Every model in Ollama is built from a Modelfile, a blueprint that defines the base model, system prompt, and behavioral parameters. You can inspect the one behind any model you’ve pulled with ollama show --modelfile llama3, but the real power is writing your own.

Here’s a practical example: a code-review assistant that’s opinionated about simplicity.

FROM mistral:latest

SYSTEM """You are a senior software engineer reviewing someone's code.
Be direct and sarcastic. Point out inefficiencies without being rude.
Always suggest specific fixes, not vague improvements.
Prefer simplicity over cleverness.
If the code is fine, say so—don't invent problems."""

PARAMETER temperature 0.3
PARAMETER top_p 0.8
PARAMETER top_k 40

Then:

$ ollama create code-review -f Modelfile
$ ollama run code-review

Now you’ve got a custom model with the exact behavior you want. No tweaking prompts in your chat interface: it’s baked in.

Key Modelfile directives:

FROM: Base model (required)
SYSTEM: System prompt injected on every conversation
PARAMETER: Temperature, top_p, top_k, context length, etc.
TEMPLATE: Advanced; customizes the prompt template format itself
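If you end up maintaining several custom assistants, generating the Modelfile text from code keeps them consistent. The sketch below is one way to do it, not an Ollama feature: render_modelfile is a hypothetical helper, and it only covers the FROM, SYSTEM, and PARAMETER directives listed above.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters.

    Hypothetical helper for illustration; it only emits FROM, SYSTEM, PARAMETER.
    """
    lines = [f"FROM {base}", "", f'SYSTEM """{system}"""', ""]
    # One PARAMETER line per setting, e.g. "PARAMETER temperature 0.3".
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    base="mistral:latest",
    system="You are a senior software engineer reviewing someone's code.",
    params={"temperature": 0.3, "top_p": 0.8, "top_k": 40},
)
print(modelfile)
```

Write the result to a file named Modelfile and the ollama create command from earlier works unchanged.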

GPU Layer Management: When Your Model Doesn’t Fit

By default, Ollama tries to load the entire model into VRAM. If it doesn’t fit, it offloads layers to RAM, which feels like failure but actually works.

The num_gpu parameter controls how many layers Ollama pushes to the GPU:

FROM llama3:70b

PARAMETER num_gpu 20

What does this do? It moves 20 layers to VRAM and leaves the rest in system RAM, where the CPU runs them. Inference is slower (system RAM has a fraction of VRAM’s bandwidth, and the CPU does far less math in parallel), but the model still runs.

The practical math:

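Llama 3 8B: ~16 GB at FP16, roughly 5 GB at the default 4-bit quantization; fits on a 24 GB card either way
Llama 3 70B: ~140 GB at FP16, roughly 40 GB at 4-bit; needs layer offloading on most consumer hardware
Check nvidia-smi during inference to watch VRAM usage; ollama ps shows how the model is split between CPU and GPU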

If you’re running tight on VRAM, reduce num_gpu. If you have headroom, maximize it. Test with ollama run and watch GPU memory in another terminal.
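For a starting value of num_gpu, some back-of-the-envelope arithmetic helps. This is a rough sketch under loud assumptions: every layer is treated as the same size, the 1.5 GB reserve for the KV cache and scratch buffers is a guess, and layers_that_fit is a made-up helper, not anything Ollama ships. Treat the result as a first try and adjust after watching real memory use.

```python
def layers_that_fit(model_gb: float, n_layers: int, free_vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM.

    Assumes equally sized layers and a guessed reserve for the KV cache
    and scratch buffers. A starting point for num_gpu, not a guarantee.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# A ~40 GB model file with 80 layers (assumed figures) on a 24 GB card.
print(layers_that_fit(model_gb=40.0, n_layers=80, free_vram_gb=24.0))  # prints 45
```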

Quantization Tradeoffs: Q4, Q8, and Float16

You’ve seen model names like llama3:latest and llama3:70b-q4_k_m. That suffix isn’t flavor text. It’s the quantization level, and it’s the reason a 70B model can run on consumer hardware at all.

Q4_K_M (4-bit, medium): Most popular, best quality-to-size ratio. 7B model ≈ 4 to 5 GB. Good for most use cases.

Q8_0 (8-bit): Almost no quality loss, takes ~2x the space of Q4. Use when VRAM is plentiful and you care about accuracy.

F16 (float16): Unquantized half precision, the format most weights are published in. ~2x Q8 size. Use if you’re fine-tuning or need research-grade outputs.

IQ1_M / IQ2_XXS (extreme): Aggressive compression, ~1 to 2 GB for 7B models, but a noticeable quality drop. Good for experimentation, bad for production.
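The sizes above fall out of simple arithmetic: parameter count times bits per weight. Here is a minimal sketch of that estimate. The bits-per-weight figures are approximations I'm assuming (K-quants mix block formats, so the real average varies by model), and approx_size_gb is an illustrative helper. Real files also carry some metadata on top.

```python
# Approximate average bits per weight; assumed values for illustration only.
BITS_PER_WEIGHT = {"q4_k_m": 4.8, "q8_0": 8.5, "f16": 16.0}


def approx_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimate model file size in GB from parameter count and quantization."""
    bits = BITS_PER_WEIGHT[quant]
    # billions of params * bits per weight / 8 bits per byte = gigabytes
    return n_params_billion * bits / 8


for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{approx_size_gb(7, quant):.1f} GB")
```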

Pull a model in multiple quantizations and benchmark it on your use case:

$ ollama pull mistral:latest       # (a 4-bit quantization by default)
$ ollama pull mistral:7b-q8_0
$ time ollama run mistral "Explain Docker in 2 sentences"
$ time ollama run mistral:7b-q8_0 "Explain Docker in 2 sentences"

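Measure speed and output quality. The first run of each model includes load time, so run each command twice and compare the second timings; ollama run --verbose also prints the eval rate in tokens per second. Q4 is the Goldilocks zone for most people.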
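Wall-clock time mixes model loading with generation. If you benchmark through the API instead, the non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), which give you generation speed on its own. A small sketch, where tokens_per_second is a helper name I made up and sample stands in for a real response:

```python
def tokens_per_second(result: dict) -> float:
    """Generation speed from an /api/generate response (eval_duration is in ns)."""
    seconds = result["eval_duration"] / 1e9
    return result["eval_count"] / seconds


# Stand-in for response.json(); these numbers are invented for the example.
sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Run the same prompt against each quantization and compare the numbers side by side.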

The REST API: Building Around Ollama

Everything you do in the terminal goes through Ollama’s REST API. Use it to build real workflows.

Basic generate (non-streaming):

import requests
import json

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "What's the capital of France?",
        "stream": False,
    },
)
result = response.json()
print(result["response"])

Chat with conversation history:

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral",
        "messages": [
            {"role": "user", "content": "You're a Docker expert. What's the difference between CMD and ENTRYPOINT?"},
            {"role": "assistant", "content": "[previous response]"},
            {"role": "user", "content": "Give me an example with Dockerfile syntax."},
        ],
        # /api/chat streams by default; ask for one JSON object instead
        "stream": False,
    },
)
print(response.json()["message"]["content"])

Streaming for real-time output:

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Write a haiku about containers", "stream": True},
    stream=True,
)
for line in response.iter_lines():
    if not line:  # skip blank keep-alive lines
        continue
    chunk = json.loads(line)  # one JSON object per line
    print(chunk["response"], end="", flush=True)

This is how you hook Ollama into web apps, Discord bots, CI/CD pipelines, or anything else.

Running Multiple Models Without OOM Hell

Ollama keeps a model in memory for five minutes after its last request by default. Load a second model and, if there isn’t room for both, the idle one gets unloaded. But what if you want two models running simultaneously?

Check what’s loaded:

$ ollama ps
NAME                ID              SIZE      PROCESSOR
mistral:latest      d1234567890ab   4.2 GB    GPU
llama3:8b           a5678901234cd   8.5 GB    CPU

Each model occupies its space. Simple arithmetic: 4.2 GB + 8.5 GB = 12.7 GB. If your VRAM is 24 GB, you’re fine. If it’s 8 GB, one model stays in RAM.

Control how long Ollama keeps a model in memory:

$ ollama run mistral --keepalive 10m "list Docker commands"

This keeps Mistral in memory for 10 minutes after the request finishes. Set to 0 to unload immediately.

Pro tip: If you’re building a service that switches between models, use the keep-alive parameter carefully. Long timeouts = higher peak memory. Short timeouts = slower response times on the next request.

Context Length: The Deceptive Limit

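Models have a context window, the number of tokens they can attend to at once. Llama 3 supports 8K tokens. That sounds like a lot until you realize it’s roughly 6,000 words, and the system prompt and conversation history can eat half of it. The deceptive part: Ollama’s own default num_ctx is often lower than the model’s maximum (it was 2,048 tokens for a long time), and anything that doesn’t fit gets truncated.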

Extend context length in a Modelfile:

FROM mistral:latest
PARAMETER num_ctx 16384

But here’s the catch: longer context = slower inference and more VRAM usage, because the KV cache grows linearly with num_ctx. A 32K context window might be 1.5 to 2x slower than 8K. Only extend when you genuinely need it (code review of a large file, summarizing a long document).
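To see where the extra VRAM goes, you can estimate the KV cache directly: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes an unquantized fp16 cache and uses the published shape of Llama 3 8B (32 layers, 8 KV heads, head dimension 128); kv_cache_gib is an illustrative helper, and real usage will differ somewhat with runtime overhead.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 num_ctx: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB, assuming an fp16 (2-byte) cache."""
    # Keys and values (the factor of 2) for every layer, KV head, and token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * num_ctx / 2**30


# Llama 3 8B shape: 32 layers, 8 KV heads, head_dim 128.
for ctx in (8192, 16384, 32768):
    print(f"num_ctx={ctx}: ~{kv_cache_gib(32, 8, 128, ctx):.1f} GiB")
```

Under these assumptions the cache costs about 1 GiB at 8K and 4 GiB at 32K, on top of the model weights.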

Integrating Ollama Into Real Tools

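Open WebUI: a polished chat UI that feels like ChatGPT. Drop-in replacement for the terminal (the --add-host flag makes host.docker.internal resolve on Linux):

$ docker run -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:latest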

Visit localhost:3000, connect to your Ollama instance, and you’ve got a modern interface with chat history, model switching, and prompt management.

Continue.dev: an AI coding assistant inside VS Code. Configure it to use your local Ollama instead of OpenAI:

Add to your ~/.continue/config.yaml:

models:
  - name: Mistral (local)
    provider: ollama
    model: mistral
    apiBase: http://localhost:11434

Highlight code, ask questions, get completions from your own hardware.

System Prompts That Actually Matter

Your system prompt shapes everything. Here are three that work in the real world:

Structured output (JSON):

You are a JSON generator. Respond ONLY with valid JSON, no other text.

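Use this when piping output to scripts, so you aren’t scraping JSON out of chatty prose. A prompt alone is a request, not a guarantee; the API’s "format": "json" option constrains the output more strictly.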

Summarization:

You are a ruthless editor. Distill the key points into 3–5 bullets.
Ignore filler. Be specific. Include numbers if present.

Roleplay (code review):

You are a senior engineer who has been burned by technical debt before.
Review this code for maintainability, not just correctness.
Point out what will hurt in 6 months, not what's broken today.

System prompts are free. Make them specific to what you’re building.
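For the structured-output prompt in particular, it helps to back the instruction with the API's format option and to parse the reply before trusting it. A sketch under those assumptions: json_request_body and parse_json_reply are hypothetical helpers, and the sample reply is invented to show the shape of a non-streaming /api/generate response.

```python
import json


def json_request_body(model: str, prompt: str) -> dict:
    """Body for POST /api/generate that asks Ollama to constrain output to JSON."""
    return {
        "model": model,
        "system": "You are a JSON generator. Respond ONLY with valid JSON, no other text.",
        "prompt": prompt,
        "format": "json",
        "stream": False,
    }


def parse_json_reply(result: dict) -> dict:
    """Parse the model's text as JSON; raises json.JSONDecodeError if it isn't."""
    return json.loads(result["response"])


# Invented stand-in for response.json().
sample = {"response": '{"image": "nginx", "port": 80}'}
print(parse_json_reply(sample)["port"])  # prints 80
```

Letting the parse error propagate is deliberate: a script downstream should fail loudly rather than act on malformed output.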

One More Thing

Ollama is a tool that rewards depth. Most people stay at the surface (ollama run, pick a model, chat). The moment you start writing Modelfiles, hitting the REST API, and managing GPU layers, it becomes powerful. Your self-hosted LLM isn’t a toy anymore; it’s infrastructure.

Your 2 AM self will appreciate not having to pay OpenAI’s API bills.
