You’ve run ollama run llama3 at least once. Maybe you’ve tried Mistral or grabbed an exotic quantization variant. Now what? Most tutorials stop at the hello-world moment—pulling a model and chatting with it in the terminal. Real Ollama work happens when you stop treating it like a toy and start building around it.
Here’s the thing: Ollama is a production-grade LLM runtime that most people use like a video game. Let’s fix that.
The Basics (In 30 Seconds)
Before we go deeper, the recap: ollama pull llama3 grabs the model, ollama run llama3 starts it in interactive chat mode, and ollama list shows what you’ve got. You already know this. Moving on.
Modelfiles: Your Custom Assistant
Every model in Ollama is defined by a Modelfile: a blueprint that specifies the base model, system prompt, and behavioral parameters. You can inspect the one behind any model you've pulled with ollama show --modelfile llama3, but the real power is writing your own.
Here's a practical example: a code-review assistant with strong opinions.
FROM mistral:latest
SYSTEM """You are a senior software engineer reviewing someone's code.
Be direct and sarcastic. Point out inefficiencies without being rude.
Always suggest specific fixes, not vague improvements.
Prefer simplicity over cleverness.
If the code is fine, say so. Don't invent problems."""
PARAMETER temperature 0.3
PARAMETER top_p 0.8
PARAMETER top_k 40

Then:

$ ollama create code-review -f Modelfile
$ ollama run code-review

Now you've got a custom model with the exact behavior you want. No tweaking prompts in your chat interface: it's baked in.
Key Modelfile directives:
- FROM: Base model (required)
- SYSTEM: System prompt injected into every conversation
- PARAMETER: Temperature, top_p, top_k, context length, etc.
- TEMPLATE: Advanced. Customizes the prompt template format itself
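If you end up with several assistants that differ only in prompt and parameters, it can help to generate the Modelfile text from code. The sketch below is one way to do that; render_modelfile is a hypothetical helper, not part of Ollama, and it only covers the FROM, SYSTEM, and PARAMETER directives.

```python
def render_modelfile(base, system, params):
    """Build Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    # Triple quotes let the system prompt span several lines.
    lines.append(f'SYSTEM """{system}"""')
    for name, value in params.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    "mistral:latest",
    "You are a senior software engineer reviewing someone's code.",
    {"temperature": 0.3, "top_p": 0.8, "top_k": 40},
)
print(modelfile)
```

Write the result to a file named Modelfile and pass it to ollama create as shown earlier.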
GPU Layer Management: When Your Model Doesn’t Fit
By default, Ollama tries to load the entire model into VRAM. If it doesn’t fit, it offloads layers to RAM—which feels like failure but actually works.
The num_gpu parameter controls how many layers Ollama pushes to the GPU:
FROM llama3:70b
PARAMETER num_gpu 20

What does this do? It places 20 layers in VRAM and leaves the rest in system RAM, where the CPU processes them. Inference is slower (system RAM has far less bandwidth than VRAM, and the CPU handles that share of the work), but the model still runs.
The practical math:
- Llama 3 8B: ~16 GB at FP16, fits on a 24 GB card
- Llama 3 70B: ~140 GB at FP16 (roughly 40 GB at 4-bit), needs layer offloading on most consumer hardware
- Check nvidia-smi during inference to watch GPU memory use; ollama ps reports how each loaded model is split between CPU and GPU
If you’re running tight on VRAM, reduce num_gpu. If you have headroom, maximize it. Test with ollama run and watch GPU memory in another terminal.
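To pick a starting value for num_gpu, a back-of-the-envelope estimate is often enough. The sketch below assumes every layer is the same size and reserves a fixed amount of VRAM for the context cache and runtime overhead. Both are simplifications, and estimate_gpu_layers and its default reserve are illustrative, not anything Ollama provides.

```python
import math


def estimate_gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Rough count of layers that fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    # Keep some VRAM back for the context cache and runtime overhead.
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumed figures: a ~40 GB 4-bit 70B file with 80 layers on a 24 GB card.
layers = estimate_gpu_layers(model_gb=40.0, n_layers=80, vram_gb=24.0)
print(layers)
```

Treat the result as a first guess, then adjust num_gpu up or down while watching actual memory use.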
Quantization Tradeoffs: Q4, Q8, and Float16
You’ve seen model names like llama3:latest and llama3:70b-q4_k_m. That suffix isn’t flavor text—it’s quantization, and it’s the reason your 70B model fits on a laptop.
Q4_K_M (4-bit, medium): Most popular, best quality-to-size ratio. 7B model ≈ 4–5 GB. Good for most use cases.
Q8_0 (8-bit): Almost no quality loss, takes ~2x the space of Q4. Use when VRAM is plentiful and you care about accuracy.
F16 (float16): Unquantized 16-bit weights. ~2x Q8 size. Use if you're fine-tuning or need research-grade outputs.
IQ1_M / IQ2_XXS (extreme): Insane compression, ~1–2 GB for 7B models, but noticeable quality drop. Good for experimentation, bad for production.
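The size figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight. Here is a minimal sketch of that calculation. The bits-per-weight values are approximations I'm assuming for illustration (quantized formats store scaling data alongside the weights, so they come out slightly above 4 or 8 bits), and real file sizes vary by model.

```python
# Approximate bits per weight; actual files vary by model and tool version.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def estimate_size_gb(n_params_billion, quant):
    """Estimate model size in GB (10^9 bytes) from the parameter count."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billion * bits / 8


for quant in BITS_PER_WEIGHT:
    print(quant, round(estimate_size_gb(7, quant), 1))
```

For a 7B model this lands at roughly 4 GB, 7 GB, and 14 GB, which matches the ratios described above.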
Pull a model in multiple quantizations and benchmark it on your use case:
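$ ollama pull mistral:latest            # a 4-bit quantization by default
$ ollama pull mistral:7b-instruct-q8_0  # exact tag names vary; check the model's tags page
$ time ollama run mistral "Explain Docker in 2 sentences"
$ time ollama run mistral:7b-instruct-q8_0 "Explain Docker in 2 sentences"

Measure speed and output quality. The first run of each model also includes load time, so repeat each command before comparing. Q4 is the Goldilocks zone for most people.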
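Wall-clock time mixes model loading with generation. For a cleaner comparison, Ollama's /api/generate endpoint returns timing stats with each non-streaming reply, including eval_count (tokens generated) and eval_duration (in nanoseconds). The sketch below turns those two fields into tokens per second; tokens_per_second is a hypothetical helper and the sample numbers are made up.

```python
def tokens_per_second(result):
    """Generation speed from the stats in a non-streaming /api/generate reply."""
    # eval_duration is reported in nanoseconds, so convert to seconds first.
    seconds = result["eval_duration"] / 1e9
    return result["eval_count"] / seconds


# Made-up stats for illustration, not a real measurement.
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Run the same prompt against each quantization and compare the numbers alongside the output quality.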
The REST API: Building Around Ollama
Everything you do in the terminal goes through Ollama’s REST API. Use it to build real workflows.
Basic generate (non-streaming):
import requests
import json

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "What's the capital of France?",
        "stream": False,
    },
)
result = response.json()
print(result["response"])

Chat with conversation history:
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral",
        "messages": [
            {"role": "user", "content": "You're a Docker expert. What's the difference between CMD and ENTRYPOINT?"},
            {"role": "assistant", "content": "[previous response]"},
            {"role": "user", "content": "Give me an example with Dockerfile syntax."},
        ],
        # /api/chat streams by default; ask for a single JSON object instead.
        "stream": False,
    },
)
print(response.json()["message"]["content"])

Streaming for real-time output:
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Write a haiku about containers", "stream": True},
    stream=True,
)
# Each non-empty line is one JSON object carrying the next piece of text.
for line in response.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk["response"], end="", flush=True)

This is how you hook Ollama into web apps, Discord bots, CI/CD pipelines, or anything else.
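When you want the full text of a streamed reply rather than printing it piece by piece, you can accumulate the chunks. The sketch below works on any iterable of newline-delimited JSON lines, so it can be tried without a running server. collect_stream is a hypothetical helper, and the fake chunks only mimic the "response" and "done" fields of the streaming format.

```python
import json


def collect_stream(lines):
    """Join the text from newline-delimited JSON chunks until done is true."""
    parts = []
    for line in lines:
        if not line:  # skip blank lines between chunks
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Fake chunks shaped like /api/generate streaming output.
fake = [
    b'{"response": "Ships ", "done": false}',
    b"",
    b'{"response": "at dawn", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(fake))
```

With a live server, pass response.iter_lines() from a streaming request in place of the fake list.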
Running Multiple Models Without OOM Hell
Ollama keeps the active model in memory. Switch models, and the current one unloads (usually). But what if you want two models running simultaneously?
Check what’s loaded:
$ ollama ps
NAME            ID             SIZE    PROCESSOR
mistral:latest  d1234567890ab  4.2 GB  GPU
llama3:8b       a5678901234cd  8.5 GB  CPU

Each model occupies its space. Simple arithmetic: 4.2 GB + 8.5 GB = 12.7 GB. If your VRAM is 24 GB, you're fine. If it's 8 GB, one model stays in RAM.
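That arithmetic is easy to script when you are planning which models to keep loaded. The sketch below is a deliberately naive greedy plan, not a description of Ollama's scheduler, which makes its own decisions and can split a single model across CPU and GPU. plan_placement is a hypothetical helper.

```python
def plan_placement(model_sizes_gb, vram_gb):
    """Greedy sketch: place models on the GPU in order until VRAM runs out."""
    placement = {}
    free_gb = vram_gb
    for name, size_gb in model_sizes_gb.items():
        if size_gb <= free_gb:
            placement[name] = "GPU"
            free_gb -= size_gb
        else:
            placement[name] = "CPU"
    return placement


models = {"mistral:latest": 4.2, "llama3:8b": 8.5}
print(plan_placement(models, vram_gb=24))
print(plan_placement(models, vram_gb=8))
```

With 24 GB both models land on the GPU; with 8 GB the second one falls back to system RAM, matching the case described above.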
Control how long Ollama keeps a model in memory:
$ ollama run mistral "list Docker commands" --keepalive 10m

This keeps Mistral in memory for 10 minutes after the request finishes. Set it to 0 to unload immediately. API requests accept the same setting as a keep_alive field.
Pro tip: If you’re building a service that switches between models, use the keep-alive parameter carefully. Long timeouts = higher peak memory. Short timeouts = slower response times on the next request.
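In a service, the same control is available per request through the keep_alive field of the REST API. The sketch below only builds the request bodies, so it runs without a server. The helper names are hypothetical, and the unload request relies on the documented behavior that a generate call with no prompt and keep_alive set to 0 unloads the model.

```python
def generate_payload(model, prompt, keep_alive="5m"):
    """Body for POST /api/generate that also sets how long the model stays loaded."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }


def unload_payload(model):
    """Body for POST /api/generate that asks the server to unload the model."""
    # No prompt plus keep_alive 0 means: free the memory now.
    return {"model": model, "keep_alive": 0}


print(generate_payload("mistral", "list Docker commands", keep_alive="10m"))
print(unload_payload("mistral"))
```

Send either body with requests.post("http://localhost:11434/api/generate", json=payload).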
Context Length: The Deceptive Limit
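Models have a context window: the number of tokens they can attend to at once. Llama 3 supports 8K tokens. That sounds like a lot until you realize that's roughly 6,000 words, and you've already used half just loading the conversation history and system prompt. Ollama also applies its own default num_ctx, which in many releases is smaller than the model's maximum, so long prompts can be truncated well before 8K unless you raise it.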
Extend context length in a Modelfile:
FROM mistral:latest
PARAMETER num_ctx 16384

But here's the catch: longer context = slower inference and more VRAM usage. A 32K context window might be 1.5–2x slower than 8K. Only extend when you genuinely need it (code review of a large file, summarizing a long document).
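The extra VRAM comes mostly from the key/value cache, which grows linearly with context length. Here is a rough estimate under stated assumptions: 32 layers, 8 key/value heads, a head dimension of 128 (figures I'm assuming for a Mistral-7B-class model), and 16-bit cache entries. kv_cache_gib is a hypothetical helper, and real usage depends on the runtime and cache settings.

```python
def kv_cache_gib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate key/value cache size in GiB for a given context length."""
    # The factor of 2 covers keys and values, stored per layer and per token.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value
    return total_bytes / 2**30


for n_ctx in (8192, 16384, 32768):
    print(n_ctx, kv_cache_gib(n_ctx))
```

Under these assumptions each doubling of num_ctx doubles the cache, from about 1 GiB at 8K to about 4 GiB at 32K, on top of the model weights.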
Integrating Ollama Into Real Tools
Open WebUI — Polished chat UI that feels like ChatGPT. Drop-in replacement for the terminal:
$ docker run -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
    ghcr.io/open-webui/open-webui:latest

The --add-host line is what lets the container reach Ollama on a Linux host. Visit localhost:3000, connect to your Ollama instance, and you've got a modern interface with chat history, model switching, and prompt management.
Continue.dev — AI coding assistant inside VS Code. Configure it to use your local Ollama instead of OpenAI:
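Add a model entry to Continue's own config file (~/.continue/config.json in releases that use JSON config; newer releases use a YAML file, so check the Continue docs for your version):

"models": [
  {
    "title": "Mistral (local)",
    "provider": "ollama",
    "model": "mistral",
    "apiBase": "http://localhost:11434"
  }
]

Highlight code, ask questions, get completions from your own hardware.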
System Prompts That Actually Matter
Your system prompt shapes everything. Here are three that work in the real world:
Structured output (JSON):
You are a JSON generator. Respond ONLY with valid JSON, no other text.

Use this when piping output to scripts, and still validate the result before trusting it.
Summarization:

You are a ruthless editor. Distill the key points into 3–5 bullets.
Ignore filler. Be specific. Include numbers if present.

Roleplay (code review):

You are a senior engineer who has been burned by technical debt before.
Review this code for maintainability, not just correctness.
Point out what will hurt in 6 months, not what's broken today.

System prompts are free: make them specific to what you're building.
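Even with a strict JSON prompt, models sometimes wrap their answer in a Markdown code fence, so scripts should parse defensively. The sketch below strips one surrounding fence before parsing and lets json.loads raise on anything else; parse_json_reply is a hypothetical helper. Ollama's API also accepts a "format": "json" field on requests, which is worth combining with a prompt like the one above.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating one Markdown fence around it."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, then everything from the closing fence on.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


reply = FENCE + 'json\n{"ok": true, "items": [1, 2]}\n' + FENCE
print(parse_json_reply(reply))
print(parse_json_reply('{"ok": false}'))
```

Anything that still fails to parse raises json.JSONDecodeError, which is the signal to retry the request or log the raw reply.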
One More Thing
Ollama is a tool that rewards depth. Most people stay at the surface (ollama run, pick a model, chat). The moment you start writing Modelfiles, hitting the REST API, and managing GPU layers, it becomes powerful. Your self-hosted LLM isn’t a toy anymore—it’s infrastructure.
Your 2 AM self will appreciate not having to pay OpenAI’s API bills.