Ollama Model Management: Beyond ollama run
You know how to pull and run a model. Now learn Modelfiles, GPU layer tuning, the REST API, running multiple models without OOM-killing your server, and actually useful system prompts.
All the articles with the tag "ollama".
Most people use OpenAI's embeddings because they're easy, but local embedding models exist. How to pick one, and when the choice actually matters.
Before you download a 70B model, calculate whether it fits. The formulas, the gotchas, and a quick calculator you can actually use.
Google's Gemma 4 is the best open model they've shipped yet. Here's how to pull it, run it, and actually use it for real work with Ollama on your own hardware.
vLLM, llama.cpp, and Ollama all run local LLMs — compare throughput, memory use, GPU support, and which fits your hardware.
Ollama can load one model at a time on limited hardware. How to switch between models, use CPU offloading, and manage VRAM intelligently.
Why your GPU fills up with Ollama. How to inspect VRAM, tune keep-alive, force-unload models with a single request, and stop the reload pain in 2026.
Connect n8n to Ollama or any local LLM to build smart automations that classify, summarize, and triage — not just shuffle data around blindly.
Master Ollama with Modelfiles, GPU tuning, API usage, and performance tricks. Stop running 70B models on 8GB VRAM and wondering why everything is slow.
Temperature, top-p, top-k, context length — LLM inference parameters explained so you stop guessing why the model gives weird output.
GGUF, GGML, AWQ, GPTQ — LLM file formats and quantization levels explained: trade-offs between model quality, size, and inference speed.
LLaMA, Mistral, Falcon, GPT — the LLM landscape is crowded. Compare model families, sizes, licensing, and what each is actually good for.