
Running Gemma 4 Locally with Ollama

By SumGuy 6 min read

Google finally shipped an open model that doesn’t feel like a compromise. Gemma 4 is here, and if you’ve got a spare 16GB of VRAM and Ollama installed, you’re about thirty seconds away from having a genuinely competent AI assistant running entirely on your own hardware. No API calls, no phone-home telemetry, no waiting for rate limits to cool down at 2 AM.

This is the sweet spot you’ve been waiting for.

What Is Gemma 4, Anyway?

Gemma is Google’s family of open-source models, and Gemma 4 is their latest iteration. Unlike their previous releases, which always felt like they were playing catch-up, Gemma 4 actually trades blows with Llama 3.3 and Qwen 2.5 on most benchmarks. The 27B variant is the Goldilocks zone: smart enough to handle reasoning and coding, small enough to fit comfortably on gaming hardware.

Google also released multimodal variants, which means you can feed it images and get analysis back. No separate vision model required. Just one model tag, and you've got text and vision in one go.

The kicker? It’s genuinely open. No weird licensing gotchas. No “research use only” asterisks. You run it however you want.

Hardware: What Do You Actually Need?

Here's the short version, and I'll be straight with you: the 27B variant wants about 16GB of VRAM, and the smaller 9B fits in well under half that.

Those numbers assume you're using 4-bit quantization (Q4_K_M in GGUF parlance), which is the default. Full precision will eat far more VRAM; we're not going there today.

CPU inference is possible but glacially slow. You want GPU acceleration here. If you don’t have one, Ollama can still run it on CPU, but you’ll be waiting 3–5 seconds per token. Not fun for actual work.

Getting Gemma 4 Running: Three Commands

Install Ollama if you haven’t already (it’s at ollama.ai). Then:

Terminal window
ollama pull gemma4:27b

That pulls the 27B variant. Ollama’s naming is straightforward: smaller models get smaller tags. If you want the 9B, use gemma4:9b. If you want the multimodal version, use gemma4:27b-vision.

Once it’s downloaded (this’ll take a few minutes depending on your bandwidth), run it:

Terminal window
ollama run gemma4:27b

You’re now in an interactive session. Type prompts, get responses. It’s that simple.

The Comparison: How Does It Stack Up?

I ran the same prompt through Gemma 4 27B, Llama 3.3 70B, and Qwen 2.5 32B. Here’s what I found:

Coding: Gemma 4 27B keeps pace with Llama 3.3 70B on most tasks. For Python and TypeScript, it’s solid. C++ gets a little fuzzy, but nothing catastrophic. Qwen 2.5 still edges it out slightly on math-heavy logic problems.

Instruction following: Gemma 4 is absurdly good at parsing weird requests and getting the intent right. It respects constraints better than the others—if you tell it to keep an answer under 100 words, it actually does.

Reasoning: Here’s where Llama 3.3 still has the edge with its full 70B. Gemma 4 27B is competent but won’t win shootouts on chain-of-thought problems. It’s still better than most 13B models though.

Speed: Because it’s 27B instead of 70B, it’s faster to generate responses. You’ll notice it in real use—tighter feedback loop.

The real story: Gemma 4 27B is the model you actually run and use. Llama 3.3 70B is the one you dream about but can’t afford the VRAM for. Gemma wins on practicality.
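The speed difference is easy to measure yourself rather than take on faith. Ollama's non-streaming /api/generate response includes an eval_count (tokens generated) and eval_duration (in nanoseconds), so tokens per second falls out of one division. A rough sketch, assuming a local server on the default port (the helper names here are mine, not part of Ollama):

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_duration in nanoseconds; convert to tok/s."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str = "Count from 1 to 20.",
              url: str = "http://localhost:11434/api/generate") -> float:
    """One non-streaming generation; returns measured tokens per second."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return tokens_per_second(data["eval_count"], data["eval_duration"])

# Example (needs a running Ollama server):
# print(benchmark("gemma4:27b"))
```

Run it against two model tags and you've got a like-for-like comparison on your own hardware instead of someone else's benchmark table.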

Actually Using It: Three Paths

Path 1: Interactive CLI (What We Just Did)

Terminal window
ollama run gemma4:27b
>>> Write me a Python function that validates email addresses without regex.

Instant responses, no setup. Good for quick questions and learning.

Path 2: Open WebUI (The Comfortable Option)

Open WebUI is an Ollama-compatible web interface. Install it, point it at your local Ollama server, and you get a ChatGPT-like interface.

Terminal window
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui \
  ghcr.io/open-webui/open-webui:latest

Visit http://localhost:3000, select gemma4:27b, and you’ve got a web UI. Conversation history, message editing, everything you’d expect. This is how I actually use it day-to-day.

Path 3: API Calls (The Programmatic Route)

Ollama exposes a REST API on port 11434. You can curl it:

Terminal window
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:27b",
"prompt": "Explain Docker volumes in one paragraph",
"stream": false
}'

The "stream": false field tells Ollama to wait and return the full response as a single JSON object. If you want it token-by-token (useful for real-time UIs), set "stream": true and parse the JSONL output: one JSON object per line.
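Parsing that stream takes only a few lines. Each streamed line is a standalone JSON object whose response field carries the next chunk of text (empty on the final done line). A minimal sketch, assuming a local server on the default port (the function names are mine):

```python
import json
import urllib.request

def parse_stream_line(line: bytes) -> str:
    """Each JSONL line is a standalone object; text lives in 'response'
    (missing on the final 'done' line, so default to empty)."""
    return json.loads(line).get("response", "")

def generate_streaming(prompt: str, model: str = "gemma4:27b",
                       url: str = "http://localhost:11434/api/generate"):
    """Yield response chunks as Ollama produces them."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": True}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = parse_stream_line(line)
            if chunk:
                yield chunk

# Example (needs a running Ollama server):
# for token in generate_streaming("Explain Docker volumes in one paragraph"):
#     print(token, end="", flush=True)
```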

Multimodal: Feeding It Images

Got the vision variant? You can pass base64-encoded images:

Terminal window
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:27b-vision",
"prompt": "What is in this image?",
"images": ["iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg=="],
"stream": false
}'

That’s a 1x1 PNG encoded as base64. In practice, you’d read your actual image, run it through base64, and pass it. The model will describe what it sees. It’s not perfect—don’t expect it to OCR a spreadsheet—but for general image understanding, it works.
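Doing the "read your actual image, run it through base64" step from code is a one-liner with the standard library. A sketch, assuming a local server and the vision tag from above (helper names are mine):

```python
import base64
import json
import urllib.request

def encode_image(path: str) -> str:
    """Base64-encode an image file for Ollama's 'images' field."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def describe_image(path: str, prompt: str = "What is in this image?",
                   model: str = "gemma4:27b-vision",
                   url: str = "http://localhost:11434/api/generate") -> str:
    """Send one image plus a prompt; return the model's description."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "images": [encode_image(path)],
                          "stream": False}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (needs a running Ollama server and a real image file):
# print(describe_image("photo.jpg"))
```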

Customizing It: Modelfile Magic

You can customize Gemma 4’s behavior with a Modelfile. Create a file called Modelfile (no extension):

Modelfile
FROM gemma4:27b
SYSTEM You are a helpful coding assistant specializing in Python and Go. Be direct and concise. Avoid fluff.
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

That last line bumps the context window from the default 2048 to 4096 tokens. Now build it:

Terminal window
ollama create my-gemma4 -f Modelfile
ollama run my-gemma4

You’ve now got a custom version with a tailored system prompt and higher context length. Useful for specialized tasks.
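The custom model works from code too: Ollama's /api/chat endpoint accepts whatever name you gave ollama create, plus a messages list in the familiar role/content shape. A minimal sketch, assuming the my-gemma4 model built above (the helper names are mine):

```python
import json
import urllib.request

def build_chat_payload(model: str, messages: list) -> dict:
    """Assemble the request body for Ollama's /api/chat endpoint."""
    return {"model": model, "messages": messages, "stream": False}

def chat(messages: list, model: str = "my-gemma4",
         url: str = "http://localhost:11434/api/chat") -> str:
    """One round-trip; returns the assistant's reply text."""
    body = json.dumps(build_chat_payload(model, messages)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Example (needs a running Ollama server):
# print(chat([{"role": "user", "content": "Reverse a slice in Go."}]))
```

Because the SYSTEM prompt is baked into the Modelfile, the messages list only needs the conversation itself.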

What to Expect: Real-World Feel

Gemma 4 27B will give you responses that feel natural and informed. It won’t hallucinate wildly. It handles edge cases better than you’d expect from a 27B model. When it doesn’t know something, it actually says so instead of confidently making stuff up.

For personal projects, coding help, writing, and research—it’s legitimately competent. You’re not making do with a compromise. You’re getting genuine AI capability running entirely on your own hardware, no phone home, no API bills, no waiting for the provider’s servers to cool down.

That’s worth the 16GB VRAM investment.

One Last Thing

Ollama has a handy endpoint at http://localhost:11434/api/tags that lists the models you've pulled. Useful for scripting and monitoring. And if you want to serve it on your network instead of just localhost, use:

Terminal window
OLLAMA_HOST=0.0.0.0:11434 ollama serve

Now any machine on your network can hit your local LLM. Welcome to your own private AI infrastructure.
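For scripting against that /api/tags endpoint: it returns JSON with a models array, each entry carrying a name. A small sketch, assuming the default port (the helper names are mine):

```python
import json
import urllib.request

def model_names(tags_json: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_json.get("models", [])]

def installed_models(host: str = "http://localhost:11434") -> list:
    """Ask a local (or remote) Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return model_names(json.loads(resp.read()))

# Example (needs a reachable Ollama server):
# print(installed_models())
```

Point the host argument at another machine's address and you can monitor every Ollama box on your network from one script.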

Pull Gemma 4, run it tonight, and remember: this used to require a six-figure server budget. Now it just requires patience for the download.

