Your GPU is Sitting There. Your AI Model is Waiting. What Could Go Wrong?
Quite a bit, actually. But let’s back up.
Local AI inference has gone from “cool party trick” to “genuinely useful thing you can run on your home server” in about two years. Tools like Ollama, llama.cpp, and LocalAI have made it so that running a 7B or 13B language model locally is within reach of anyone with a halfway decent machine. The catch? GPU acceleration is where things get complicated, and “complicated” is doing a lot of heavy lifting in that sentence.
So let’s talk about CUDA, ROCm, and CPU inference — what they are, how to set them up in Docker, and which path will leave you with hair still attached to your head.
Why Your GPU Even Matters
Before diving into the three paths, it’s worth understanding why you’d bother with GPU acceleration at all.
When you run an LLM inference workload, the bottleneck is almost entirely memory bandwidth and parallel compute. A modern GPU has thousands of small cores (CUDA cores on NVIDIA, stream processors on AMD) that can crunch matrix multiplications in parallel — the exact kind of math that transformers love. Your CPU, by contrast, has maybe 16–32 cores that are individually more powerful but collectively much slower for this workload.
In practice, this means:
- CPU only (llama.cpp, 7B model): ~5–15 tokens/second on a modern machine
- NVIDIA 3090 (24GB VRAM): ~50–80 tokens/second
- NVIDIA 4090 (24GB VRAM): ~80–130 tokens/second
- AMD RX 7900 XTX (24GB VRAM): ~40–70 tokens/second (on a good day, with ROCm cooperating)
Tokens per second matters because it’s the difference between watching words trickle out painfully and having something that feels responsive. At 5 tokens/sec, a 200-word response (roughly 260 tokens) takes almost a minute. At 80 tokens/sec, it arrives in a few seconds.
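The arithmetic behind that claim, as a rough sketch (1.3 tokens per English word is a common rule of thumb, not an exact figure):

```shell
# Back-of-envelope response times at the speeds listed above
words=200
tokens=$((words * 13 / 10))   # ~260 tokens for a 200-word reply
cpu_secs=$((tokens / 5))      # at 5 tokens/sec (CPU)
gpu_secs=$((tokens / 80))     # at 80 tokens/sec (GPU)
echo "CPU: ~${cpu_secs}s  GPU: ~${gpu_secs}s"
```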
Path 1: NVIDIA + CUDA — The Easy Path (Relatively Speaking)
NVIDIA has been the de facto standard for GPU compute for over a decade. CUDA is their proprietary parallel computing platform, and the entire AI/ML ecosystem is built around it. PyTorch, TensorFlow, basically every framework you’ve heard of — CUDA first, everything else eventually.
Getting CUDA Working in Docker
The key piece here is the nvidia-container-toolkit. This is what lets Docker containers access your host GPU without you doing anything too cursed.
Step 1: Install the toolkit on your host
```shell
# On Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
Step 2: Verify it works
```shell
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
```
If you see your GPU listed, you’re golden.
Docker Compose for Ollama (NVIDIA)
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
```
That’s it. Pull a model with `docker exec ollama ollama pull llama3.2` and you’re running AI locally with full GPU acceleration. Honestly, it’s almost too easy — suspicious, even.
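Once a model is pulled, you can sanity-check the whole stack over Ollama’s HTTP API. The prompt is just an example; 11434 is Ollama’s default port from the Compose file above:

```shell
# Build the request payload, then hit Ollama's generate endpoint
payload='{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
curl -s http://localhost:11434/api/generate -d "$payload" || echo "Is the container running?"
```

A fast, coherent response here means Docker, the NVIDIA runtime, and the model are all wired up correctly.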
What GPUs Work Well
- RTX 3090 / 4090: Both have 24GB VRAM, excellent for 13B–34B models. The 4090 is significantly faster but costs significantly more.
- RTX 3080 (10GB): Fine for 7B models, starts sweating with 13B.
- RTX 4070 (12GB): Sweet spot for the price if you’re not running massive models.
- Datacenter cards (A100, H100): Yes they work, no I’m not jealous, I’m fine.
Path 2: AMD + ROCm — Technically Works, Emotionally Draining
Let me be real with you. ROCm (Radeon Open Compute) has come a long way. It’s no longer a complete disaster. It is, however, still a “check the support matrix three times before buying” situation.
AMD’s open-source GPU compute stack is genuinely impressive from an engineering standpoint. The problem is that “supported” and “actually runs inference well” aren’t always the same thing.
Which Cards Are Actually Supported
ROCm officially supports a subset of AMD GPUs. Support for the consumer RX 7000 series improved dramatically with ROCm 5.7+, but it’s not universal. As of early 2026:
- RX 7900 XTX / 7900 XT: Supported, works reasonably well
- RX 7800 XT / 7700 XT: Mixed results, YMMV
- RX 6000 series: Supported but older, some quirks
- RX 5000 series and below: Officially dropped or community-only support
The real gotcha: ROCm support on consumer cards sometimes lags behind the datacenter cards (MI250, MI300) by months. If you’re running an RX 7900 XTX and something doesn’t work, check if the fix exists for MI250 first — there’s a decent chance you can adapt it.
Docker Compose for Ollama (AMD/ROCm)
```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama-amd
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    security_opt:
      - seccomp:unconfined
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0  # May be needed for RX 7000 series

volumes:
  ollama_data:
```
Note that HSA_OVERRIDE_GFX_VERSION line. You might need it. You might not. Welcome to ROCm, where the answer is always “depends on your specific card and kernel version.”
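To save you one of those searches: the convention maps your GPU’s gfx target (visible via `rocminfo | grep gfx` on the host) to a dotted version — gfx1100 (RX 7900 series) becomes 11.0.0, gfx1030 (RX 6000 series) becomes 10.3.0. A tiny helper for the four-digit RDNA-era targets (the function name is mine, not part of any tool):

```shell
# Convert a 4-digit gfx target to the HSA_OVERRIDE_GFX_VERSION format.
# Only valid for 4-digit targets like gfx1100 or gfx1030.
gfx_to_hsa() {
  local g=${1#gfx}                    # strip the "gfx" prefix -> "1100"
  echo "${g:0:2}.${g:2:1}.${g:3:1}"   # "1100" -> "11.0.0"
}
gfx_to_hsa gfx1100   # → 11.0.0
gfx_to_hsa gfx1030   # → 10.3.0
```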
Host Setup for ROCm
```shell
# Add your user to the required groups
sudo usermod -aG video,render $USER

# Install ROCm (Ubuntu; assumes AMD's ROCm apt repository has already been added)
sudo apt install rocm-hip-runtime rocm-dev

# Verify -- Agent 1 is usually the CPU, Agent 2 your GPU
rocminfo | grep -A5 "Agent 2"
```
The Honest ROCm Experience
AMD GPU support is like that one friend who’s technically reliable but makes everything just slightly more complicated than it needs to be. The container image is ollama/ollama:rocm instead of the default. You need device passthrough instead of the clean NVIDIA runtime. The HSA_OVERRIDE_GFX_VERSION environment variable exists and you will need to Google which value to set.
But — and this is important — it does work. The RX 7900 XTX with 24GB VRAM at a lower price point than a 4090 is genuinely compelling if you’re willing to do the homework. Performance on inference workloads is roughly 60–75% of an equivalent NVIDIA card, which is acceptable for home use.
Path 3: CPU Inference — Surprisingly Not Terrible
Here’s a hot take: CPU inference in 2025–2026 is actually usable for smaller models.
llama.cpp (which underpins Ollama) has outstanding CPU optimization, including AVX2/AVX-512 SIMD instructions, multi-threading, and aggressive quantization support. A modern 12-core CPU can run a 7B model at 8–15 tokens/second. That’s slow by GPU standards, but it’s not unusable.
When CPU Makes Sense
- You don’t have a supported GPU
- You’re running smaller models (1B–7B)
- You’re doing batch processing where latency doesn’t matter
- Your GPU doesn’t have enough VRAM to fit the model (splitting layers between GPU and CPU is also an option)
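For that last case, Ollama exposes the layer split as the num_gpu option — how many layers to offload to the GPU, with the rest running on CPU. The value 20 below is an arbitrary starting point, not a recommendation; tune it until the model fits in your VRAM:

```shell
# Offload only the first 20 layers to the GPU; the remainder run on CPU
opts='{"model": "llama3.2", "prompt": "hello", "options": {"num_gpu": 20}, "stream": false}'
curl -s http://localhost:11434/api/generate -d "$opts" || echo "Is Ollama running?"
```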
CPU-Optimized Docker Compose
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-cpu
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=1

volumes:
  ollama_data:
```
No GPU flags needed — Ollama will just use the CPU if no GPU is available. The `OLLAMA_NUM_PARALLEL` setting limits concurrent requests so you don’t thrash your CPU serving multiple requests at once.
Quantization is Your Friend
Running a 7B model in full float16 on CPU is painful. Running it in Q4_K_M quantization is… actually fine? Quantized models store weights in roughly 4 bits with some clever rounding, which cuts memory usage by roughly 4x with minimal quality loss for most tasks. Ollama handles this automatically when you pull a model — the defaults are sensible.
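The memory math, as a rough weights-only sketch (the KV cache and runtime overhead add more on top, and Q4_K_M actually averages slightly over 4 bits per weight because some tensors stay at higher precision):

```shell
# Weights-only memory estimate for a 7B-parameter model: parameters x bits / 8
awk -v p=7000000000 'BEGIN {
  printf "fp16:   %.1f GB\n", p * 16  / 8 / 1e9
  printf "Q4_K_M: %.1f GB\n", p * 4.5 / 8 / 1e9   # ~4.5 bits/weight on average
}'
```

That 14 GB fp16 figure is also why a 7B model won’t fit on an 8 GB GPU without quantization.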
The Practical Decision Matrix
Let’s cut through it:
| Situation | Recommendation |
|---|---|
| Have NVIDIA GPU | Use CUDA, follow the Docker Compose above, done |
| Have AMD RX 7900 series | ROCm is worth trying, budget an afternoon |
| Have older AMD GPU | Check the ROCm support matrix first, might be CPU time |
| No dedicated GPU | CPU inference with small/medium models, surprisingly viable |
| Want to buy a GPU for AI | Buy NVIDIA. I know. I’m sorry. |
The “Just Buy NVIDIA” Reality Check
I don’t love saying this. AMD makes good hardware, their open-source commitment is admirable, and the price-to-VRAM ratio is often better. But the AI inference ecosystem was built on CUDA, and that matters when:
- You want to run the latest models with day-one support
- You want to use frameworks other than just Ollama (PyTorch, etc.)
- You want to spend an afternoon setting things up, not a weekend debugging ROCm issues
If you already have an AMD GPU, absolutely use it — ROCm is good enough now that you shouldn’t buy new hardware just for CUDA. But if you’re buying from scratch specifically for local AI? The 4070 Super at ~$600 with 12GB VRAM will cause you less grief than the equivalent AMD card, even if the AMD card looks better on a spec sheet.
Closing Thoughts
Running AI locally is legitimately exciting. The models have gotten good enough that a well-quantized 7B model running on your home server can handle a surprisingly wide range of tasks. The tooling (Ollama, Open WebUI, llama.cpp) has matured to the point where the Docker Compose snippets above genuinely work without cursing at your terminal for hours.
NVIDIA is still the easy path. AMD is the principled path that respects your freedom and occasionally tests your patience. CPU inference is the “I refuse to buy new hardware” path, and honestly, respect.
Pick your path, run the compose file, pull a model, and enjoy the surreal experience of a language model running on your own hardware — no API key, no usage limits, no one else reading your prompts.
The GPU acceleration ecosystem isn’t perfect, but it’s genuinely functional now. That’s more than could be said two years ago, and that’s worth something.