LLM Backends: vLLM vs llama.cpp vs Ollama

By SumGuy 5 min read

Local LLMs have finally escaped the “experimental” phase where you spend three hours just trying to link CUDA libraries. But with choice comes the inevitable tech-bro debate: which serving engine actually belongs in your stack?

We aren’t talking about “which is better”—we’re talking about which one won’t make you want to throw your workstation out the window. vLLM, llama.cpp, and Ollama all solve the same problem (turning weights into words), but they do it with vastly different philosophies.

The Contenders: A Quick Reality Check

| Feature | vLLM | llama.cpp | Ollama |
|---|---|---|---|
| Primary Goal | Max throughput | Portability / CPU | Ease of Use / Local Dev |
| Memory Tech | PagedAttention (VRAM heavy) | GGUF Quantization | Bundled llama.cpp + API |
| Best Hardware | NVIDIA / High-end AMD | Literally anything (even a Pi) | Consumer GPUs / Mac M-series |
| API Style | OpenAI-Compatible | Native + OpenAI wrapper | Custom REST + OpenAI |

1. Ollama: The One-Click Wonder

Ollama is the gateway drug to local AI. It abstracts away the complexity of model management. You don’t “download a GGUF file”; you just ollama pull llama3.1. It handles the orchestration and model unloading automatically.

The Catch: It’s a bit of a “black box.” If you need to tweak hyper-specific engine parameters for a production load, Ollama will eventually get in your way.
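To give a feel for how little ceremony is involved, here is a minimal sketch of a client for Ollama's native REST API. It assumes the server is listening on the default port 11434 on localhost and that `llama3.1` has already been pulled; the helper names `build_generate_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default port on this machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("llama3.1", "Explain GGUF in one sentence.")` should return the model's reply as a plain string. Setting `"stream": False` keeps the sketch simple; the default behavior streams one JSON object per chunk.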

docker-compose.ollama.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ./ollama_data:/root/.ollama
    # Remove the 'deploy' block if you don't have an NVIDIA GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

2. vLLM: For the Performance Junkies

If you are building an actual app with more than one concurrent user, vLLM is the gold standard. It uses PagedAttention, which manages KV cache memory like an OS manages virtual memory. This prevents memory fragmentation and allows for massive batch sizes.
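The virtual-memory analogy can be made concrete with a toy allocator. This is only an illustration of the idea, not vLLM's actual implementation: the class name `PagedKVCache`, the block size of 16 tokens, and the pool of 8 blocks are all assumptions chosen for the example.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class PagedKVCache:
    """Toy allocator: each sequence maps logical blocks to physical blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("user-a")  # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("user-b")  # 5 tokens need 1 block
print(len(cache.block_tables["user-a"]), len(cache.free_blocks))  # 2 5
```

Because blocks are handed out on demand and do not need to be contiguous, a sequence wastes at most one partially filled block, and freed blocks are immediately reusable by any other request.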

The Catch: It is VRAM-hungry. By default it pre-allocates about 90% of your VRAM to hold the weights and its KV cache. It is also built for GPUs first (ideally NVIDIA); CPU backends exist, but they are not where vLLM earns its reputation.
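A back-of-the-envelope calculation shows where that VRAM goes. The sketch below assumes a Llama-3-8B-shaped model (32 layers, 8 KV heads, head dimension 128) in fp16 on a hypothetical 24 GiB card with roughly 15 GiB of weights. It ignores activations and runtime overhead, so treat the result as an upper bound rather than what vLLM will actually report.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_budget(vram_gib: float, utilization: float, weights_gib: float, per_token: int) -> int:
    """Tokens of KV cache that fit after the weights, ignoring other overhead."""
    budget_bytes = (vram_gib * utilization - weights_gib) * 1024**3
    return max(0, int(budget_bytes // per_token))


# Illustrative inputs only; plug in your own model shape and card size.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token
print(kv_token_budget(vram_gib=24, utilization=0.9, weights_gib=15, per_token=per_token))
```

The `utilization` argument plays the role of a `--gpu-memory-utilization` style fraction: lowering it leaves room for other processes on the GPU at the cost of fewer cached tokens, which means fewer or shorter concurrent requests.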

docker-compose.vllm.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    command: >
      --model facebook/opt-125m
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.9
    environment:
      - HF_TOKEN=${HF_TOKEN} # Only needed for gated models
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host # Critical for high-performance memory sharing
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

3. llama.cpp: The Unstoppable Engine

The project that started it all. llama.cpp is written in plain C/C++ with no mandatory external dependencies. It’s the engine that powers Ollama, but running it raw gives you total control. It’s the natural choice if you’re on a Mac or a CPU-only server and want to tune every knob yourself.

The Catch: You have to manage your own GGUF files. It’s a bit more “manual labor” than the others.
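Part of that manual labor is knowing whether a downloaded file is actually a usable GGUF. A small sanity check, sketched below, reads the fixed-size header. It assumes a little-endian GGUF file of version 2 or later, where the magic bytes `GGUF` are followed by a 32-bit version and two 64-bit counts; the function name `read_gguf_header` is hypothetical.

```python
import struct
from pathlib import Path


def read_gguf_header(path: Path) -> tuple[int, int, int]:
    """Return (version, tensor_count, metadata_kv_count) from a GGUF header."""
    with path.open("rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

Running it over every `*.gguf` file in your `./models` directory before starting the server catches truncated downloads and HTML error pages saved under a `.gguf` name, both of which fail the magic-byte check.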

GPU Accelerated (NVIDIA)

Use this if you want to offload specific layers to your GPU. -ngl 33 puts 33 layers on the GPU, --flash-attn on enables Flash Attention, and --fit on lets the server adjust any arguments you have not set so the model fits in device memory.

docker-compose.llamacpp-gpu.yml
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: llama-cpp-gpu
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    # Flag spellings follow recent llama-server builds; confirm with `llama-server --help`
    command: >
      -m /models/llama-3-8b-instruct.Q4_K_M.gguf
      --host 0.0.0.0
      --port 8080
      --flash-attn on
      --fit on
      -ngl 33
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

CPU-Only (For MoE or Embedding Models)

Mixture of Experts (MoE) models can be massive. If you don’t have 80GB of VRAM, keeping the weights in system RAM and running on the CPU is the practical path. This setup is also ideal for lightweight embedding models running on standard cloud instances.

docker-compose.llamacpp-cpu.yml
services:
  llama-cpu:
    image: ghcr.io/ggml-org/llama.cpp:server-cpu
    container_name: llama-cpp-cpu
    ports:
      - "8081:8080"
    volumes:
      - ./models:/models
    # Flag spellings follow recent llama-server builds; confirm with `llama-server --help`
    command: >
      -m /models/mixtral-8x7b-v0.1.Q4_K_M.gguf
      --host 0.0.0.0
      --port 8080
      --threads 8
      --flash-attn on
      --fit on
      -ngl 0
    restart: unless-stopped

The Verdict: Which one for you?

Stop over-engineering. If you just want to talk to a model on your laptop, Ollama is the play. It’s clean, the API is solid, and you don’t need to know what a “quantization level” is.

If you’re launching a SaaS and expect 50 people to hit your endpoint at once, you need vLLM. The other two can serve a few requests in parallel, but they were not built for that kind of load.

And if you’re trying to squeeze a 70B model onto a Mac Studio or a rig with mixed RAM/VRAM? llama.cpp is the only friend you have.
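Whichever way you go, you are not locked in: all three expose an OpenAI-style endpoint, so switching backends is mostly a matter of changing the base URL. The sketch below assumes the host ports published by the compose files in this post (11434, 8000, 8080) and uses the plain `/v1/completions` route, which also works for base models without a chat template. The `BACKENDS` table and helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: host ports as published by the compose files in this post.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
    "llama.cpp": "http://localhost:8080/v1",
}


def build_completion_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/completions request for the chosen backend."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 64}
    return urllib.request.Request(
        f"{BACKENDS[backend]}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(backend: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(backend, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

For example, `complete("vllm", "facebook/opt-125m", "Local LLMs are")` targets the vLLM container, while `complete("ollama", "llama3.1", "Local LLMs are")` sends the same payload shape to Ollama. Only the model name and base URL change.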

