Local LLMs have finally escaped the “experimental” phase where you spend three hours just trying to link CUDA libraries. But with choice comes the inevitable tech-bro debate: which serving engine actually belongs in your stack?
We aren’t talking about “which is better”—we’re talking about which one won’t make you want to throw your workstation out the window. vLLM, llama.cpp, and Ollama all solve the same problem (turning weights into words), but they do it with vastly different philosophies.
Key Takeaways
- vLLM: The throughput king. Use this if you're serving multiple users and have VRAM to burn.
- llama.cpp: The Swiss Army knife. Runs on anything (CPUs, Apple Silicon, old GPUs) using GGUF magic.
- Ollama: The "it just works" option. Basically Docker for LLMs. Perfect for local dev and lazy Sundays. Built on llama.cpp under the hood.
The Contenders: A Quick Reality Check
| Feature | vLLM | llama.cpp | Ollama |
|---|---|---|---|
| Primary Goal | Max throughput | Portability / CPU | Ease of Use / Local Dev |
| Memory Tech | PagedAttention (VRAM heavy) | GGUF Quantization | Bundled llama.cpp + API |
| Best Hardware | NVIDIA / High-end AMD | Literally anything (even a Pi) | Consumer GPUs / Mac M-series |
| API Style | OpenAI-Compatible | Native + OpenAI wrapper | Custom REST + OpenAI |
1. Ollama: The One-Click Wonder
Ollama is the gateway drug to local AI. It abstracts away the complexity of model management. You don’t “download a GGUF file”; you just ollama pull llama3.1. It handles the orchestration and model unloading automatically.
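If Docker feels like overkill, the bare-metal workflow really is two commands (llama3.1 is just an example tag; swap in whatever model you want):

```bash
# Pull the model; Ollama grabs a quantized build automatically
ollama pull llama3.1

# Chat in the terminal; the local API server listens on port 11434
ollama run llama3.1
```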
The Catch: It’s a bit of a “black box.” If you need to tweak hyper-specific engine parameters for a production load, Ollama will eventually get in your way.
```yaml
# docker-compose.ollama.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ./ollama_data:/root/.ollama
    # Remove the 'deploy' block if you don't have an NVIDIA GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
```
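Once the container is up, a quick smoke test against Ollama's native REST endpoint confirms everything works (same example model as above):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```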
2. vLLM: For the Performance Junkies
If you are building an actual app with more than one concurrent user, vLLM is the gold standard. It uses PagedAttention, which manages KV cache memory like an OS manages virtual memory. This prevents memory fragmentation and allows for massive batch sizes.
The Catch: It is VRAM-hungry. By default it pre-allocates almost all of your VRAM to manage its cache, and for all practical purposes it wants a GPU (ideally NVIDIA).
```yaml
# docker-compose.vllm.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    command: >
      --model facebook/opt-125m
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.9
    environment:
      - HF_TOKEN=${HF_TOKEN} # Only needed for gated models
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host # Critical for high-performance memory sharing
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
```
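Because the server speaks the OpenAI protocol, any OpenAI client works out of the box; just point the base URL at your machine. A minimal curl check using the same model as the compose file (opt-125m is a base model, so we hit the completions endpoint rather than chat):

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 32
  }'
```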
3. llama.cpp: The Unstoppable Engine
The project that started it all. llama.cpp is written in pure C/C++ with zero dependencies. It's the engine underneath Ollama, but running it raw gives you total control, and it's the natural pick when you want that control on a Mac or a CPU-only server.
The Catch: You have to manage your own GGUF files. It’s a bit more “manual labor” than the others.
GPU Accelerated (NVIDIA)
Use this if you want to offload specific layers to your GPU. We’ve added --fa (Flash Attention) and --fit-on to ensure the engine scales to your available hardware automatically.
```yaml
# docker-compose.llamacpp-gpu.yml
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: llama-cpp-gpu
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    command: >
      -m /models/llama-3-8b-instruct.Q4_K_M.gguf
      --host 0.0.0.0
      --port 8080
      --fa
      --fit-on
      -ngl 33
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
```
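llama-server ships with both a native /completion endpoint and an OpenAI-compatible one, plus a health probe. Two quick checks once the container is running:

```bash
# Returns {"status":"ok"} once the model has finished loading
curl http://localhost:8080/health

# OpenAI-style chat call; the loaded GGUF is used, so no model field is needed
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```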
CPU-Only (For MoE or Embedding Models)
Mixture of Experts (MoE) models can be massive. If you don't have 80GB of VRAM, running them from system RAM on the CPU is your only path. This setup is also ideal for lightweight embedding models running on standard cloud instances.
```yaml
# docker-compose.llamacpp-cpu.yml
services:
  llama-cpu:
    # The plain 'server' tag is the CPU-only build (there is no 'server-cpu' tag)
    image: ghcr.io/ggml-org/llama.cpp:server
    container_name: llama-cpp-cpu
    ports:
      - "8081:8080"
    volumes:
      - ./models:/models
    command: >
      -m /models/mixtral-8x7b-v0.1.Q4_K_M.gguf
      --host 0.0.0.0
      --port 8080
      --threads 8
      --fa
      --fit-on
      -ngl 0
    restart: unless-stopped
```
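"Manage your own GGUF files" mostly means downloading them into ./models yourself. One way to do it, assuming you have the huggingface_hub CLI installed (the repo and filename below match the compose example, but check which quant actually fits your RAM):

```bash
pip install -U huggingface_hub

# Download a single quantized file into ./models
huggingface-cli download TheBloke/Mixtral-8x7B-v0.1-GGUF \
  mixtral-8x7b-v0.1.Q4_K_M.gguf \
  --local-dir ./models
```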
The Verdict: Which One Is for You?
Stop over-engineering. If you just want to talk to a model on your laptop, Ollama is the play. It’s clean, the API is solid, and you don’t need to know what a “quantization level” is.
If you’re launching a SaaS and expect 50 people to hit your endpoint at once, you need vLLM. Anything else will choke on the concurrency.
And if you’re trying to squeeze a 70B model onto a Mac Studio or a rig with mixed RAM/VRAM? llama.cpp is the only friend you have.