Local LLMs have finally escaped the “experimental” phase where you spend three hours just trying to link CUDA libraries. But with choice comes the inevitable tech-bro debate: which serving engine actually belongs in your stack?
We aren’t talking about “which is better”—we’re talking about which one won’t make you want to throw your workstation out the window. vLLM, llama.cpp, and Ollama all solve the same problem (turning weights into words), but they do it with vastly different philosophies.
Key Takeaways
-
vLLM: The throughput king. Use this if you’re serving multiple users and have VRAM to burn.
-
llama.cpp: The Swiss Army knife. Runs on anything (CPUs, Apple Silicon, old GPUs) using GGUF magic.
-
Ollama: The “it just works” option. Basically Docker for LLMs. Perfect for local dev and lazy Sundays. based on llama.cpp code at its base.
The Contenders: A Quick Reality Check
FeaturevLLMllama.cppOllamaPrimary GoalMax throughputPortability / CPUEase of Use / Local DevMemory TechPagedAttention (VRAM heavy)GGUF QuantizationBundled llama.cpp + APIBest HardwareNVIDIA / High-end AMDLiterally anything (even a Pi)Consumer GPUs / Mac M-seriesAPI StyleOpenAI-CompatibleNative + OpenAI wrapperCustom REST + OpenAI
1. Ollama: The One-Click Wonder
Ollama is the gateway drug to local AI. It abstracts away the complexity of model management. You don’t “download a GGUF file”; you just ollama pull llama3.1. It handles the orchestration and model unloading automatically.
The Catch: It’s a bit of a “black box.” If you need to tweak hyper-specific engine parameters for a production load, Ollama will eventually get in your way.
services: ollama: image: ollama/ollama:latest container_name: ollama ports: - "11434:11434" volumes: - ./ollama_data:/root/.ollama # Remove the 'deploy' block if you don't have an NVIDIA GPU deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] restart: unless-stoppedservices: ollama: image: ollama/ollama:latest container_name: ollama ports: - "11434:11434" volumes: - ./ollama_data:/root/.ollama # Remove the 'deploy' block if you don't have an NVIDIA GPU deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] restart: unless-stopped2. vLLM: For the Performance Junkies
If you are building an actual app with more than one concurrent user, vLLM is the gold standard. It uses PagedAttention, which manages KV cache memory like an OS manages virtual memory. This prevents memory fragmentation and allows for massive batch sizes.
The Catch: It is VRAM-hungry. It will pre-allocate almost all your VRAM by default to manage its cache. It also strictly requires a GPU (ideally NVIDIA).
services: vllm: image: vllm/vllm-openai:latest container_name: vllm-server command: > --model facebook/opt-125m --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.9 environment: - HF_TOKEN=${HF_TOKEN} # Only needed for gated models ports: - "8000:8000" volumes: - ~/.cache/huggingface:/root/.cache/huggingface ipc: host # Critical for high-performance memory sharing deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] restart: unless-stoppedservices: vllm: image: vllm/vllm-openai:latest container_name: vllm-server command: > --model facebook/opt-125m --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.9 environment: - HF_TOKEN=${HF_TOKEN} # Only needed for gated models ports: - "8000:8000" volumes: - ~/.cache/huggingface:/root/.cache/huggingface ipc: host # Critical for high-performance memory sharing deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] restart: unless-stopped3. llama.cpp: The Unstoppable Engine
The project that started it all. llama.cpp is written in pure C/C++ with zero dependencies. It’s the engine that powers Ollama, but running it raw gives you total control. It’s the only choice if you’re running on a Mac or a CPU-only server.
The Catch: You have to manage your own GGUF files. It’s a bit more “manual labor” than the others.
GPU Accelerated (NVIDIA)
Use this if you want to offload specific layers to your GPU. We’ve added --fa (Flash Attention) and --fit-on to ensure the engine scales to your available hardware automatically.
services: llama-server: image: ghcr.io/ggml-org/llama.cpp:server-cuda container_name: llama-cpp-gpu ports: - "8080:8080" volumes: - ./models:/models command: > -m /models/llama-3-8b-instruct.Q4_K_M.gguf --host 0.0.0.0 --port 8080 --fa --fit-on -ngl 33 deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] restart: unless-stoppedservices: llama-server: image: ghcr.io/ggml-org/llama.cpp:server-cuda container_name: llama-cpp-gpu ports: - "8080:8080" volumes: - ./models:/models command: > -m /models/llama-3-8b-instruct.Q4_K_M.gguf --host 0.0.0.0 --port 8080 --fa --fit-on -ngl 33 deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] restart: unless-stoppedCPU-Only (For MoE or Embedding Models)
Mixture of Experts (MoE) models can be massive. If you don’t have 80GB of VRAM, CPU offloading is your only path. This setup is also ideal for lightweight embedding models running on standard cloud instances.
services: llama-cpu: image: ghcr.io/ggml-org/llama.cpp:server-cpu container_name: llama-cpp-cpu ports: - "8081:8080" volumes: - ./models:/models command: > -m /models/mixtral-8x7b-v0.1.Q4_K_M.gguf --host 0.0.0.0 --port 8080 --threads 8 --fa --fit-on -ngl 0 restart: unless-stoppedservices: llama-cpu: image: ghcr.io/ggml-org/llama.cpp:server-cpu container_name: llama-cpp-cpu ports: - "8081:8080" volumes: - ./models:/models command: > -m /models/mixtral-8x7b-v0.1.Q4_K_M.gguf --host 0.0.0.0 --port 8080 --threads 8 --fa --fit-on -ngl 0 restart: unless-stoppedThe Verdict: Which one for you?
Stop over-engineering. If you just want to talk to a model on your laptop, Ollama is the play. It’s clean, the API is solid, and you don’t need to know what a “quantization level” is.
If you’re launching a SaaS and expect 50 people to hit your endpoint at once, you need vLLM. Anything else will choke on the concurrency.
And if you’re trying to squeeze a 70B model onto a Mac Studio or a rig with mixed RAM/VRAM? llama.cpp is the only friend you have.