SumGuy's Ramblings

LLM Backends: vLLM vs llama.cpp vs Ollama

Local LLMs have finally escaped the “experimental” phase where you spend three hours just trying to link CUDA libraries. But with choice comes the inevitable tech-bro debate: which serving engine actually belongs in your stack?

We aren’t talking about “which is better”—we’re talking about which one won’t make you want to throw your workstation out the window. vLLM, llama.cpp, and Ollama all solve the same problem (turning weights into words), but they do it with vastly different philosophies.

The Contenders: A Quick Reality Check

| Feature | vLLM | llama.cpp | Ollama |
|---|---|---|---|
| Primary Goal | Max throughput | Portability / CPU | Ease of Use / Local Dev |
| Memory Tech | PagedAttention (VRAM heavy) | GGUF Quantization | Bundled llama.cpp + API |
| Best Hardware | NVIDIA / High-end AMD | Literally anything (even a Pi) | Consumer GPUs / Mac M-series |
| API Style | OpenAI-Compatible | Native + OpenAI wrapper | Custom REST + OpenAI |

1. Ollama: The One-Click Wonder

Ollama is the gateway drug to local AI. It abstracts away the complexity of model management. You don’t “download a GGUF file”; you just ollama pull llama3.1. It handles the orchestration and model unloading automatically.

The Catch: It’s a bit of a “black box.” If you need to tweak hyper-specific engine parameters for a production load, Ollama will eventually get in your way.
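Talking to it from code is a single POST against the native REST API. A minimal sketch in Python (the model name and prompt are just examples, and the actual network call is commented out so the snippet stands on its own):

```python
import json

# Ollama's native endpoint is POST /api/generate on port 11434.
# Model name is an example -- use whatever you've pulled.
payload = {
    "model": "llama3.1",
    "prompt": "Why is the sky blue?",
    "stream": False,  # one JSON object back instead of a token stream
}
body = json.dumps(payload).encode()

# With a server running, the call would look like:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=body, headers={"Content-Type": "application/json"})
# print(json.loads(urllib.request.urlopen(req).read())["response"])
print(body.decode())
```

Ollama also exposes an OpenAI-compatible endpoint at /v1, so existing OpenAI client code can usually be pointed at it unchanged.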

# docker-compose.ollama.yml

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ./ollama_data:/root/.ollama
    # Remove the 'deploy' block if you don't have an NVIDIA GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

2. vLLM: For the Performance Junkies

If you are building an actual app with more than one concurrent user, vLLM is the gold standard. It uses PagedAttention, which manages KV cache memory like an OS manages virtual memory. This prevents memory fragmentation and allows for massive batch sizes.
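A toy illustration of the idea (a simplification, not vLLM's actual implementation): each sequence gets a block table mapping its tokens to fixed-size physical blocks, so a growing sequence grabs blocks on demand instead of reserving contiguous memory up to the maximum length.

```python
# Toy sketch of PagedAttention-style KV-cache paging.
BLOCK_SIZE = 16  # tokens per physical block (vLLM's default block size is 16)

free_blocks = list(range(64))   # pool of physical block IDs
block_tables = {}               # seq_id -> list of physical block IDs

def append_tokens(seq_id, n_tokens):
    """Grow a sequence to n_tokens, grabbing new blocks only when needed."""
    table = block_tables.setdefault(seq_id, [])
    have = len(table) * BLOCK_SIZE
    while have < n_tokens:
        table.append(free_blocks.pop())  # any free block works -- no contiguity needed
        have += BLOCK_SIZE

append_tokens("seq_a", 20)   # 20 tokens -> 2 blocks
append_tokens("seq_b", 5)    # 5 tokens  -> 1 block
append_tokens("seq_a", 40)   # grow to 40 tokens -> 3 blocks total
print(block_tables)
```

Because blocks are uniform and position-independent, finished sequences return their blocks to the pool with zero fragmentation, which is what lets vLLM pack huge batches into the same VRAM.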

The Catch: It is VRAM-hungry. By default it pre-allocates about 90% of your VRAM to manage its cache (tunable via --gpu-memory-utilization). It also effectively requires a GPU (ideally NVIDIA).
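Rough numbers make the budget concrete (the card size and model below are hypothetical, for illustration only):

```python
# Back-of-the-envelope VRAM budget for vLLM on a hypothetical 24 GB card.
gpu_vram_gb = 24.0
utilization = 0.9      # --gpu-memory-utilization
weights_gb = 14.0      # e.g. a 7B model in fp16 (~2 bytes per parameter)

reserved_gb = gpu_vram_gb * utilization
kv_cache_gb = reserved_gb - weights_gb   # what's left feeds the KV cache
print(f"reserved={reserved_gb:.1f} GB, kv_cache~{kv_cache_gb:.1f} GB")
```

Everything left after the weights becomes KV cache, which directly caps how many concurrent requests fit in a batch.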

# docker-compose.vllm.yml

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    command: >
      --model facebook/opt-125m 
      --host 0.0.0.0 
      --port 8000
      --gpu-memory-utilization 0.9
    environment:
      - HF_TOKEN=${HF_TOKEN} # Only needed for gated models
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host # Critical for high-performance memory sharing
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

3. llama.cpp: The Unstoppable Engine

The project that started it all. llama.cpp is written in plain C/C++ with minimal dependencies. It’s the engine that powers Ollama, but running it raw gives you total control, and it’s the natural choice if you’re running on a Mac or a CPU-only server.

The Catch: You have to manage your own GGUF files. It’s a bit more “manual labor” than the others.

GPU Accelerated (NVIDIA)

Use this if you want to offload model layers to your GPU. We’ve added --fa to enable Flash Attention, and -ngl 33 pushes all of an 8B model’s layers onto the GPU; lower that number if you run out of VRAM.

# docker-compose.llamacpp-gpu.yml

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: llama-cpp-gpu
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    command: >
      -m /models/llama-3-8b-instruct.Q4_K_M.gguf
      --host 0.0.0.0
      --port 8080
      --fa
      -ngl 33
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

CPU-Only (For MoE or Embedding Models)

Mixture of Experts (MoE) models can be massive. If you don’t have 80GB of VRAM, CPU offloading is your only path. This setup is also ideal for lightweight embedding models running on standard cloud instances.
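A quick back-of-the-envelope shows why (the parameter count and effective bit-width below are approximate):

```python
# Rough size of a quantized MoE model on disk / in memory.
params_b = 46.7          # Mixtral 8x7B total parameters, in billions (approx.)
bits_per_weight = 4.5    # roughly what Q4_K_M averages out to

size_gb = params_b * bits_per_weight / 8
print(f"~{size_gb:.0f} GB of weights")
```

Around 26 GB of weights before you even count the KV cache, which is why system RAM on a CPU box (or a Mac's unified memory) is often the only place it fits.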

# docker-compose.llamacpp-cpu.yml

services:
  llama-cpu:
    image: ghcr.io/ggml-org/llama.cpp:server
    container_name: llama-cpp-cpu
    ports:
      - "8081:8080"
    volumes:
      - ./models:/models
    command: >
      -m /models/mixtral-8x7b-v0.1.Q4_K_M.gguf
      --host 0.0.0.0
      --port 8080
      --threads 8
      -ngl 0
    restart: unless-stopped

The Verdict: Which one for you?

Stop over-engineering. If you just want to talk to a model on your laptop, Ollama is the play. It’s clean, the API is solid, and you don’t need to know what a “quantization level” is.

If you’re launching a SaaS and expect 50 people to hit your endpoint at once, you need vLLM. Anything else will choke on the concurrency.

And if you’re trying to squeeze a 70B model onto a Mac Studio or a rig with mixed RAM/VRAM? llama.cpp is the only friend you have.

