SumGuy's Ramblings

LiteLLM & vLLM: One API to Rule All Your Models

Your Code Is Calling 17 Different SDKs and It Needs to Stop

Picture the scene: you started with the OpenAI SDK. Reasonable. Then someone mentioned Anthropic Claude was better for long documents, so you added that SDK. Then your boss wanted cost savings so you wired in Groq. Then you got into self-hosting and bolted on Ollama. Then a shiny new model dropped on Mistral’s API and honestly you just copy-pasted the HTTP calls at that point because you’d lost the will to live.

Your requirements.txt now looks like a hostage negotiation and every model has its own authentication scheme, its own response format, its own idea of what a “message” looks like. When the OpenAI API changes, you update one place. When Anthropic changes theirs — different place. Groq? Third place. Ollama? It does its own thing entirely.

There is a better way. Two better ways, actually — and they work great together.


What Is LiteLLM?

LiteLLM is a proxy server that puts a single OpenAI-compatible API endpoint in front of over 100 LLM providers. You point your code at http://localhost:4000 and speak plain OpenAI API. LiteLLM handles the translation to whatever backend you actually want to use — Anthropic, Groq, Mistral, Azure OpenAI, Bedrock, Cohere, Ollama, vLLM, or a hundred other things.

The value proposition is simple: your application code never changes when you swap models. You just update the LiteLLM config.

Beyond the basic proxy, LiteLLM also gives you:

- Routing and load balancing across backends (e.g. least-busy routing)
- Fallback chains, so an overloaded or failed backend escalates to another
- Spend tracking broken down by key, model, and time period
- Virtual API keys with per-key budgets
- A dashboard at /ui for all of the above

It is the thing you build once and stop thinking about.


What Is vLLM?

vLLM is a high-performance inference server for running open-weight models locally or on your own servers. If you have tried Ollama, you already understand the category — but vLLM approaches it differently.

Where Ollama optimizes for ease of use and runs well on consumer hardware with modest RAM, vLLM optimizes for throughput. It was built by researchers at UC Berkeley and introduced two key innovations:

PagedAttention treats the KV cache (the memory that stores context during inference) like virtual memory in an OS. Instead of pre-allocating a fixed block per request, it pages memory dynamically. This means far less waste and dramatically more concurrent requests on the same GPU.

Continuous batching means the server doesn’t wait to fill a batch before starting inference. New requests join in-flight batches mid-generation. The GPU is almost never idle.

The practical result: vLLM can serve 10-20x more requests per second than naive inference setups on the same hardware, at comparable latency per request.
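To make that contrast concrete, here is a toy discrete-time sketch — not vLLM's actual scheduler, just the core idea — where each step emits one token per active request. Static batching holds every slot until the slowest request in the batch finishes; continuous batching frees a slot the moment a request completes and refills it from the queue:

```python
# Toy simulation of static vs continuous batching. One "step" = one forward
# pass that generates one token for each active request. Illustrative only;
# real vLLM scheduling is far more sophisticated.
from collections import deque

def static_batching(token_counts, batch_size):
    """Fixed batches: every slot is held until the batch's slowest request ends."""
    steps = 0
    queue = deque(token_counts)
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(batch)  # whole batch waits on its longest generation
    return steps

def continuous_batching(token_counts, batch_size):
    """Finished requests leave mid-flight; waiting requests join immediately."""
    steps = 0
    queue = deque(token_counts)
    active = []
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())  # refill freed slots right away
        steps += 1
        active = [t - 1 for t in active if t > 1]  # drop finished requests
    return steps

# Mixed workload: a couple of long generations among many short ones.
work = [5, 200, 5, 5, 200, 5, 5, 5]
print(static_batching(work, batch_size=4))      # 400 -- long requests stall whole batches
print(continuous_batching(work, batch_size=4))  # 205 -- freed slots stay busy
```

The gap widens as workloads get more mixed — exactly the situation a multi-user server faces.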

vLLM exposes an OpenAI-compatible API out of the box, which is exactly why it plays so well with LiteLLM.


Docker Compose: vLLM

Here is a minimal vLLM setup. You will need a GPU with CUDA support and enough VRAM for your chosen model. For a 7B model quantized to 4-bit, 8GB of VRAM is workable — though note the unquantized 7B in the example below wants closer to 16GB plus KV-cache headroom. For a 70B model, you are looking at 40GB+.

# docker-compose.vllm.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.2
      --max-model-len 8192
      --gpu-memory-utilization 0.90
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Start it with docker compose -f docker-compose.vllm.yml up -d. After the model downloads (first run only), you have an OpenAI-compatible server at http://localhost:8000. You can test it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello from vLLM"}]
  }'

Docker Compose: LiteLLM

LiteLLM needs a config file that tells it about your backends:

# litellm-config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: mistral-local
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.2
      api_base: http://vllm:8000/v1
      api_key: fake-key  # vLLM doesn't check this

  - model_name: fast-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama:11434

router_settings:
  routing_strategy: least-busy

litellm_settings:
  success_callback: []
  set_verbose: false

And the Compose file:

# docker-compose.litellm.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: --config /app/config.yaml --port 4000

Now your application talks to http://localhost:4000 using any OpenAI SDK. Switch models by changing the model field in your request — not your code.


The Full Stack: LiteLLM in Front of vLLM

This is where it gets good. Run them together:

# docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.2
      --max-model-len 8192
      --gpu-memory-utilization 0.90
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=sk-my-homelab-key
    command: --config /app/config.yaml --port 4000
    depends_on:
      - vllm

Your application uses http://litellm:4000 (or localhost:4000 from the host). It requests mistral-local for cheap tasks, falls back to gpt-4o for anything that really needs it, and you never touch application code again. Add a new model to the config, reload LiteLLM, done.

You can also set up fallback chains in the config so that if your local vLLM instance is overloaded or down, requests automatically escalate to a cloud provider. Your uptime stops depending on your homelab’s mood.
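A sketch of what that can look like in litellm-config.yaml — model names match the config above; check LiteLLM's router documentation for the exact keys your version supports:

```yaml
# litellm-config.yaml (sketch): if mistral-local errors or times out,
# retry, then escalate the request to gpt-4o in the cloud.
litellm_settings:
  num_retries: 2
  request_timeout: 30
  fallbacks:
    - mistral-local: ["gpt-4o"]
```

The client never sees any of this — it asked for mistral-local and got an answer.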


LiteLLM vs Ollama: When to Use What

This comes up constantly, so let’s be direct.

Ollama is a fantastic local model runner. It handles model downloads, GGUF quantization, and basic serving with essentially zero configuration. If you want to run a local model on a laptop for personal use, Ollama is the right answer. It runs on CPU, handles Apple Silicon well, and the CLI is genuinely pleasant.

vLLM is for when you care about serving multiple users or high-throughput workloads. If you are running a service that multiple people or processes hit simultaneously, vLLM’s continuous batching and PagedAttention will smoke Ollama on requests-per-second. It requires CUDA (NVIDIA GPU), is more complex to configure, and is somewhat overkill for solo use.

LiteLLM is not a model runner — it is a router and proxy. You can put it in front of Ollama, vLLM, cloud APIs, or any combination. The question is not “LiteLLM vs Ollama” — it is “do I want a unified API layer over my model backends?” For any project that might grow or touch more than one model, the answer is yes.

The practical decision tree:

- Running a model for yourself on a laptop (CPU or Apple Silicon): Ollama.
- Serving multiple users or high-throughput workloads on an NVIDIA GPU: vLLM.
- Touching more than one model or backend, now or ever: put LiteLLM in front of either.

Spend Tracking and Virtual Keys

One underrated LiteLLM feature: virtual API keys with budgets. You can create per-user or per-application keys through the LiteLLM API and assign monthly spend limits. When a key hits its limit, requests are rejected. This is genuinely useful if you are sharing API access with teammates or running multiple projects under one billing account.

# Create a virtual key with a $10/month budget
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-my-homelab-key" \
  -H "Content-Type: application/json" \
  -d '{"max_budget": 10, "budget_duration": "1mo", "metadata": {"project": "my-chatbot"}}'
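The response contains a generated key (the value below is illustrative). Hand that out instead of your real provider keys — it authenticates like any other bearer token, and the proxy rejects requests once the budget is spent:

```shell
# Spend from the virtual key's budget; the key value here is illustrative.
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-generated-virtual-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-local", "messages": [{"role": "user", "content": "hi"}]}'
```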

The dashboard at http://localhost:4000/ui gives you a spend breakdown by key, model, and time period. For self-hosters with cloud API costs, this hits different.


The Payoff

Set this up once. Commit the config to your repo. Now every project in your homelab talks to one endpoint. When a better model drops, you add it to the config. When your GPU is busy, fallbacks send requests to the cloud. When someone asks what you spent on AI last month, you actually know.

Your codebase drops from 17 SDKs to one. Your requirements.txt exhales. Your future self — six months from now when you are adding the eighth model and changing nothing in your application code — will owe you a coffee.

That is the art of wasting time efficiently.

