The AMD Problem (That Might Be Solved)
You’re an AMD GPU owner. You spent good money on that RX 7900 XT or 7900 GRE because the specs looked solid. Then you tried running Ollama for local LLM inference and… it works, but it’s not pretty. ROCm has always been the awkward sibling to CUDA. It works eventually, but not before you’ve debugged kernel compatibility issues, hunted for a Docker image that actually matches your card, and watched your token throughput lag 30–40% behind your Nvidia-owning friends.
Then there’s the NPU problem. If you’ve got a Ryzen AI laptop (7840U, 7940HS, etc.), you’ve got a dedicated Neural Processing Unit sitting there doing basically nothing. AMD spent the R&D budget to put a 16-core NPU on die, and the software ecosystem has largely shrugged.
Enter Lemonade: AMD’s official open-source local LLM inference server. When life gives you AMD, make Lemonade.
What Lemonade Actually Is
Lemonade is AMD’s answer to Ollama, llama.cpp, and LocalAI. It’s an inference server built specifically to exploit AMD hardware: discrete GPUs (via ROCm) and Ryzen AI NPUs (via XDNA). The magic part: it’s actually fast, and the API is OpenAI-compatible, so you can slot it into any existing toolchain that expects ChatGPT or OpenAI’s endpoints.
The server handles quantization, model loading, KV cache management, and batched inference out of the box. You pull a model, the server spins up, and you hit /v1/chat/completions like you would with OpenAI’s API. That’s it.
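Because the API mirrors OpenAI’s, the request and response bodies follow the standard chat-completions shapes. A minimal sketch of the wire format, no server required (the response dict below is a hand-written sample in the standard shape, not real Lemonade output):

```python
import json

# A chat completion request body in the standard OpenAI shape.
request_body = {
    "model": "llama2:7b",
    "messages": [
        {"role": "user", "content": "Explain containers like I am five."}
    ],
    "temperature": 0.7,
}

# A response in the standard OpenAI shape (hand-written sample, not real output).
sample_response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant",
                        "content": "Containers are like lunchboxes for programs."},
            "finish_reason": "stop",
        }
    ],
}

# Pulling the text out is the same one-liner you'd use against OpenAI.
reply = sample_response["choices"][0]["message"]["content"]
print(json.dumps(request_body, indent=2))
print(reply)
```

If those two dict shapes look familiar, that’s the point: every OpenAI client library already speaks this format.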
Why the NPU Part Matters
Discrete AMD GPUs (RX 7000 series, MI-series) are solid, but they’re not in everyone’s hands. Ryzen AI NPUs, though? They’re shipping in laptops right now. The 16-core XDNA unit in a Ryzen AI processor can run small-to-medium models with surprising efficiency, and crucially, at a fraction of a GPU’s power draw.
Lemonade’s NPU support means you can run a 7B model on your laptop’s NPU while your GPU handles gaming or video work. Or, on a Ryzen AI desktop chip (if you’ve got one), you’re looking at a dedicated coprocessor for inference that doesn’t compete for memory bandwidth with your main workloads.
Installation
```shell
pip install amd-lemonade
```

Or, if you prefer Docker:
```shell
docker pull amd/lemonade:latest
```

That’s genuinely it. No ROCm versioning drama. No “which docker image do I need?” Lemonade handles the AMD driver discovery at runtime.
Getting Started: Pull a Model and Serve It
```shell
lemonade pull llama2:7b
lemonade serve llama2:7b --port 8000
```

The server starts on localhost:8000. Now your OpenAI-compatible endpoint is live.
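The server needs a moment to load weights before it answers. If you’re scripting against it, a small stdlib-only poller avoids racing the startup (the URL and port are the ones used above; nothing here is Lemonade-specific):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it answers or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=interval)
            return True
        except urllib.error.HTTPError:
            # Server is up; it just didn't like the bare GET. Good enough.
            return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)
    return False

# Example: wait for the server started above before sending requests.
# ready = wait_for_server("http://localhost:8000/v1/models", timeout=120)
```

Then the curl and Python examples below can run without a “connection refused” on the first try.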
Test It With curl
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "messages": [{"role": "user", "content": "Explain containers like I am five."}],
    "temperature": 0.7
  }'
```

Response looks exactly like OpenAI’s API. Drop-in replacement energy.
Use It From Python
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-local",  # Lemonade doesn't enforce API keys
)

response = client.chat.completions.create(
    model="llama2:7b",
    messages=[
        {"role": "system", "content": "You are a helpful DevOps expert."},
        {"role": "user", "content": "Why would I use containers instead of VMs?"},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)
```

Same client library, same API, just pointing at your local machine.
Docker Compose for a Persistent Server
```yaml
version: '3.8'

services:
  lemonade:
    image: amd/lemonade:latest
    ports:
      - "8000:8000"
    environment:
      - ROCM_HOME=/opt/rocm
      - HSA_OVERRIDE_GFX_VERSION=11.0.0  # numeric form; match to your GPU architecture
    volumes:
      - lemonade-models:/root/.lemonade/models
    restart: unless-stopped
    command: serve llama2:7b --port 8000

volumes:
  lemonade-models:
```

Spin it up with docker compose up -d. Your inference server persists across restarts, and models are cached in a named volume.
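If you want Compose to notice when inference stops responding, you can add a healthcheck under the service. The /v1/models path is part of the standard OpenAI API surface, so it should be a cheap probe, but treat the exact endpoint as an assumption about Lemonade (and this assumes curl exists inside the image); swap in whatever health endpoint your version actually exposes:

```yaml
    # Goes under services.lemonade in the compose file above.
    # Assumes /v1/models is served and curl is available in the image.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/models"]
      interval: 30s
      timeout: 5s
      retries: 3
```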
Performance: What to Expect
On an RX 7900 XT: ~60–80 tokens/sec for a 7B model. On an RX 7600: ~20–30 tokens/sec (still usable).
On a Ryzen AI NPU: ~10–20 tokens/sec for a 3.8B model (efficient, fits in the power budget of a laptop). CPU fallback gets you ~2–5 tokens/sec (viable for small models, terrible for anything else).
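To make those numbers concrete: generation time for a response is roughly output tokens divided by tokens per second (ignoring prompt processing, which adds a bit up front). A back-of-the-envelope calculator using midpoints of the rates quoted above:

```python
def response_time_s(output_tokens: int, tokens_per_sec: float) -> float:
    """Rough generation time in seconds, ignoring prompt processing and startup."""
    return output_tokens / tokens_per_sec

# A 512-token answer at roughly the midpoint of each range above:
rates = [("RX 7900 XT", 70), ("RX 7600", 25), ("Ryzen AI NPU", 15), ("CPU fallback", 3.5)]
for label, tps in rates:
    print(f"{label}: ~{response_time_s(512, tps):.0f} s")
```

A 512-token answer lands in about 7 seconds on the 7900 XT but closer to two and a half minutes on CPU fallback, which is why “viable for small models, terrible for anything else” is the right summary.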
For comparison, llama.cpp on the same hardware runs ~5–10% faster on GPU but requires more setup. Ollama is comparable but has historically been more CUDA-optimized.
Lemonade vs. Ollama vs. llama.cpp (AMD Edition)
Lemonade: Official AMD support, NPU-aware, clean API, fast iteration. Best if you want the “blessed” AMD experience.
Ollama: More mature, broader model library, bigger community. Works on AMD but ROCm support feels like an afterthought.
llama.cpp: Bleeding-edge quantization, maximum fine-tuning, slightly faster. Heavier learning curve; you’re managing everything yourself.
Pick Lemonade if you’ve got AMD hardware and want something that just works. Pick Ollama if you need maximum compatibility across models. Pick llama.cpp if you enjoy suffering.
Models That Work Well
Lemonade officially supports the usual suspects: Llama 2/3/3.1, Gemma, Mistral, Mixtral, Qwen, and Phi. Quantized versions (Q4, Q5) work great. Avoid the 70B models unless you’ve got an MI300X (AMD’s data center GPU). The mid-range stuff (7B, 13B, occasionally 34B) is where Lemonade shines.
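A rough way to sanity-check whether a quantized model fits your VRAM: weights take about (parameters × bits per weight) / 8 bytes, plus extra for the KV cache and activations. The bits-per-weight figures below are ballpark effective rates for common llama.cpp-style quant formats, not Lemonade-specific numbers:

```python
# Ballpark effective bits per weight (assumption: typical llama.cpp-style
# quant formats; exact figures vary by quant variant).
BITS_PER_WEIGHT = {"q4": 4.5, "q5": 5.5, "q8": 8.5, "fp16": 16.0}

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); KV cache is extra."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(f"7B  @ Q4: ~{weight_gb(7, 'q4'):.1f} GB")   # fits a mid-range card
print(f"13B @ Q5: ~{weight_gb(13, 'q5'):.1f} GB")  # wants a 16 GB+ card
print(f"70B @ Q4: ~{weight_gb(70, 'q4'):.1f} GB")  # data-center territory
```

Roughly 39 GB of weights for a 70B Q4 model, before the KV cache, is exactly why the 70B class is MI300X territory rather than something for an RX 7000 card.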
The Upshot
If you’re an AMD user who’s felt left out while Nvidia users casually deployed local LLMs, Lemonade is the first time in years that AMD has felt genuinely competitive here. It’s not going to beat a 4090, but it doesn’t need to. It’s fast enough for real work, the NPU support is legitimately useful, and the OpenAI-compatible API means you’re not learning new tooling.
When life gives you AMD, stop buying lemons. Start with Lemonade.