The AMD Problem (That Might Be Solved)
You’re an AMD GPU owner. You spent good money on that RX 7900 XT or 7900 GRE because the specs looked solid. Then you tried running Ollama for local LLM inference and… it works, but it’s not pretty. ROCm has always been the awkward sibling to CUDA. It works eventually, but not before you’ve debugged kernel compatibility issues, hunted for a Docker image that actually matches your card, and watched your token throughput lag 30–40% behind your Nvidia-owning friends.
Then there’s the NPU problem. If you’ve got a Ryzen AI laptop (7840U, 7940HS, etc.), you’ve got a dedicated Neural Processing Unit sitting there doing basically nothing. AMD spent the R&D budget to put a 16-core NPU on die, and the software ecosystem has largely shrugged.
Enter Lemonade: AMD’s official open-source local LLM inference server. When life gives you AMD, make Lemonade.
What Lemonade Actually Is
Lemonade is AMD’s answer to Ollama, llama.cpp, and LocalAI. It’s an inference server built specifically to exploit AMD hardware: discrete GPUs (via ROCm) and Ryzen AI NPUs (via XDNA). The magic part: it’s actually fast, and the API is OpenAI-compatible, so you can slot it into any existing toolchain that expects ChatGPT or OpenAI’s endpoints.
The server handles quantization, model loading, KV cache management, and batched inference out of the box. You pull a model, the server spins up, and you hit /v1/chat/completions like you would with OpenAI’s API. That’s it.
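Because the API mirrors OpenAI’s, the request and response bodies follow the standard chat-completions shapes. A minimal sketch of the wire format, no server required (the response dict below is a hand-written sample in the standard shape, not real Lemonade output):

```python
import json

# A chat completion request body in the standard OpenAI shape.
request_body = {
    "model": "llama2:7b",
    "messages": [
        {"role": "user", "content": "Explain containers like I am five."}
    ],
    "temperature": 0.7,
}

# A response in the standard OpenAI shape (hand-written sample, not real output).
sample_response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant",
                        "content": "Containers are like lunchboxes for programs."},
            "finish_reason": "stop",
        }
    ],
}

# Pulling the text out is the same one-liner you'd use against OpenAI.
reply = sample_response["choices"][0]["message"]["content"]
print(json.dumps(request_body, indent=2))
print(reply)
```

If those two dict shapes look familiar, that’s the point: every OpenAI client library already speaks this format.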
Why the NPU Part Matters
Discrete AMD GPUs (RX 7000 series, MI-series) are solid, but they’re not in everyone’s hands. Ryzen AI NPUs, though? They’re shipping in laptops right now. The 16-core XDNA unit in a Ryzen AI processor can run small-to-medium models with surprising efficiency, and crucially, at a fraction of a GPU’s power draw.
Lemonade’s NPU support means you can run a 7B model on your laptop’s NPU while your GPU handles gaming or video work. Or, on a Ryzen AI desktop chip (if you’ve got one), you’re looking at a dedicated coprocessor for inference that doesn’t compete for memory bandwidth with your main workloads.
Installation
```shell
pip install amd-lemonade
```

Or, if you prefer Docker:
```shell
docker pull amd/lemonade:latest
```

That’s genuinely it. No ROCm versioning drama. No “which docker image do I need?” Lemonade handles the AMD driver discovery at runtime.
Getting Started: Pull a Model and Serve It
```shell
lemonade pull llama2:7b
lemonade serve llama2:7b --port 8000
```

The server starts on localhost:8000. Now your OpenAI-compatible endpoint is live.
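The server needs a moment to load weights before it answers. If you’re scripting against it, a small stdlib-only poller avoids racing the startup (the URL and port are the ones used above; nothing here is Lemonade-specific):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it answers or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=interval)
            return True
        except urllib.error.HTTPError:
            # Server is up; it just didn't like the bare GET. Good enough.
            return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)
    return False

# Example: wait for the server started above before sending requests.
# ready = wait_for_server("http://localhost:8000/v1/models", timeout=120)
```

Then the curl and Python examples below can run without a “connection refused” on the first try.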
Test It With curl
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "messages": [{"role": "user", "content": "Explain containers like I am five."}],
    "temperature": 0.7
  }'
```

Response looks exactly like OpenAI’s API. Drop-in replacement energy.
Use It From Python
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-local",  # Lemonade doesn't enforce API keys
)

response = client.chat.completions.create(
    model="llama2:7b",
    messages=[
        {"role": "system", "content": "You are a helpful DevOps expert."},
        {"role": "user", "content": "Why would I use containers instead of VMs?"},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)
```

Same client library, same API, just pointing at your local machine.
Docker Compose for a Persistent Server
```yaml
version: '3.8'

services:
  lemonade:
    image: amd/lemonade:latest
    ports:
      - "8000:8000"
    environment:
      - ROCM_HOME=/opt/rocm
      - HSA_OVERRIDE_GFX_VERSION=11.0.0  # numeric form; match to your GPU architecture
    volumes:
      - lemonade-models:/root/.lemonade/models
    restart: unless-stopped
    command: serve llama2:7b --port 8000

volumes:
  lemonade-models:
```

Spin it up with docker compose up -d. Your inference server persists across restarts, and models are cached in a named volume.
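If you want Compose to notice when inference stops responding, you can add a healthcheck under the service. The /v1/models path is part of the standard OpenAI API surface, so it should be a cheap probe, but treat the exact endpoint as an assumption about Lemonade (and this assumes curl exists inside the image); swap in whatever health endpoint your version actually exposes:

```yaml
    # Goes under services.lemonade in the compose file above.
    # Assumes /v1/models is served and curl is available in the image.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/models"]
      interval: 30s
      timeout: 5s
      retries: 3
```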
Performance: What to Expect
On an RX 7900 XT: ~60–80 tokens/sec for a 7B model. On an RX 7600: ~20–30 tokens/sec (still usable).
On a Ryzen AI NPU: ~10–20 tokens/sec for a 3.8B model (efficient, fits in the power budget of a laptop). CPU fallback gets you ~2–5 tokens/sec (viable for small models, terrible for anything else).
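To make those numbers concrete: generation time for a response is roughly output tokens divided by tokens per second (ignoring prompt processing, which adds a bit up front). A back-of-the-envelope calculator using midpoints of the rates quoted above:

```python
def response_time_s(output_tokens: int, tokens_per_sec: float) -> float:
    """Rough generation time in seconds, ignoring prompt processing and startup."""
    return output_tokens / tokens_per_sec

# A 512-token answer at roughly the midpoint of each range above:
rates = [("RX 7900 XT", 70), ("RX 7600", 25), ("Ryzen AI NPU", 15), ("CPU fallback", 3.5)]
for label, tps in rates:
    print(f"{label}: ~{response_time_s(512, tps):.0f} s")
```

A 512-token answer lands in about 7 seconds on the 7900 XT but closer to two and a half minutes on CPU fallback, which is why “viable for small models, terrible for anything else” is the right summary.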
For comparison, llama.cpp on the same hardware runs ~5–10% faster on GPU but requires more setup. Ollama is comparable but has historically been more CUDA-optimized.
Lemonade vs. Ollama vs. llama.cpp (AMD Edition)
Lemonade: Official AMD support, NPU-aware, clean API, fast iteration. Best if you want the “blessed” AMD experience.
Ollama: More mature, broader model library, bigger community. Works on AMD but ROCm support feels like an afterthought.
llama.cpp: Bleeding-edge quantization, maximum fine-tuning, slightly faster. Heavier learning curve; you’re managing everything yourself.
Pick Lemonade if you’ve got AMD hardware and want something that just works. Pick Ollama if you need maximum compatibility across models. Pick llama.cpp if you enjoy suffering.
Models That Work Well
Lemonade officially supports the usual suspects: Llama 2/3/3.1, Gemma, Mistral, Mixtral, Qwen, and Phi. Quantized versions (Q4, Q5) work great. Avoid the 70B models unless you’ve got an MI300X (AMD’s data center GPU). The mid-range stuff (7B, 13B, occasionally 34B) is where Lemonade shines.
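A rough way to sanity-check whether a quantized model fits your VRAM: weights take about (parameters × bits per weight) / 8 bytes, plus extra for the KV cache and activations. The bits-per-weight figures below are ballpark effective rates for common llama.cpp-style quant formats, not Lemonade-specific numbers:

```python
# Ballpark effective bits per weight (assumption: typical llama.cpp-style
# quant formats; exact figures vary by quant variant).
BITS_PER_WEIGHT = {"q4": 4.5, "q5": 5.5, "q8": 8.5, "fp16": 16.0}

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); KV cache is extra."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(f"7B  @ Q4: ~{weight_gb(7, 'q4'):.1f} GB")   # fits a mid-range card
print(f"13B @ Q5: ~{weight_gb(13, 'q5'):.1f} GB")  # wants a 16 GB+ card
print(f"70B @ Q4: ~{weight_gb(70, 'q4'):.1f} GB")  # data-center territory
```

Roughly 39 GB of weights for a 70B Q4 model, before the KV cache, is exactly why the 70B class is MI300X territory rather than something for an RX 7000 card.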
The Upshot
If you’re an AMD user who’s felt left out while Nvidia users casually deployed local LLMs, Lemonade is the first time in years that AMD has felt genuinely competitive here. It’s not going to beat a 4090, but it doesn’t need to. It’s fast enough for real work, the NPU support is legitimately useful, and the OpenAI-compatible API means you’re not learning new tooling.
When life gives you AMD, stop buying lemons. Start with Lemonade.