AMD Lemonade: Local LLM Serving for AMD GPUs
AMD finally has a fast, open-source local LLM server that uses both GPU and NPU. If you've been jealous of Nvidia users, Lemonade is worth your time.
All the articles with the tag "llm".
JSON mode forces models to output valid JSON. When it's a lifesaver vs. when it's overkill and makes the model worse.
Temperature and top_p control randomness in LLMs. No probability theory needed. Just practical intuition and how to tune them.
vLLM, llama.cpp, and Ollama all run local LLMs — compare throughput, memory use, GPU support, and which fits your hardware.
RAG breaks documents into chunks. But what chunk size? Too small and context is lost. Too large and semantic search fails. Here's how to pick.
Stop juggling 17 different LLM SDKs. LiteLLM and vLLM give you a unified OpenAI-compatible API for every model — local or cloud, fast and production-ready.
System prompts are your secret weapon. How they work, why they matter more than you think, and 5 patterns that actually change model behavior.
Q4_K_M is the default, but it's not magic. When Q3, Q5, or Q6 makes sense. How to benchmark quantization tradeoffs on your hardware.
Ollama can load one model at a time on limited hardware. How to switch between models, use CPU offloading, and manage VRAM intelligently.
What's the actual difference between context window and token limit? Why one model says 8K and another says 128K. A practical breakdown.
Why your GPU fills up with Ollama. How to inspect VRAM, tune keep-alive, force-unload models with a single request, and stop the reload pain in 2026.
Learn how to build a local RAG system using Ollama and ChromaDB for free. Step-by-step guide with Docker Compose, Python code, chunking strategies, and real-world examples.