So you’ve downloaded a large language model. Maybe it’s a 7B parameter beast, maybe you went full send on a 70B quantized monstrosity that makes your GPU fans sound like a jet engine. Either way, you need something to actually talk to this thing. You need an interface.
And if you’ve spent more than fifteen minutes in local LLM communities, two names keep coming up like recurring NPCs in a video game: Text Generation Web UI (affectionately known as “oobabooga” after its creator) and KoboldCpp. Both are free, open-source, and capable of running models on your own hardware. But they approach the problem from very different philosophies.
Let’s break down what each one does, how they differ, and — most importantly — which one deserves a spot in your workflow.
What Even Are These Things?
Before we get into the weeds, let’s establish the basics.
Text Generation Web UI (text-generation-webui, or just “ooba”) is a Gradio-based web interface for running large language models locally. Think of it as the Swiss Army knife of local LLM tools. It supports multiple backends and dozens of model formats, has an extension system, and sports a UI that looks like someone gave a machine learning researcher access to every Gradio widget simultaneously. It’s powerful. It’s flexible. It’s also occasionally the reason you’re reading Stack Overflow at 2 AM.
KoboldCpp is a single compiled binary (or a simple build from source) that loads GGUF models and serves them with a built-in web UI and API. Think of it as a purpose-built sports car versus ooba’s modular pickup truck with a camper shell, toolbox, and possibly a mounted telescope. KoboldCpp does fewer things, but it does them with startling simplicity.
Installation and Setup: The First Boss Fight
Text Generation Web UI
Installing ooba involves cloning a repo, running a start script, and waiting while it downloads half of PyPI. The project provides one-click installers for Windows, Linux, and macOS, which helps — but “one-click” is doing some heavy lifting here. Under the hood, it’s creating a conda environment, installing PyTorch, and pulling in backend-specific dependencies.
# The "one-click" approach
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# Linux/macOS
./start_linux.sh
# Windows
start_windows.bat
The first run takes a while. You’ll select your GPU type (NVIDIA, AMD, Intel, Apple M-series, or CPU-only), and the script handles the rest. Mostly. Sometimes you’ll need to manually fix CUDA versions or deal with dependency conflicts, especially if you’re running AMD hardware. ROCm support exists but can feel like it’s held together with optimism and zip ties.
Docker setup is also available and arguably the sanest path if you want reproducibility:
docker run -d \
--gpus all \
-p 7860:7860 \
-v ./models:/app/models \
atinoda/text-generation-webui:latest
There are several community Docker images floating around, so check which one matches your hardware and desired backend.
KoboldCpp
KoboldCpp’s installation philosophy is “what if we just… didn’t?” On Windows, you download a single .exe file. You double-click it. It opens. That’s it. No Python. No conda. No dependency hell. No existential crisis.
On Linux, you can either grab a precompiled binary or build from source:
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_CUBLAS=1
# Or for Vulkan (AMD/Intel/NVIDIA)
make LLAMA_VULKAN=1
The build is fast because it’s C/C++ with minimal dependencies. The result is a single binary that includes the inference engine, the web UI, and the API server all in one package.
Docker works too, and it’s similarly straightforward:
docker run -d \
--gpus all \
-p 5001:5001 \
-v ./models:/models \
koboldai/koboldcpp-cuda:latest \
--model /models/your-model.gguf
Winner for setup simplicity: KoboldCpp, and it’s not even close. It’s the “it just works” option for people who want to run models without becoming a sysadmin first.
Model Format Support: The Format Wars
This is where things get interesting, because model formats in the local LLM world are a bit like video codecs in the early 2000s — there are too many of them, they’re not all compatible with each other, and someone on a forum is always angry about it.
KoboldCpp
KoboldCpp supports GGUF (and legacy GGML). That’s it. Full stop. GGUF is the format used by llama.cpp, the C/C++ inference engine that KoboldCpp is built on. GGUF models are quantized (compressed) versions of full-precision models, available in various quantization levels:
- Q2_K through Q8_0: Lower number = smaller file, lower quality
- Q4_K_M: The sweet spot for most people — good balance of size and quality
- Q5_K_M and Q6_K: When you have the VRAM to spare
- IQ variants (imatrix quantizations): Slightly better quality at the same size through importance-based quantization
GGUF is fantastic because it supports CPU inference, GPU offloading (put some layers on GPU, rest on CPU), and mixed CPU/GPU operation. If you have 8GB of VRAM and a 16GB model, GGUF lets you offload what fits to the GPU and run the rest on your CPU. It’s not as fast as full GPU inference, but it works.
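The offloading math is easy to sketch. Here is a back-of-the-envelope estimate in Python — the helper names and the ~4.5 bits-per-weight figure for Q4_K_M are illustrative assumptions, and real GGUF files also carry metadata plus a KV cache that needs headroom:

```python
def approx_model_gib(n_params_billions: float, bits_per_weight: float) -> float:
    """Rough quantized model size in GiB: params * bpw / 8, ignoring metadata."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 2**30

def layers_that_fit(model_gib: float, n_layers: int, vram_gib: float,
                    reserve_gib: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM, keeping some
    headroom for the KV cache and scratch buffers (very approximate)."""
    per_layer = model_gib / n_layers
    return max(0, min(n_layers, int((vram_gib - reserve_gib) / per_layer)))

size = approx_model_gib(7, 4.5)          # 7B at ~4.5 bpw: roughly 3.7 GiB
print(round(size, 1))
print(layers_that_fit(size, 32, 8.0))    # small model, 8 GiB card: everything fits
print(layers_that_fit(14.0, 40, 8.0))    # 14 GiB model, 8 GiB card: partial offload
```

In practice this estimate just gives you a starting value for the GPU-layers setting; most people nudge it up or down until the model loads without out-of-memory errors.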
Text Generation Web UI
Ooba supports everything. Well, almost everything:
- GGUF via the llama.cpp backend (same engine as KoboldCpp)
- GPTQ: An older GPU-only quantization format. Still works, still has models available, but increasingly playing second fiddle to newer options
- EXL2 via ExLlamaV2: The current king of GPU-only quantized inference. EXL2 quantizations are variable-bit, meaning different parts of the model can use different quantization levels based on sensitivity. Result: better quality per bit than uniform quantization
- AWQ: Another GPU quantization format, popular in some deployment scenarios
- HQQ: A newer quantization method that works without calibration data
- Full precision / FP16 / BF16: For when you have more VRAM than sense
- AQLM, QuIP#: Extreme low-bit quantization formats for the adventurous
The multi-backend approach means ooba can load models through ExLlamaV2, llama.cpp, transformers, AutoGPTQ, and others. This flexibility is genuinely powerful — if a new quantization format drops tomorrow, ooba will probably support it within a week.
Winner for format support: Text Generation Web UI by a landslide. If you need to run EXL2 or GPTQ models, KoboldCpp simply can’t do it.
The Web Interface: Where You Actually Do Things
Text Generation Web UI
Ooba’s interface is built with Gradio and organized into tabs: Chat, Default (completion mode), Notebook, and a session tab. The chat interface supports character cards (more on those later), conversation history, and multiple chat modes including instruct mode, chat mode, and chat-instruct mode.
The UI is dense. Your generation parameters live in a sidebar with sliders for temperature, top_p, top_k, typical_p, repetition penalty, min_p, and about fifteen other things. There are tabs within tabs. There are dropdowns that reveal more dropdowns. If you’re the kind of person who likes having thirty knobs to turn, this is paradise. If you’re the kind of person who gets overwhelmed at the Cheesecake Factory menu, you might want to sit down.
On the plus side, the UI is highly functional. You can:
- Switch between models without restarting
- Load and save character cards
- Manage multiple conversations
- Install extensions through the UI
- Monitor VRAM usage
- Adjust generation parameters in real-time
KoboldCpp
KoboldCpp’s interface is cleaner and more focused. It has a built-in web UI called KoboldAI Lite that loads in your browser. The interface is oriented toward creative writing and roleplay by default — it shows a story/conversation view with generation settings accessible through a settings panel.
The UI feels more like a writing tool and less like a machine learning experiment dashboard. Parameters are available but not overwhelming. There’s a “Story” mode and an “Adventure” mode (for AI dungeon-style interactive fiction), plus standard chat functionality.
KoboldCpp also launches with a configuration GUI on desktop platforms — a window where you set your model path, GPU layers, context size, and other options before the server starts. It’s a nice touch that eliminates the “wait, what command-line flags do I need?” guessing game.
Winner for UI: This one’s subjective. KoboldCpp wins for approachability; ooba wins for feature density. If you want a cockpit with every instrument, ooba. If you want a clean dashboard, KoboldCpp.
Samplers Explained: The Spice Rack of Text Generation
Both tools expose sampler parameters that control how the model picks the next token. Here’s the crash course, since both UIs will throw these terms at you:
Temperature: Controls randomness. At 0, the model always picks the most likely next token (deterministic but boring). At 1.0, it samples proportionally to the probability distribution. Above 1.0, things get increasingly unhinged. Most people land between 0.7 and 1.0.
Top-K: Only consider the top K most likely tokens. Top-K of 40 means the model picks from its 40 best guesses. Keeps things coherent but can be too restrictive.
Top-P (Nucleus Sampling): Instead of a fixed number of tokens, include tokens until their cumulative probability reaches P. Top-P of 0.9 means “include enough tokens to cover 90% of the probability mass.” More adaptive than Top-K.
Min-P: A newer sampler that’s become popular. Sets a minimum probability threshold relative to the top token. If the top token has 50% probability and Min-P is 0.05, any token with less than 2.5% probability (50% x 0.05) gets excluded. Elegant and effective — many users now prefer Min-P over Top-K/Top-P.
Repetition Penalty: Reduces the probability of tokens that have already appeared. Prevents the model from getting stuck in loops like “The the the the the.” Values between 1.05 and 1.20 are common. Too high and the model starts avoiding common words unnaturally.
Typical-P: Filters tokens based on how “typical” they are — how close their information content is to the expected information content. It’s like telling the model “be surprising, but not too surprising.”
Mirostat: An entropy-targeting sampler that automatically adjusts its behavior to maintain a target “surprise level.” Set a target tau value and let it handle the rest. Available in both tools.
Both KoboldCpp and ooba expose all of these (and more). KoboldCpp has good defaults out of the box. Ooba gives you more rope — which is great if you know what you’re doing and hazardous if you don’t.
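The interplay between these knobs is easier to see in code. Here is a minimal, illustrative sketch of a temperature-then-Min-P step over a toy four-token distribution — not either tool’s actual implementation, just the arithmetic described above:

```python
import math

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p * max(probs).
    E.g. a 50% top token with min_p=0.05 sets the cutoff at 2.5%."""
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    total = sum(p for _, p in kept)
    return [(i, p / total) for i, p in kept]  # renormalize the survivors

logits = [2.0, 1.0, 0.0, -3.0]      # toy vocabulary of 4 tokens
probs = apply_temperature(logits, 1.0)
survivors = min_p_filter(probs, 0.05)
print(len(survivors))  # the weakest token falls below 5% of the top one
```

Higher temperature flattens the distribution before the filter runs, so more tokens survive Min-P — which is exactly why the two settings interact.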
API Compatibility: Talking to Other Software
This is a big deal if you want to use your local model with other applications, scripts, or tools.
KoboldCpp
KoboldCpp provides:
- KoboldAI API (the classic KAI endpoint): Used by SillyTavern, TavernAI, and other frontends
- OpenAI-compatible API: Drop-in replacement for OpenAI’s API endpoints. Point your scripts at http://localhost:5001/v1/chat/completions and they work as if you’re talking to GPT-4 (content quality may vary, obviously)
The OpenAI-compatible API is particularly useful because an enormous ecosystem of tools speaks “OpenAI.” Changing one URL and removing an API key is way easier than rewriting your application.
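As a sketch of what that looks like in practice, here is a stdlib-only Python client. It assumes KoboldCpp’s default port (5001) and that no API key is required locally; local servers generally ignore the model name:

```python
import json
import urllib.request

API_URL = "http://localhost:5001/v1/chat/completions"  # KoboldCpp's default port

def build_payload(prompt: str, temperature: float = 0.7,
                  max_tokens: int = 200) -> dict:
    """Build an OpenAI-style chat completions payload."""
    return {
        "model": "local",  # placeholder; most local servers ignore this field
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST a single-turn chat request and return the reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (needs a running server):
# print(chat("Summarize the plot of Hamlet in two sentences."))
```

Swap the URL back to OpenAI’s endpoint and add an Authorization header, and the same code talks to the real thing — which is the whole appeal.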
Text Generation Web UI
Ooba provides:
- OpenAI-compatible API: Same deal — serves an endpoint that mimics OpenAI’s format
- Custom API: Ooba’s own API format, which some tools support natively
- The API is accessible on a separate port or can share the main port
Both tools can serve as backends for SillyTavern, which is the most popular dedicated frontend for roleplay and character interactions. SillyTavern connects to either one seamlessly.
Winner for API compatibility: Roughly a tie. Both support OpenAI-compatible endpoints. KoboldCpp has a slight edge in the KAI ecosystem; ooba’s API has slightly more configuration options.
Extensions and Plugins: Bolting Things On
Text Generation Web UI
This is where ooba flexes. The extension system supports:
- AllTalk TTS: Text-to-speech for model outputs
- Whisper STT: Speech-to-text for voice input
- Long-term memory: RAG-style retrieval augmented generation
- Multimodal: Image input support for vision models (LLaVA, etc.)
- Training: LoRA fine-tuning directly from the UI
- API extensions: Custom API endpoints
- Gallery: Browse and install extensions through the UI
- Superboogav2: Enhanced RAG with chunk management
Extensions are Python scripts that hook into ooba’s event system. The community is active, and new extensions appear regularly. If you want your local LLM setup to do something exotic — like summarize PDFs, search the web, or generate images — there’s probably an extension for it.
KoboldCpp
KoboldCpp’s approach to extensibility is more conservative. It includes:
- Image generation: Built-in Stable Diffusion support (experimental)
- Whisper: Built-in speech-to-text
- Vision model support: For multimodal models
- LoRA loading: Apply LoRA adapters at load time
These features are compiled into the binary rather than loaded as plugins. There’s no plugin marketplace or extension ecosystem. The philosophy is more “we’ll add important features to the core” rather than “let the community build whatever.”
Winner for extensibility: Text Generation Web UI, easily. If you want a modular, extensible platform, ooba is the way.
Character Cards and Roleplay: Yes, People Use These for Stories
Both tools support character cards — JSON/PNG files that define a character’s personality, greeting message, example dialogues, and system prompt. The standard format is compatible with SillyTavern and the broader character card ecosystem (TavernAI cards, CharacterHub, etc.).
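As a rough sketch, a minimal card carries fields along these lines. The field names below mirror the common TavernAI-style format, but exact schemas vary between card spec versions, so treat the example (and the character itself) as purely illustrative:

```python
import json

# Hypothetical character card; field names follow the widely used
# TavernAI-style layout (name, persona, greeting, example dialogue).
card = {
    "name": "Archivist",
    "description": "A meticulous librarian AI who answers in numbered lists.",
    "personality": "precise, dry-witted, helpful",
    "scenario": "The user has wandered into an infinite library.",
    "first_mes": "Welcome. State your query and I will locate the shelf.",
    "mes_example": "<START>\n{{user}}: Where are the atlases?\n"
                   "{{char}}: 1. Third floor. 2. East wing.",
}

card_json = json.dumps(card, indent=2)  # saved as .json, or embedded in a PNG
```

The {{user}} and {{char}} placeholders get substituted by the frontend at prompt-assembly time, which is why the same card works across different tools.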
KoboldCpp has slightly more built-in support for interactive fiction modes (the Adventure mode is unique to it). Ooba treats character cards as one feature among many.
If roleplay and creative writing are your primary use case, most people end up using SillyTavern as the frontend with either ooba or KoboldCpp as the backend. In that scenario, the backend’s built-in character card support matters less.
Performance: The Numbers Game
Let’s talk speed, because waiting 30 seconds for a response kills the vibe.
Raw Inference Speed
For GGUF models (the only common ground), both tools use llama.cpp under the hood, so raw token generation speed is essentially identical. KoboldCpp tracks the llama.cpp repo closely and often incorporates new optimizations quickly. Ooba’s llama.cpp backend also stays reasonably current.
For EXL2 models (ooba-only via ExLlamaV2), performance is typically faster than GGUF on NVIDIA GPUs when the model fits entirely in VRAM. ExLlamaV2 is extremely well-optimized for NVIDIA hardware. If you have a 24GB RTX 4090 and your model fits in VRAM, EXL2 is probably the fastest way to run it.
Startup Time
KoboldCpp loads faster. It’s a compiled binary that starts almost instantly, then loads the model. Ooba has to initialize Python, load Gradio, initialize the selected backend, and then load the model. The difference is usually 10-30 seconds, not minutes, but it’s noticeable.
Memory Usage
KoboldCpp is leaner on system RAM since it’s not running a Python runtime. For the actual model, memory usage is comparable between the two when using the same format and quantization.
Context Size
Both support large context sizes (up to 128K+ tokens depending on the model), but handling massive contexts requires sufficient RAM/VRAM. KoboldCpp has good support for context-extending techniques like RoPE scaling and YaRN, configured at load time. Ooba supports these through backend-specific settings.
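The linear flavor of RoPE scaling is simple enough to sketch. Positions beyond the trained context get squeezed back into the trained range by a constant factor — real implementations like YaRN adjust frequencies non-uniformly and the flag names differ per tool, so take this as the rough idea only:

```python
def linear_rope_freq_scale(trained_ctx: int, target_ctx: int) -> float:
    """Linear RoPE scaling (position interpolation): multiply positions by
    trained/target so the extended context maps into the trained range."""
    return trained_ctx / target_ctx

def scaled_position(pos: int, freq_scale: float) -> float:
    return pos * freq_scale

scale = linear_rope_freq_scale(4096, 8192)  # 0.5: each position counts half
print(scale)
print(scaled_position(8191, scale))         # the last token lands near 4095.5
```

The cost is resolution: neighboring tokens end up closer together in positional space, which is why aggressive scaling can degrade quality and why fine-tuned long-context models usually do better than scaling alone.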
Winner for performance: KoboldCpp has lower overhead; ooba wins if you can use EXL2 on NVIDIA hardware. For GGUF-to-GGUF, they’re neck and neck.
GPU Support: Who Can Play?
NVIDIA: Both tools work great. Full CUDA support, well-tested, most models are benchmarked on NVIDIA hardware. This is the path of least resistance.
AMD: KoboldCpp supports AMD GPUs through Vulkan (cross-platform, works well) and ROCm. Ooba supports AMD through ROCm, which can be finicky but works. KoboldCpp’s Vulkan support is generally easier to set up on AMD.
Intel Arc: KoboldCpp has Vulkan support that works on Intel GPUs. Ooba has experimental Intel support through various backends.
Apple Silicon: Both work on M1/M2/M3/M4 Macs. KoboldCpp uses Metal acceleration (built into llama.cpp). Ooba supports Metal through its llama.cpp backend and also has MPS support for some backends.
CPU-only: Both support CPU inference. It’s slow, but it works. KoboldCpp is slightly more optimized for CPU-only scenarios.
Winner for GPU support: KoboldCpp, mainly because Vulkan is more universally compatible than ROCm/CUDA-specific paths.
When to Use Which: The Decision Tree
Use Text Generation Web UI when:
- You want to run EXL2 or GPTQ models (NVIDIA GPU required)
- You need the extension ecosystem (TTS, STT, training, RAG)
- You want to experiment with multiple backends and model formats
- You’re comfortable with Python environments and occasional troubleshooting
- You want LoRA fine-tuning from a web UI
- You’re building a complex local AI setup with multiple capabilities
Use KoboldCpp when:
- You want the simplest possible setup (especially on Windows)
- You’re running GGUF models and don’t need other formats
- You have an AMD or Intel GPU (Vulkan support is great)
- You want low overhead and fast startup
- You’re primarily doing chat or creative writing
- You want a backend for SillyTavern without the complexity
- You’re on a system where Python dependency management is painful
Use both when:
- You run different models for different tasks
- You want EXL2 for your main model and GGUF for quick experiments
- You have the disk space and enjoy having options
The Honest Summary
KoboldCpp is what happens when someone says “what if running a local LLM was actually easy?” and then follows through on it. It’s a single binary that loads models and works. The simplicity is genuine and refreshing.
Text Generation Web UI is what happens when someone says “what if we supported everything?” and then also follows through on it. It’s a sprawling, capable platform that can do almost anything in the local LLM space, at the cost of more complex setup and occasional dependency headaches.
Neither is objectively “better.” They’re different tools for different mindsets. KoboldCpp is a screwdriver — perfectly designed for its purpose. Ooba is a workshop — more setup, more maintenance, but you can build anything in it.
If you’re just getting started with local LLMs, start with KoboldCpp. Download a GGUF model from HuggingFace, load it up, and start chatting. You can always graduate to ooba later when you need more.
And if you’re already deep in the local LLM rabbit hole, you probably have both installed anyway. Don’t pretend you don’t.