The Confusion
You see a model spec: “Llama 2 70B, 4K context window.” Then you see another: “Mistral 7B, 32K context.” Then you see output token limits, input token limits, and suddenly you’re reading a paper just to figure out what you can actually do.
Here’s the thing: they’re not the same thing, and confusing them will bite you.
Context Window: What the Model Can See
Your context window is the maximum amount of text the model can look at in a single request. It’s measured in tokens.
A token is roughly 4 characters. Not exactly—it’s language-dependent—but close enough.
- 8K context window = ~32,000 characters of text
- 32K context window = ~128,000 characters of text

Think of it like a person’s short-term memory. If someone’s short-term memory holds 8,000 tokens, they can read and understand your entire conversation only if it fits in 8,000 tokens. Beyond that, they forget the earlier parts.
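The arithmetic above is easy to sketch. A minimal helper, assuming the rough 4-characters-per-token rule for English (real tokenizers vary by language and vocabulary):

```python
def approx_chars(tokens: int, chars_per_token: float = 4.0) -> int:
    """Rough character capacity for a given token budget (English text)."""
    return int(tokens * chars_per_token)

print(approx_chars(8_192))    # an 8K window: roughly 32,000 characters
print(approx_chars(32_768))   # a 32K window: roughly 128,000 characters
```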
Token Limit: What the Model Can Output
Token limit (or “max output tokens”) is the maximum length of the response the model can generate. Also measured in tokens.
Many models let you set this per-request:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain quantum computing",
  "num_predict": 500,
  "stream": false
}'
```

`num_predict: 500` = “generate up to 500 tokens in your response.”
This is independent of context window. You could have a 32K context window but ask for only 100 tokens of output.
How They Interact
Let’s say you’re using Claude (200K context window):
- Input: Your entire conversation history + new prompt. Total: 180K tokens.
- Output: Claude can generate up to some limit (usually 4K tokens for Claude 3).
The model processes all 180K tokens of context to generate those 4K output tokens.
Now you’re using Mistral (32K context):
- Input: Your conversation history + new prompt. If that’s 25K tokens, you’re fine. If it’s 33K tokens, you hit the wall—the model can’t see the full context.
- Output: Limited to whatever room remains in the 32K window after your input (and typically capped lower for practical reasons).
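The interaction boils down to one budget check. A minimal sketch, assuming input and output tokens share the same window (true of most decoder-only models):

```python
def fits_in_window(input_tokens: int, max_output_tokens: int,
                   context_window: int) -> bool:
    """True if the prompt plus the requested output fit in one context window."""
    return input_tokens + max_output_tokens <= context_window

print(fits_in_window(25_000, 1_000, 32_768))  # True: room to spare
print(fits_in_window(33_000, 1_000, 32_768))  # False: you hit the wall
```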
Real-World Numbers
| Model | Context Window | Output Limit | Use Case |
|---|---|---|---|
| Llama 2 7B | 4K | 4K | Budget inference |
| Mistral 7B | 32K | 8K | More context, still efficient |
| Mixtral 8x7B | 32K | 16K | Mixture of experts |
| Claude 3 | 200K | 4K | Document analysis |
| GPT-4 | 128K | 4K | Extended reasoning |

Practical Implications
If your context window is too small: You can’t feed the model a long document. Trying to RAG (Retrieval-Augmented Generation) across a 10,000-word article with a 4K context model? You’ll have to chunk aggressively, losing context.
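Aggressive chunking can be as simple as a word-based splitter with overlap. A sketch using the rough 0.75-words-per-token rule (the overlap size here is an assumption you would tune for your documents):

```python
def chunk_by_tokens(text: str, max_tokens: int, overlap_words: int = 50,
                    words_per_token: float = 0.75) -> list[str]:
    """Split text into overlapping word chunks sized for a token budget."""
    max_words = max(1, int(max_tokens * words_per_token))
    step = max(1, max_words - overlap_words)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), step)]

article = "word " * 10_000                       # stand-in for a 10,000-word article
chunks = chunk_by_tokens(article, max_tokens=3_000)
print(len(chunks))                               # 5 overlapping chunks
```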
If your output limit is too small: The model cuts off mid-response. Common in free API tiers.
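You can detect the cutoff programmatically. With OpenAI-style responses, `finish_reason == "length"` means generation stopped because the output limit was hit; other backends use different field names, so treat this as a sketch:

```python
def hit_output_limit(response: dict) -> bool:
    """OpenAI-style check: did generation stop because of max_tokens?"""
    return response["choices"][0].get("finish_reason") == "length"

truncated = {"choices": [{"text": "...", "finish_reason": "length"}]}
complete  = {"choices": [{"text": "...", "finish_reason": "stop"}]}
print(hit_output_limit(truncated))  # True
print(hit_output_limit(complete))   # False
```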
If you set the output limit to match the context window: you’ve left no room for input. Output tokens share the same window as input tokens on most models, so asking a 32K-window model for 32K tokens of output means your prompt can’t fit at all.
```python
# Example: RAG workflow with token budgeting
context_window = 8192         # Mistral
reserved_for_output = 1024    # leave room for the response
available_for_input = context_window - reserved_for_output  # 7168

def estimate_tokens(text: str) -> int:
    return len(text) // 4     # rough rule: ~4 characters per token

# Now chunk your documents to fit in 7168 tokens.
# Each chunk becomes part of the context.
for chunk in documents:
    if estimate_tokens(chunk) > available_for_input:
        # Too big, split further
        pass
```

The Gotcha: Effective vs. Theoretical
Some models claim larger context windows than they’re effectively good at using. A model trained with 4K context window, then fine-tuned to accept 32K? It often performs worse on content beyond ~8K. The weights aren’t calibrated for it.
This is why “effective context window” is a thing people test empirically. The advertised number is the maximum, not the recommended.
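Empirical testing usually means a “needle in a haystack” probe: bury a fact at varying depths in filler text and check whether the model can retrieve it. A sketch of the prompt-building half (the filler sentence and the 4-chars-per-token estimate are my assumptions; the model call itself is up to you):

```python
def needle_prompt(needle: str, depth: float, target_tokens: int,
                  chars_per_token: float = 4.0) -> str:
    """Bury `needle` at relative `depth` (0.0 = start, 1.0 = end)
    inside roughly target_tokens worth of filler text."""
    filler = "The quick brown fox jumps over the lazy dog. "
    total_chars = int(target_tokens * chars_per_token)
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(haystack) * depth)
    return (haystack[:pos] + " " + needle + " " + haystack[pos:]
            + "\n\nWhat is the magic number mentioned above?")

# Probe retrieval at the midpoint of an ~8K-token context
prompt = needle_prompt("The magic number is 7481.", depth=0.5, target_tokens=8_000)
```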
Token Counting Before You Send
If you’re building applications, count tokens before sending — not after getting a truncation error:
```python
# Using tiktoken (OpenAI-compatible tokenizer; a close approximation for most models)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer
text = "Your long document here..."
token_count = len(enc.encode(text))
print(f"Tokens: {token_count}")

# For Ollama/llama.cpp models, use a rough rule:
# 1 token ≈ 0.75 words ≈ 4 characters
estimated_tokens = len(text) // 4
```

Rough rule: 1 token ≈ 4 characters in English. A 10,000-word article is about 13,000 tokens. Plan your context budget accordingly.
Model-Specific Context Windows
Just for reference — common models and their actual limits:
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Llama 3.1 8B | 128K tokens |
| Mistral 7B | 32K tokens |
| Gemma 2 9B | 8K tokens |
Gemma 2 at 8K is tight. Mistral at 32K is comfortable for most tasks. But remember: bigger context = more VRAM when running locally.
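The VRAM cost comes mostly from the KV cache, which grows linearly with context length. A back-of-envelope calculator; the Mistral 7B shape used below (32 layers, 8 KV heads via grouped-query attention, head dimension 128) is my assumption, so check the model card for your model:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_param: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x dtype bytes."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_param

gib = kv_cache_bytes(32, 8, 128, 32_768) / 1024**3
print(f"{gib:.1f} GiB")  # 4.0 GiB at fp16 for a full 32K context
```

Halving the context you actually allocate halves this number, which is why local runners let you set it explicitly.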
Quick Checklist
- Writing a chatbot? Context window matters more (conversation history). Output limit is usually fine at 2-4K tokens.
- Summarizing documents? Ensure context window fits your doc + prompt.
- Generating code? Set output limit explicitly. Don’t let it ramble forever.
- Running locally? Smaller context window = faster inference, less VRAM. Make the tradeoff consciously.
Don’t mix them up, and your prompts will actually work the way you expect.