The Confusion
You see a model spec: “Llama 2 70B, 4K context window.” Then you see another: “Mistral 7B, 32K context.” Then you see output token limits, input token limits, and suddenly you’re reading a paper just to figure out what you can actually do.
Here’s the thing: they’re not the same thing, and confusing them will bite you.
Context Window: What the Model Can See
Your context window is the maximum amount of text the model can look at in a single request. It’s measured in tokens.
A token is roughly 4 characters. Not exactly—it’s language-dependent—but close enough.
- 8K context window = ~32,000 characters of text
- 32K context window = ~128,000 characters of text

Think of it like a person’s short-term memory. If someone’s short-term memory holds 8,000 tokens, they can read and understand your entire conversation only if it fits in 8,000 tokens. Beyond that, they forget the earlier parts.
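The arithmetic above is easy to sketch. A minimal helper, assuming the rough 4-characters-per-token rule for English (real tokenizers vary by language and vocabulary):

```python
def approx_chars(tokens: int, chars_per_token: float = 4.0) -> int:
    """Rough character capacity for a given token budget (English text)."""
    return int(tokens * chars_per_token)

print(approx_chars(8_192))    # an 8K window: roughly 32,000 characters
print(approx_chars(32_768))   # a 32K window: roughly 128,000 characters
```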
Token Limit: What the Model Can Output
Token limit (or “max output tokens”) is the maximum length of the response the model can generate. Also measured in tokens.
Many models let you set this per-request:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain quantum computing",
  "num_predict": 500,
  "stream": false
}'
```

`num_predict: 500` = “generate up to 500 tokens in your response.”
This is independent of context window. You could have a 32K context window but ask for only 100 tokens of output.
How They Interact
Let’s say you’re using Claude (200K context window):
- Input: Your entire conversation history + new prompt. Total: 180K tokens.
- Output: Claude can generate up to some limit (usually 4K tokens for Claude 3).
The model processes all 180K tokens of context to generate those 4K output tokens.
Now you’re using Mistral (32K context):
- Input: Your conversation history + new prompt. If that’s 25K tokens, you’re fine. If it’s 33K tokens, you hit the wall—the model can’t see the full context.
- Output: Limited to whatever room remains in the 32K window after your input (and typically capped lower for practical reasons).
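The interaction boils down to one budget check. A minimal sketch, assuming input and output tokens share the same window (true of most decoder-only models):

```python
def fits_in_window(input_tokens: int, max_output_tokens: int,
                   context_window: int) -> bool:
    """True if the prompt plus the requested output fit in one context window."""
    return input_tokens + max_output_tokens <= context_window

print(fits_in_window(25_000, 1_000, 32_768))  # True: room to spare
print(fits_in_window(33_000, 1_000, 32_768))  # False: you hit the wall
```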
Real-World Numbers
| Model | Context Window | Output Limit | Use Case |
|---|---|---|---|
| Llama 2 7B | 4K | 4K | Budget inference |
| Mistral 7B | 32K | 8K | More context, still efficient |
| Mixtral 8x7B | 32K | 16K | Mixture of experts |
| Claude 3 | 200K | 4K | Document analysis |
| GPT-4 | 128K | 4K | Extended reasoning |

Practical Implications
If your context window is too small: You can’t feed the model a long document. Trying to RAG (Retrieval-Augmented Generation) across a 10,000-word article with a 4K context model? You’ll have to chunk aggressively, losing context.
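Aggressive chunking can be as simple as a word-based splitter with overlap. A sketch using the rough 0.75-words-per-token rule (the overlap size here is an assumption you would tune for your documents):

```python
def chunk_by_tokens(text: str, max_tokens: int, overlap_words: int = 50,
                    words_per_token: float = 0.75) -> list[str]:
    """Split text into overlapping word chunks sized for a token budget."""
    max_words = max(1, int(max_tokens * words_per_token))
    step = max(1, max_words - overlap_words)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), step)]

article = "word " * 10_000                       # stand-in for a 10,000-word article
chunks = chunk_by_tokens(article, max_tokens=3_000)
print(len(chunks))                               # 5 overlapping chunks
```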
If your output limit is too small: The model cuts off mid-response. Common in free API tiers.
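You can detect the cutoff programmatically. With OpenAI-style responses, `finish_reason == "length"` means generation stopped because the output limit was hit; other backends use different field names, so treat this as a sketch:

```python
def hit_output_limit(response: dict) -> bool:
    """OpenAI-style check: did generation stop because of max_tokens?"""
    return response["choices"][0].get("finish_reason") == "length"

truncated = {"choices": [{"text": "...", "finish_reason": "length"}]}
complete  = {"choices": [{"text": "...", "finish_reason": "stop"}]}
print(hit_output_limit(truncated))  # True
print(hit_output_limit(complete))   # False
```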
If you set the output limit to match the context window: you’ve left no room for input. Output tokens share the same window as input tokens on most models, so asking a 32K-window model for 32K tokens of output means your prompt can’t fit at all.
```python
# Example: RAG workflow with token budgeting
context_window = 8192         # Mistral
reserved_for_output = 1024    # leave room for the response
available_for_input = context_window - reserved_for_output  # 7168

def estimate_tokens(text: str) -> int:
    return len(text) // 4     # rough rule: ~4 characters per token

# Now chunk your documents to fit in 7168 tokens.
# Each chunk becomes part of the context.
for chunk in documents:
    if estimate_tokens(chunk) > available_for_input:
        # Too big, split further
        pass
```

The Gotcha: Effective vs. Theoretical
Some models claim larger context windows than they’re effectively good at using. A model trained with 4K context window, then fine-tuned to accept 32K? It often performs worse on content beyond ~8K. The weights aren’t calibrated for it.
This is why “effective context window” is a thing people test empirically. The advertised number is the maximum, not the recommended.
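Empirical testing usually means a “needle in a haystack” probe: bury a fact at varying depths in filler text and check whether the model can retrieve it. A sketch of the prompt-building half (the filler sentence and the 4-chars-per-token estimate are my assumptions; the model call itself is up to you):

```python
def needle_prompt(needle: str, depth: float, target_tokens: int,
                  chars_per_token: float = 4.0) -> str:
    """Bury `needle` at relative `depth` (0.0 = start, 1.0 = end)
    inside roughly target_tokens worth of filler text."""
    filler = "The quick brown fox jumps over the lazy dog. "
    total_chars = int(target_tokens * chars_per_token)
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(haystack) * depth)
    return (haystack[:pos] + " " + needle + " " + haystack[pos:]
            + "\n\nWhat is the magic number mentioned above?")

# Probe retrieval at the midpoint of an ~8K-token context
prompt = needle_prompt("The magic number is 7481.", depth=0.5, target_tokens=8_000)
```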
Token Counting Before You Send
If you’re building applications, count tokens before sending — not after getting a truncation error:
```python
# Using tiktoken (OpenAI-compatible tokenizer; a close approximation for most models)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer
text = "Your long document here..."
token_count = len(enc.encode(text))
print(f"Tokens: {token_count}")

# For Ollama/llama.cpp models, use a rough rule:
# 1 token ≈ 0.75 words ≈ 4 characters
estimated_tokens = len(text) // 4
```

Rough rule: 1 token ≈ 4 characters in English. A 10,000-word article is about 13,000 tokens. Plan your context budget accordingly.
Model-Specific Context Windows
Just for reference — common models and their actual limits:
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Llama 3.1 8B | 128K tokens |
| Mistral 7B | 32K tokens |
| Gemma 2 9B | 8K tokens |
Gemma 2 at 8K is tight. Mistral at 32K is comfortable for most tasks. But remember: bigger context = more VRAM when running locally.
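The VRAM cost comes mostly from the KV cache, which grows linearly with context length. A back-of-envelope calculator; the Mistral 7B shape used below (32 layers, 8 KV heads via grouped-query attention, head dimension 128) is my assumption, so check the model card for your model:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_param: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x dtype bytes."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_param

gib = kv_cache_bytes(32, 8, 128, 32_768) / 1024**3
print(f"{gib:.1f} GiB")  # 4.0 GiB at fp16 for a full 32K context
```

Halving the context you actually allocate halves this number, which is why local runners let you set it explicitly.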
Quick Checklist
- Writing a chatbot? Context window matters more (conversation history). Output limit is usually fine at 2-4K tokens.
- Summarizing documents? Ensure context window fits your doc + prompt.
- Generating code? Set output limit explicitly. Don’t let it ramble forever.
- Running locally? Smaller context window = faster inference, less VRAM. Make the tradeoff consciously.
Don’t mix them up, and your prompts will actually work the way you expect.