Context Window vs Token Limit: Not the Same Thing

By SumGuy · 5 min read

The Confusion

You see a model spec: “Llama 2 70B, 4K context window.” Then you see another: “Mistral 7B, 32K context.” Then you see output token limits, input token limits, and suddenly you’re reading a paper just to figure out what you can actually do.

Here’s the thing: they’re not the same thing, and confusing them will bite you.

Context Window: What the Model Can See

Your context window is the maximum amount of text the model can look at in a single request. It’s measured in tokens.

A token is roughly 4 characters. Not exactly—it’s language-dependent—but close enough.

8K context window = ~32,000 characters of text
32K context window = ~128,000 characters of text

Think of it like a person’s short-term memory. If someone’s short-term memory can hold 8,000 tokens, they can read and understand your entire conversation only if it fits in 8,000 tokens. Beyond that, they forget the earlier parts.
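As a back-of-envelope check, the ~4-characters-per-token rule is enough to tell whether text will fit. A minimal sketch (the helper names here are mine, not from any library):

```python
# Rough fit check: ~4 characters per token (English text; real tokenizers vary)
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_in_context(text: str, context_window: int) -> bool:
    return estimate_tokens(text) <= context_window

conversation = "word " * 8000               # ~40,000 characters
print(estimate_tokens(conversation))        # 10000 tokens
print(fits_in_context(conversation, 8192))  # False: overflows an 8K window
```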

Token Limit: What the Model Can Output

Token limit (or “max output tokens”) is the maximum length of the response the model can generate. Also measured in tokens.

Many models let you set this per-request:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain quantum computing",
  "num_predict": 500,
  "stream": false
}'

num_predict: 500 = “generate up to 500 tokens in your response.”

This is independent of context window. You could have a 32K context window but ask for only 100 tokens of output.
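To make the distinction concrete, here's a sketch that builds the same request body in Python, assuming a local Ollama endpoint. Nothing is sent; we only construct the JSON you would POST:

```python
import json

# Same request as the curl example, with a deliberately tiny output cap.
# num_predict limits the response length independently of the model's
# 32K context window. POST this body to http://localhost:11434/api/generate.
payload = {
    "model": "mistral",
    "prompt": "Explain quantum computing",
    "num_predict": 100,  # only 100 tokens of output, despite the 32K window
    "stream": False,
}
body = json.dumps(payload)
print(body)
```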

How They Interact

Let’s say you’re using Claude (200K context window): you feed it 180K tokens of documents and ask for up to 4K tokens of output. The model processes all 180K tokens of context to generate those 4K output tokens.

Now you’re using Mistral (32K context): the same 180K-token input simply doesn’t fit. You have to cut or chunk it down to what the window can hold, minus whatever you reserve for the output.
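For most models the window is shared between input and output, so the fit check is simple arithmetic. A sketch (the function name is illustrative):

```python
# For most models, input plus reserved output must fit in the window together.
def request_fits(input_tokens: int, max_output: int, context_window: int) -> bool:
    return input_tokens + max_output <= context_window

print(request_fits(180_000, 4_000, 200_000))  # True: fits a Claude-sized window
print(request_fits(180_000, 4_000, 32_000))   # False: must shrink for Mistral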

Real-World Numbers

Model          Context Window   Output Limit   Use Case
-------------  ---------------  -------------  ------------------------------
Llama 2 7B     4K               4K             Budget inference
Mistral 7B     32K              8K             More context, still efficient
Mixtral 8x7B   32K              16K            Mixture of experts
Claude 3       200K             4K             Document analysis
GPT-4 Turbo    128K             4K             Extended reasoning

Practical Implications

If your context window is too small: You can’t feed the model a long document. Trying to RAG (Retrieval-Augmented Generation) across a 10,000-word article with a 4K context model? You’ll have to chunk aggressively, losing context.

If your output limit is too small: The model cuts off mid-response. Common in free API tiers.

If you set the output limit to match the context window: You’ve left no room for input. The window is shared, so every token reserved for output is a token that can’t be spent processing your prompt.

# Example: RAG workflow with token budgeting
context_window = 8192          # Mistral
reserved_for_output = 1024     # leave room for the response
available_for_input = context_window - reserved_for_output  # 7168

def estimate_tokens(text):
    return len(text) // 4      # rough stand-in for a real tokenizer

# Chunk your documents to fit in 7168 tokens;
# each chunk becomes part of the context
for chunk in documents:
    if estimate_tokens(chunk) > available_for_input:
        # Too big: split this chunk further before sending
        ...
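The "split further" step can be sketched as a naive paragraph-based chunker using the same ~4 chars/token estimate. A toy implementation, not production chunking:

```python
# Naive chunker: accumulate paragraphs until the next one would blow the
# input budget, then start a new chunk. Budget is in tokens (~4 chars each).
def chunk_by_budget(text: str, budget_tokens: int) -> list[str]:
    max_chars = budget_tokens * 4
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "\n\n".join(f"Paragraph {i}. " + "text " * 100 for i in range(50))
pieces = chunk_by_budget(doc, budget_tokens=1024)
print(len(pieces), max(len(p) for p in pieces))  # every piece stays under 4096 chars
```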

The Gotcha: Effective vs. Theoretical

Some models claim larger context windows than they can effectively use. A model trained with a 4K context window, then fine-tuned to accept 32K? It often performs worse on content beyond ~8K. The weights aren’t calibrated for it.

This is why “effective context window” is a thing people test empirically. The advertised number is the maximum, not the recommended.
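One common way such empirical tests are set up: bury a "needle" fact at increasing depths of filler text and ask the model to recall it. Only the prompt construction is sketched here (helper name is mine); scoring requires calling an actual model:

```python
# Build "needle in a haystack" prompts: the needle sits after depth_tokens
# of filler (~4 chars/token). Recall accuracy vs. depth reveals the
# effective context window, as opposed to the advertised one.
def needle_prompt(needle: str, depth_tokens: int, filler: str = "lorem ipsum ") -> str:
    n_chars = depth_tokens * 4
    prefix = (filler * (n_chars // len(filler) + 1))[:n_chars]
    return f"{prefix}\n{needle}\nQuestion: repeat the secret code above."

for depth in (1_000, 8_000, 16_000):
    p = needle_prompt("The secret code is 7421.", depth)
    print(depth, len(p))  # prompt length grows with depth
```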

Token Counting Before You Send

If you’re building applications, count tokens before sending — not after getting a truncation error:

# Using tiktoken (OpenAI's tokenizer: exact for GPT models,
# a decent estimate for most others)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer
text = "Your long document here..."
token_count = len(enc.encode(text))
print(f"Tokens: {token_count}")

# For Ollama/llama.cpp models without tiktoken, use a rough rule:
# 1 token ≈ 0.75 words ≈ 4 characters
estimated_tokens = len(text) // 4

Rough rule: 1 token ≈ 4 characters in English. A 10,000-word article is about 13,000 tokens. Plan your context budget accordingly.

Model-Specific Context Windows

Just for reference — common models and their actual limits:

Model              Context Window
-----------------  --------------
GPT-4o             128K tokens
Claude 3.5 Sonnet  200K tokens
Llama 3.1 8B       128K tokens
Mistral 7B         32K tokens
Gemma 2 9B         8K tokens

Gemma 2 at 8K is tight. Mistral at 32K is comfortable for most tasks. But remember: bigger context = more VRAM when running locally.
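The VRAM cost comes largely from the KV cache, which stores a key vector and a value vector per layer per token. A rough estimate with illustrative 7B-class numbers (32 layers, 4096 hidden dim, fp16); real architectures with grouped-query attention cache far less:

```python
# KV cache size: 2 (key + value) x layers x hidden dim x tokens x bytes/value.
# Illustrative figures only; exact numbers vary by architecture.
def kv_cache_bytes(n_layers: int, hidden_dim: int, seq_len: int,
                   bytes_per_val: int = 2) -> int:
    return 2 * n_layers * hidden_dim * seq_len * bytes_per_val

gb = kv_cache_bytes(32, 4096, 32_768) / 1024**3
print(f"{gb:.1f} GiB")  # 16.0 GiB of KV cache at a full 32K context
```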

Quick Checklist

- Context window: the total amount of text the model can see in a single request.
- Token limit (max output tokens): the longest response it can generate.
- Count your input tokens before sending, and reserve part of the window for output.
- Treat the advertised context window as a maximum, not a recommendation.

Don’t mix them up, and your prompts will actually work the way you expect.

