The RAG Assumption
You want to ask a model about your company’s documentation. You can’t fit the entire docs into the model’s context window, so you use RAG (Retrieval-Augmented Generation): break docs into chunks, embed them, search for relevant chunks, stuff the top results into the prompt.
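To make the pipeline concrete, here is a minimal sketch of the retrieval step. It uses a bag-of-words vector with cosine similarity as a toy stand-in for a real embedding model; the chunk texts and helper names are illustrative, not from any particular library.

```python
from collections import Counter
import math

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words vector.
    In production you'd call a real embedding API instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    """Rank chunks by similarity to the query, return the best top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "Docker images are templates for containers.",
    "To stop a container: docker stop <id>.",
    "Kubernetes orchestrates containers across machines.",
]
print(retrieve("how do I stop a docker container", chunks, top_k=1))
```

The chunks here are already made; everything that follows is about how you should have made them.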
Sounds good. But here’s where people get it wrong: they guess at chunk size.
Common choices: 256 tokens, 512 tokens, “one paragraph”, “one section”. None of these are wrong, but most aren’t optimal for your actual workload.
The Tradeoff Visualized
Chunk Size = 100 tokens
├─ Pro: Search is precise (fewer false positives)
└─ Con: Loses context (answer might span multiple chunks)

Chunk Size = 500 tokens
├─ Pro: Self-contained context (question+answer often fit)
└─ Con: Semantic search gets noisy (many irrelevant matches)

Chunk Size = 2000 tokens
├─ Pro: Rich context
└─ Con: Model context fills up quickly, wastes retrieval budget

How Chunk Size Affects Search Quality
When you embed and search, smaller chunks are more precise but less informative. Larger chunks are richer but include more noise.
```python
# Example: searching a knowledge base about Docker

doc = """Docker is a containerization platform. It uses images and containers.
Images are templates. Containers are running instances of images.
You build images with Dockerfiles. You run containers with 'docker run'.
To list running containers: docker ps
To see all containers: docker ps -a
To stop a container: docker stop <id>"""

# Scenario 1: 50-token chunks
chunks_small = [
    "Docker is a containerization platform. It uses images and containers.",
    "Images are templates. Containers are running instances of images.",
    "You build images with Dockerfiles. You run containers with 'docker run'.",
    "To list running containers: docker ps. To see all containers: docker ps -a.",
    "To stop a container: docker stop <id>",
]

# Scenario 2: 200-token chunks
chunks_large = [
    "Docker is a containerization platform. It uses images and containers. "
    "Images are templates. Containers are running instances of images. "
    "You build images with Dockerfiles.",
    "To run containers: 'docker run'. To list running containers: docker ps. "
    "To see all: docker ps -a. To stop: docker stop <id>.",
]

# When searching for "how to list containers", small chunks match more precisely.
# When searching for "what is a container", large chunks give better context.
```

Smaller chunks = more precise matches but less context per hit. Larger chunks = richer context but noisier matches.
The Math: Optimal Chunk Size
There’s no universal optimal size, but research suggests 300–500 tokens for general knowledge bases.
Why? Because most questions fit within 500 tokens of context:
Typical question + answer:
- Question: 20 tokens
- Answer: 150–300 tokens
- Total: 170–320 tokens

If your chunk is 256 tokens, the answer often fits entirely.
If your chunk is 100 tokens, you need 2–3 chunks (search overhead).
If your chunk is 1000 tokens, you waste context (search brings ~700 irrelevant tokens).

Practical Approach: Test Your Workload
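The arithmetic is easy to check. Assuming a 300-token answer (the upper end of the range above), this sketch tallies how many chunks you'd fetch at each size and how many tokens that retrieval costs:

```python
import math

def tokens_retrieved(answer_tokens, chunk_size):
    """Chunks needed to cover the answer, and total tokens that retrieval pulls in."""
    chunks_needed = math.ceil(answer_tokens / chunk_size)
    return chunks_needed, chunks_needed * chunk_size

for size in (100, 256, 512, 1024):
    n, total = tokens_retrieved(300, size)
    print(f"chunk={size}: {n} chunk(s), {total} tokens retrieved, {total - 300} wasted")
```

At 1024 tokens per chunk, a single hit drags in roughly 700 tokens of context the answer doesn't need, which matches the waste figure above.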
Don’t guess. Test with your actual documents and queries.
```python
#!/usr/bin/env python3

# Chunk your documents at different sizes
def chunk_text(text, chunk_size_tokens):
    """Naive chunking by token count."""
    tokens = text.split()  # Oversimplified; use tiktoken in production
    chunks = []
    for i in range(0, len(tokens), chunk_size_tokens):
        chunks.append(' '.join(tokens[i:i + chunk_size_tokens]))
    return chunks

# Test different chunk sizes
test_query = "How do I stop a Docker container?"
chunk_sizes = [100, 256, 512, 1024]

with open('docker_guide.txt') as f:
    doc_text = f.read()

for size in chunk_sizes:
    chunks = chunk_text(doc_text, size)

    # Simulate retrieval (count how many chunks needed for a good answer)
    relevant = [chunk for chunk in chunks if 'docker stop' in chunk.lower()]

    print(f"Chunk size {size}: {len(relevant)} chunks needed to answer query")
    print(f"  Total tokens retrieved: {size * len(relevant)}")
    print()

# Pick the chunk size that minimizes tokens retrieved
# while still finding all necessary context
```

Chunk Overlap: The Secret Weapon
Here’s what most people miss: overlapping chunks improve quality significantly without much overhead.
```python
def chunk_text_with_overlap(text, chunk_size=500, overlap=100):
    """Chunk with overlapping windows (default: 100-token overlap)."""
    tokens = text.split()
    chunks = []

    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        if len(chunk.split()) > 50:  # Avoid tiny tail chunks
            chunks.append(chunk)

    return chunks
```

Overlap means questions that span a chunk boundary now have their context preserved:
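To see the boundary effect concretely, here is a compact restatement of the overlap chunker (minus the tail-chunk filter, so it runs standalone) applied to a stream of synthetic tokens. A token near the first boundary lands in two chunks instead of one:

```python
def chunk_with_overlap(text, chunk_size=500, overlap=100):
    """Sliding-window chunking: each window starts (chunk_size - overlap) tokens after the last."""
    tokens = text.split()
    step = chunk_size - overlap
    return [' '.join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

text = ' '.join(f"t{i}" for i in range(1200))  # 1200 synthetic "tokens"
chunks = chunk_with_overlap(text)

# Token t450 sits near the first chunk boundary (tokens 400-499),
# so it appears in both chunk 0 (0-499) and chunk 1 (400-899).
print(sum('t450' in c.split() for c in chunks))
```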
Without overlap:
[Chunk 1: tokens 0-499]
[Chunk 2: tokens 500-999] ← If question is about tokens 499-501, you miss context

With 100-token overlap:
[Chunk 1: tokens 0-499]
[Chunk 2: tokens 400-899] ← Both chunks now available in search results

Overhead is minimal (10-20% more embeddings), and quality increases.
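You can verify the overhead claim with a little counting. This sketch (my own arithmetic, not from the source) computes how many chunks a sliding window produces with and without overlap:

```python
import math

def num_chunks(n_tokens, chunk_size, overlap=0):
    """Number of chunks a sliding window with the given overlap produces."""
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    return math.ceil((n_tokens - overlap) / step)

n = 10_000  # a ~10k-token document
base = num_chunks(n, 400)             # 400-token chunks, no overlap
with_overlap = num_chunks(n, 400, 50) # 50-token overlap
print(f"{base} -> {with_overlap} chunks ({100 * (with_overlap - base) / base:.0f}% more embeddings)")
```

With a 50-token overlap on 400-token chunks the embedding count grows by roughly 15%, squarely in the 10-20% range; a heavier 100-token overlap on 500-token chunks costs about 25%.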
Document-Aware Chunking
The absolute best approach: break chunks at logical boundaries, not token counts.
```python
def smart_chunk(text):
    """Break at section headers, then split large sections."""
    sections = text.split('## ')  # Markdown headers
    chunks = []

    for section in sections:
        if len(section.split()) < 200:
            chunks.append(section)
        else:
            # Split this section into 500-token chunks
            tokens = section.split()
            for i in range(0, len(tokens), 500):
                chunks.append(' '.join(tokens[i:i + 500]))

    return chunks
```

This respects the document structure. Answers rarely span a section boundary, so you get coherent chunks.
Quick Decision Tree
- Small docs (<10 pages)? Use section-based chunking, no overlap.
- Medium knowledge base (100s of pages)? 400-token chunks, 50-token overlap.
- Large corpus (1000s of pages)? 512-token chunks, 100-token overlap, test with your queries.
- Dense technical docs? Smaller chunks (256 tokens). Overlap is critical.
- Long-form content (articles)? Larger chunks (800-1000 tokens).
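The decision tree above maps straightforwardly to a helper function. This is a hypothetical sketch: the name `chunking_params`, the `doc_type` labels, and the 64-token overlap for dense technical docs are my own choices, not from the source.

```python
def chunking_params(pages, doc_type="general"):
    """Map the decision tree to (chunk_size, overlap) in tokens.
    chunk_size=None signals section-based chunking."""
    if doc_type == "technical":
        return 256, 64        # dense docs: small chunks, overlap is critical
    if doc_type == "long_form":
        return 900, 100       # articles: larger chunks
    if pages < 10:
        return None, 0        # small docs: section-based, no overlap
    if pages < 1000:
        return 400, 50        # medium knowledge base
    return 512, 100           # large corpus: test with your queries

print(chunking_params(500))
```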
The Reality
Chunk size matters, but it’s not a binary good-or-bad choice. Test with your actual queries and documents. Most teams overthink it. Pick a reasonable default (400-500 tokens), add overlap, and iterate.
Your RAG quality will improve 10x more from adding overlap than from agonizing over chunk size.