The RAG Assumption
You want to ask a model about your company’s documentation. You can’t fit the entire docs into the model’s context window, so you use RAG (Retrieval-Augmented Generation): break docs into chunks, embed them, search for relevant chunks, stuff the top results into the prompt.
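To make the pipeline concrete, here is a minimal sketch of the retrieval step. It uses a bag-of-words vector with cosine similarity as a toy stand-in for a real embedding model; the chunk texts and helper names are illustrative, not from any particular library.

```python
from collections import Counter
import math

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words vector.
    In production you'd call a real embedding API instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    """Rank chunks by similarity to the query, return the best top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "Docker images are templates for containers.",
    "To stop a container: docker stop <id>.",
    "Kubernetes orchestrates containers across machines.",
]
print(retrieve("how do I stop a docker container", chunks, top_k=1))
```

The chunks here are already made; everything that follows is about how you should have made them.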
Sounds good. But here’s where people get it wrong: they guess at chunk size.
Common choices: 256 tokens, 512 tokens, “one paragraph”, “one section”. None of these are wrong, but most aren’t optimal for your actual workload.
The Tradeoff Visualized
Chunk Size = 100 tokens
├─ Pro: Search is precise (fewer false positives)
└─ Con: Loses context (answer might span multiple chunks)

Chunk Size = 500 tokens
├─ Pro: Self-contained context (question+answer often fit)
└─ Con: Semantic search gets noisy (many irrelevant matches)

Chunk Size = 2000 tokens
├─ Pro: Rich context
└─ Con: Model context fills up quickly, wastes retrieval budget

How Chunk Size Affects Search Quality
When you embed and search, smaller chunks are more precise but less informative. Larger chunks are richer but include more noise.
```python
# Example: searching a knowledge base about Docker

doc = """Docker is a containerization platform. It uses images and containers.
Images are templates. Containers are running instances of images.
You build images with Dockerfiles. You run containers with 'docker run'.
To list running containers: docker ps
To see all containers: docker ps -a
To stop a container: docker stop <id>"""

# Scenario 1: 50-token chunks
chunks_small = [
    "Docker is a containerization platform. It uses images and containers.",
    "Images are templates. Containers are running instances of images.",
    "You build images with Dockerfiles. You run containers with 'docker run'.",
    "To list running containers: docker ps. To see all containers: docker ps -a.",
    "To stop a container: docker stop <id>",
]

# Scenario 2: 200-token chunks
chunks_large = [
    "Docker is a containerization platform. It uses images and containers. "
    "Images are templates. Containers are running instances of images. "
    "You build images with Dockerfiles.",
    "To run containers: 'docker run'. To list running containers: docker ps. "
    "To see all: docker ps -a. To stop: docker stop <id>.",
]

# When searching for "how to list containers", small chunks match more precisely.
# When searching for "what is a container", large chunks give better context.
```

Smaller chunks = more precise matches but less context per hit. Larger chunks = richer context but noisier matches.
The Math: Optimal Chunk Size
There’s no universal optimal size, but research suggests 300–500 tokens for general knowledge bases.
Why? Because most questions fit within 500 tokens of context:
Typical question + answer:
- Question: 20 tokens
- Answer: 150–300 tokens
- Total: 170–320 tokens

If your chunk is 256 tokens, the answer often fits entirely.
If your chunk is 100 tokens, you need 2–3 chunks (search overhead).
If your chunk is 1000 tokens, you waste context (search brings ~700 irrelevant tokens).

Practical Approach: Test Your Workload
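The arithmetic is easy to check. Assuming a 300-token answer (the upper end of the range above), this sketch tallies how many chunks you'd fetch at each size and how many tokens that retrieval costs:

```python
import math

def tokens_retrieved(answer_tokens, chunk_size):
    """Chunks needed to cover the answer, and total tokens that retrieval pulls in."""
    chunks_needed = math.ceil(answer_tokens / chunk_size)
    return chunks_needed, chunks_needed * chunk_size

for size in (100, 256, 512, 1024):
    n, total = tokens_retrieved(300, size)
    print(f"chunk={size}: {n} chunk(s), {total} tokens retrieved, {total - 300} wasted")
```

At 1024 tokens per chunk, a single hit drags in roughly 700 tokens of context the answer doesn't need, which matches the waste figure above.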
Don’t guess. Test with your actual documents and queries.
```python
#!/usr/bin/env python3

# Chunk your documents at different sizes
def chunk_text(text, chunk_size_tokens):
    """Naive chunking by token count."""
    tokens = text.split()  # Oversimplified; use tiktoken in production
    chunks = []
    for i in range(0, len(tokens), chunk_size_tokens):
        chunks.append(' '.join(tokens[i:i + chunk_size_tokens]))
    return chunks

# Test different chunk sizes
test_query = "How do I stop a Docker container?"
chunk_sizes = [100, 256, 512, 1024]

with open('docker_guide.txt') as f:
    doc_text = f.read()

for size in chunk_sizes:
    chunks = chunk_text(doc_text, size)

    # Simulate retrieval (count how many chunks needed for a good answer)
    relevant = [chunk for chunk in chunks if 'docker stop' in chunk.lower()]

    print(f"Chunk size {size}: {len(relevant)} chunks needed to answer query")
    print(f"  Total tokens retrieved: {size * len(relevant)}")
    print()

# Pick the chunk size that minimizes tokens retrieved
# while still finding all necessary context
```

Chunk Overlap: The Secret Weapon
Here’s what most people miss: overlapping chunks improve quality significantly without much overhead.
```python
def chunk_text_with_overlap(text, chunk_size=500, overlap=100):
    """Chunk with overlapping windows (default: 100-token overlap)."""
    tokens = text.split()
    chunks = []

    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        if len(chunk.split()) > 50:  # Avoid tiny tail chunks
            chunks.append(chunk)

    return chunks
```

Overlap means questions that span a chunk boundary now have their context preserved:
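To see the boundary effect concretely, here is a compact restatement of the overlap chunker (minus the tail-chunk filter, so it runs standalone) applied to a stream of synthetic tokens. A token near the first boundary lands in two chunks instead of one:

```python
def chunk_with_overlap(text, chunk_size=500, overlap=100):
    """Sliding-window chunking: each window starts (chunk_size - overlap) tokens after the last."""
    tokens = text.split()
    step = chunk_size - overlap
    return [' '.join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

text = ' '.join(f"t{i}" for i in range(1200))  # 1200 synthetic "tokens"
chunks = chunk_with_overlap(text)

# Token t450 sits near the first chunk boundary (tokens 400-499),
# so it appears in both chunk 0 (0-499) and chunk 1 (400-899).
print(sum('t450' in c.split() for c in chunks))
```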
Without overlap:
[Chunk 1: tokens 0-499]
[Chunk 2: tokens 500-999] ← If question is about tokens 499-501, you miss context

With 100-token overlap:
[Chunk 1: tokens 0-499]
[Chunk 2: tokens 400-899] ← Both chunks now available in search results

Overhead is minimal (10-20% more embeddings), and quality increases.
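You can verify the overhead claim with a little counting. This sketch (my own arithmetic, not from the source) computes how many chunks a sliding window produces with and without overlap:

```python
import math

def num_chunks(n_tokens, chunk_size, overlap=0):
    """Number of chunks a sliding window with the given overlap produces."""
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    return math.ceil((n_tokens - overlap) / step)

n = 10_000  # a ~10k-token document
base = num_chunks(n, 400)             # 400-token chunks, no overlap
with_overlap = num_chunks(n, 400, 50) # 50-token overlap
print(f"{base} -> {with_overlap} chunks ({100 * (with_overlap - base) / base:.0f}% more embeddings)")
```

With a 50-token overlap on 400-token chunks the embedding count grows by roughly 15%, squarely in the 10-20% range; a heavier 100-token overlap on 500-token chunks costs about 25%.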
Document-Aware Chunking
The absolute best approach: break chunks at logical boundaries, not token counts.
```python
def smart_chunk(text):
    """Break at section headers, then split large sections."""
    sections = text.split('## ')  # Markdown headers
    chunks = []

    for section in sections:
        if len(section.split()) < 200:
            chunks.append(section)
        else:
            # Split this section into 500-token chunks
            tokens = section.split()
            for i in range(0, len(tokens), 500):
                chunks.append(' '.join(tokens[i:i + 500]))

    return chunks
```

This respects the document structure. Answers rarely span a section boundary, so you get coherent chunks.
Quick Decision Tree
- Small docs (<10 pages)? Use section-based chunking, no overlap.
- Medium knowledge base (100s of pages)? 400-token chunks, 50-token overlap.
- Large corpus (1000s of pages)? 512-token chunks, 100-token overlap, test with your queries.
- Dense technical docs? Smaller chunks (256 tokens). Overlap is critical.
- Long-form content (articles)? Larger chunks (800-1000 tokens).
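The decision tree above maps straightforwardly to a helper function. This is a hypothetical sketch: the name `chunking_params`, the `doc_type` labels, and the 64-token overlap for dense technical docs are my own choices, not from the source.

```python
def chunking_params(pages, doc_type="general"):
    """Map the decision tree to (chunk_size, overlap) in tokens.
    chunk_size=None signals section-based chunking."""
    if doc_type == "technical":
        return 256, 64        # dense docs: small chunks, overlap is critical
    if doc_type == "long_form":
        return 900, 100       # articles: larger chunks
    if pages < 10:
        return None, 0        # small docs: section-based, no overlap
    if pages < 1000:
        return 400, 50        # medium knowledge base
    return 512, 100           # large corpus: test with your queries

print(chunking_params(500))
```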
The Reality
Chunk size matters, but it’s not a binary good-or-bad choice. Test with your actual queries and documents. Most teams overthink it. Pick a reasonable default (400-500 tokens), add overlap, and iterate.
Your RAG quality will improve 10x more from adding overlap than from agonizing over chunk size.