You know what’s fun? Asking ChatGPT a question about your company’s internal docs and watching it confidently hallucinate an answer that sounds right but is completely made up. Real fun. Like trusting a confident stranger for directions in a city they’ve never visited.
RAG fixes this. And you don’t need a cloud subscription, a GPU cluster, or a second mortgage to build one.
In this guide, we’re going to build a fully local Retrieval-Augmented Generation system using Ollama (free, local LLMs) and ChromaDB (free, local vector database). Everything runs on your machine. No API keys. No metered billing. No sending your proprietary data to someone else’s servers.
Let’s get into it.
What Even Is RAG?
RAG stands for Retrieval-Augmented Generation. It’s a pattern where instead of relying on an LLM’s training data (which is frozen in time and might be wrong about your specific stuff), you first retrieve relevant documents from a knowledge base, then feed those documents to the LLM as context so it can generate an informed answer.
Think of it like this: imagine you hire a really smart intern. They’re brilliant, well-read, articulate — but they know absolutely nothing about your company. RAG is the equivalent of handing them a folder of relevant docs before each question. “Here, read these first, then answer.”
Without RAG, that intern just wings it. With RAG, they actually have receipts.
The RAG Pipeline in 30 Seconds
- Ingest: Take your documents, split them into chunks, generate embeddings (numerical representations) for each chunk, and store them in a vector database.
- Query: When a user asks a question, convert the question into an embedding, search the vector database for the most similar chunks, and retrieve the top results.
- Generate: Send those retrieved chunks along with the user’s question to the LLM, which generates an answer grounded in your actual data.
That’s it. Three steps. The magic is in the details, which we’re about to cover.
Why Ollama + ChromaDB?
There are approximately nine thousand ways to build a RAG system. Most tutorials point you at OpenAI’s API and Pinecone. Those work great — if you’re cool with paying per token and shipping your data to external servers.
Here’s why we’re going local:
- Ollama runs open-source LLMs (Llama 3, Mistral, Phi-3, Gemma, etc.) locally. It handles model management, serving, and — critically — embedding generation. Free. No API key.
- ChromaDB is an open-source vector database designed for AI applications. It’s lightweight, runs locally or in Docker, and has a dead-simple Python API.
- Privacy: Your data never leaves your machine. This matters for proprietary docs, medical records, legal files, or anything you’d rather not upload to the cloud.
- Cost: $0/month. Forever. The only cost is your hardware’s electricity bill.
The trade-off? Local models are smaller and less capable than GPT-4 or Claude. But for document Q&A over your own knowledge base? They’re more than good enough.
Setting Up the Stack with Docker Compose
Let’s get everything running. We’ll use Docker Compose to spin up both Ollama and ChromaDB so you don’t have to deal with installing dependencies on your host machine.
Create a `docker-compose.yml`:

```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Uncomment the following lines if you have an NVIDIA GPU
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  chromadb:
    image: chromadb/chroma:latest
    container_name: chromadb
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - ANONYMIZED_TELEMETRY=FALSE

volumes:
  ollama_data:
  chroma_data:
```
Fire it up:
```bash
docker compose up -d
```
Now pull the models you’ll need. We want a chat model and an embedding model:
```bash
# Chat model — llama3 is a solid all-rounder
docker exec ollama ollama pull llama3

# Embedding model — nomic-embed-text is great for RAG
docker exec ollama ollama pull nomic-embed-text
```
The embedding model is the unsung hero here. It converts text into vectors (arrays of numbers) that capture semantic meaning. `nomic-embed-text` produces 768-dimensional vectors and punches way above its weight for a model you can run on a laptop.
Give it a minute to download, then verify:
```bash
docker exec ollama ollama list
```
You should see both models listed. If you do, congratulations — you now have a local AI inference stack running. That was the hard part. (It wasn’t that hard.)
Python Project Setup
Create a project directory and set up a virtual environment:
```bash
mkdir rag-budget && cd rag-budget
python -m venv venv
source venv/bin/activate

pip install chromadb requests langchain langchain-community
```
We’re using `requests` to talk to Ollama’s REST API, `chromadb` for the vector store, and `langchain` for some helpful document processing utilities. You could do this without LangChain, but its text splitters save a lot of boilerplate.
Document Ingestion: Teaching Your System to Read
This is where we take your documents and prepare them for retrieval. The process has three phases: load, chunk, and embed.
Loading Documents
Let’s start simple with text files. Create a docs/ folder and drop some files in there:
```python
import os

def load_documents(docs_dir: str) -> list[dict]:
    """Load all text files from a directory."""
    documents = []
    for filename in os.listdir(docs_dir):
        if filename.endswith(('.txt', '.md')):
            filepath = os.path.join(docs_dir, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                content = f.read()
            documents.append({
                'content': content,
                'metadata': {
                    'source': filename,
                    'filepath': filepath
                }
            })
    print(f"Loaded {len(documents)} documents")
    return documents
```
Nothing fancy. For production, you’d want to handle PDFs, Word docs, HTML, etc. LangChain has loaders for all of those, but let’s keep it focused.
Chunking: The Art of Splitting Text
Here’s where a lot of RAG systems silently go wrong. You can’t just shove an entire 50-page document into a vector database and expect good results. You need to split it into chunks — but how you split matters enormously.
Why chunking matters: Vector search finds the chunks most similar to your query. If your chunks are too big, they contain too much irrelevant noise and the signal gets diluted. If they’re too small, they lack enough context to be useful. It’s a Goldilocks problem.
Here are the main strategies:
Fixed-Size Chunking (Simple but Effective)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents: list[dict],
                    chunk_size: int = 500,
                    chunk_overlap: int = 50) -> list[dict]:
    """Split documents into overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = []
    for doc in documents:
        splits = splitter.split_text(doc['content'])
        for i, split in enumerate(splits):
            chunks.append({
                'content': split,
                'metadata': {
                    **doc['metadata'],
                    'chunk_index': i,
                    'chunk_total': len(splits)
                }
            })
    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    return chunks
```
The `RecursiveCharacterTextSplitter` is smart about where it cuts. It tries to split at paragraph breaks first, then sentences, then words. The `chunk_overlap` parameter creates overlap between consecutive chunks so you don’t lose context at the boundaries.
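If the overlap mechanics feel abstract, here’s a stripped-down fixed-size chunker in plain Python (a toy stand-in for illustration, not LangChain’s actual algorithm):

```python
def naive_chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size chunking: each chunk starts `chunk_size - chunk_overlap`
    characters after the previous one, so consecutive chunks share
    `chunk_overlap` characters at the boundary."""
    assert 0 <= chunk_overlap < chunk_size
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Refunds are issued within 30 days of purchase to the original payment method."
for chunk in naive_chunk(text, chunk_size=40, chunk_overlap=10):
    print(repr(chunk))
```

Notice that the last 10 characters of each chunk reappear at the start of the next. Information sitting right on a boundary still lands intact in at least one chunk, which is exactly the insurance overlap buys you.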
Choosing Chunk Size
Here’s my rule of thumb:
| Chunk Size | Best For | Trade-off |
|---|---|---|
| 200-300 chars | Precise factual Q&A | May lack context |
| 500-800 chars | General document Q&A | Good balance (start here) |
| 1000-1500 chars | Summarization, complex topics | More noise per chunk |
Start with 500 characters and a 50-character overlap. Adjust based on your results. This is the single most impactful tuning knob in your entire RAG system, and most people barely touch it.
Generating Embeddings with Ollama
Now we convert each chunk into a vector using Ollama’s embedding API:
```python
import requests

OLLAMA_BASE_URL = "http://localhost:11434"

def get_embedding(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Generate embedding for a single text using Ollama."""
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/embeddings",
        json={"model": model, "prompt": text}
    )
    response.raise_for_status()
    return response.json()["embedding"]

def get_embeddings_batch(texts: list[str],
                         model: str = "nomic-embed-text") -> list[list[float]]:
    """Generate embeddings for multiple texts."""
    embeddings = []
    for i, text in enumerate(texts):
        embeddings.append(get_embedding(text, model))
        if (i + 1) % 50 == 0:
            print(f"  Embedded {i + 1}/{len(texts)} chunks...")
    return embeddings
```
Each call to the embedding endpoint takes a string and returns a 768-dimensional vector. On a decent CPU, expect around 10-30 chunks per second. With a GPU, it’s significantly faster.
Storing in ChromaDB
Now we stick those embeddings into ChromaDB:
```python
import chromadb

def create_collection(chunks: list[dict],
                      collection_name: str = "knowledge_base"):
    """Create a ChromaDB collection and add document chunks."""
    client = chromadb.HttpClient(host="localhost", port=8000)

    # Delete existing collection if it exists (for re-indexing)
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass

    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # Use cosine similarity
    )

    # Prepare data for batch insertion
    documents = [chunk['content'] for chunk in chunks]
    metadatas = [chunk['metadata'] for chunk in chunks]
    ids = [f"chunk_{i}" for i in range(len(chunks))]

    # Generate embeddings
    print("Generating embeddings...")
    embeddings = get_embeddings_batch(documents)

    # Add to collection in batches
    batch_size = 100
    for i in range(0, len(documents), batch_size):
        end = min(i + batch_size, len(documents))
        collection.add(
            documents=documents[i:end],
            embeddings=embeddings[i:end],
            metadatas=metadatas[i:end],
            ids=ids[i:end]
        )
    print(f"Added {len(documents)} chunks to collection '{collection_name}'")
    return collection
```
The `hnsw:space` setting of `cosine` tells ChromaDB to use cosine similarity for vector comparison. This is the standard choice for text embeddings — it measures the angle between vectors rather than the distance between them, which works better for comparing semantic meaning.
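If you want to see what “measures the angle” means concretely, cosine similarity is a few lines of plain Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = orthogonal (unrelated), -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A vector and a scaled copy point the same way, so similarity is 1.0
# even though their Euclidean distance is large.
print(round(cosine_similarity([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]), 6))  # → 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))               # → 0.0
```

Note that ChromaDB reports cosine *distance*, which is 1 minus this similarity; a distance near 0 means a very close match.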
Putting Ingestion Together
Here’s the complete ingestion pipeline:
```python
def ingest_documents(docs_dir: str = "docs",
                     collection_name: str = "knowledge_base"):
    """Full ingestion pipeline: load -> chunk -> embed -> store."""
    # Load
    documents = load_documents(docs_dir)
    if not documents:
        print("No documents found!")
        return None

    # Chunk
    chunks = chunk_documents(documents, chunk_size=500, chunk_overlap=50)

    # Embed and store
    collection = create_collection(chunks, collection_name)
    return collection

# Run it
collection = ingest_documents("docs")
```
Drop some text files in docs/, run this script, and your knowledge base is built. That’s the hardest part done.
Querying: Asking Questions
Now for the fun part. Let’s build the query pipeline that retrieves relevant chunks and generates answers.
Retrieval
```python
def retrieve_context(query: str,
                     collection_name: str = "knowledge_base",
                     n_results: int = 5) -> list[dict]:
    """Retrieve the most relevant chunks for a query."""
    client = chromadb.HttpClient(host="localhost", port=8000)
    collection = client.get_collection(collection_name)

    # Generate query embedding
    query_embedding = get_embedding(query)

    # Search
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )

    # Format results
    contexts = []
    for i in range(len(results['documents'][0])):
        contexts.append({
            'content': results['documents'][0][i],
            'metadata': results['metadatas'][0][i],
            'distance': results['distances'][0][i]
        })
    return contexts
```
This takes a question, embeds it with the same model used for the documents (this is important — always use the same embedding model for ingestion and querying), searches ChromaDB for the nearest vectors, and returns the top matches.
Generation
Now we feed those retrieved chunks to Ollama’s chat model:
```python
def generate_answer(query: str,
                    contexts: list[dict],
                    model: str = "llama3") -> str:
    """Generate an answer using retrieved context."""
    # Build context string
    context_text = "\n\n---\n\n".join([
        f"[Source: {ctx['metadata'].get('source', 'unknown')}]\n{ctx['content']}"
        for ctx in contexts
    ])

    # Build prompt
    prompt = f"""You are a helpful assistant. Answer the user's question based ONLY
on the provided context. If the context doesn't contain enough information to answer
the question, say so — do not make up information.

CONTEXT:
{context_text}

QUESTION: {query}

ANSWER:"""

    # Call Ollama
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.3,  # Lower = more factual
                "num_ctx": 4096      # Context window size
            }
        }
    )
    response.raise_for_status()
    return response.json()["response"]
```
A few key details in that prompt:
- “based ONLY on the provided context” — This is the anti-hallucination instruction. Without it, the model will happily fill gaps with made-up information.
- Temperature 0.3 — Lower temperature means more deterministic, factual responses. For creative writing, crank it up. For document Q&A, keep it low.
- Source attribution — We include the source file in the context so the model can reference where information came from.
The Complete Query Function
```python
def ask(query: str,
        collection_name: str = "knowledge_base",
        n_results: int = 5,
        model: str = "llama3") -> str:
    """Full RAG pipeline: retrieve context, then generate answer."""
    print(f"\nQuery: {query}")
    print("Retrieving relevant context...")
    contexts = retrieve_context(query, collection_name, n_results)

    print(f"Found {len(contexts)} relevant chunks:")
    for i, ctx in enumerate(contexts):
        source = ctx['metadata'].get('source', 'unknown')
        distance = ctx['distance']
        print(f"  [{i+1}] {source} (similarity: {1 - distance:.3f})")

    print("Generating answer...")
    answer = generate_answer(query, contexts, model)
    print(f"\nAnswer: {answer}")
    return answer

# Try it out
ask("What is our refund policy?")
ask("How do I configure the API authentication?")
```
That’s your complete RAG system. Load docs, chunk them, embed them, store them, retrieve them, generate answers. All local. All free.
Real-World Walkthrough: Internal Docs Chatbot
Let’s make this concrete. Say you’re building a chatbot that answers questions about your company’s internal documentation — onboarding guides, API docs, HR policies, engineering runbooks.
Step 1: Organize Your Docs
```
docs/
├── onboarding/
│   ├── getting-started.md
│   ├── dev-environment-setup.md
│   └── team-structure.md
├── engineering/
│   ├── api-reference.md
│   ├── deployment-guide.md
│   └── incident-response.md
└── hr/
    ├── pto-policy.md
    ├── expense-reports.md
    └── benefits-guide.md
```
Step 2: Enhance the Loader for Subdirectories
```python
import os

def load_documents_recursive(docs_dir: str) -> list[dict]:
    """Load documents from a nested directory structure."""
    documents = []
    for root, dirs, files in os.walk(docs_dir):
        for filename in files:
            if filename.endswith(('.txt', '.md')):
                filepath = os.path.join(root, filename)
                # Get relative path for better source tracking
                rel_path = os.path.relpath(filepath, docs_dir)
                category = os.path.dirname(rel_path) or "general"
                with open(filepath, 'r', encoding='utf-8') as f:
                    content = f.read()
                documents.append({
                    'content': content,
                    'metadata': {
                        'source': filename,
                        'filepath': rel_path,
                        'category': category
                    }
                })
    print(f"Loaded {len(documents)} documents from {docs_dir}")
    return documents
```
Step 3: Add a Simple Chat Loop
```python
def chat():
    """Interactive chat loop for your knowledge base."""
    print("=" * 50)
    print("  Internal Docs Assistant")
    print("  Type 'quit' to exit, 'reindex' to rebuild")
    print("=" * 50)

    while True:
        query = input("\nYou: ").strip()
        if not query:
            continue
        if query.lower() == 'quit':
            print("Goodbye!")
            break
        if query.lower() == 'reindex':
            ingest_documents("docs")
            print("Re-indexed!")
            continue
        answer = ask(query)
        print(f"\nAssistant: {answer}")

chat()
```
That’s it. You now have an internal docs chatbot. Your team can ask natural-language questions about your documentation and get answers grounded in your actual content.
Performance Tips
Now that you have a working system, here are some ways to make it better:
1. Tune Your Chunk Size Empirically
Don’t just guess. Try different chunk sizes (300, 500, 800, 1200) on the same set of test questions and compare answer quality. The “right” size depends entirely on your documents. Technical API docs with dense information often do better with smaller chunks. Narrative documents like policies and guides can handle larger ones.
2. Use Metadata Filtering
ChromaDB supports metadata filters. If your user asks a question and you know it’s about HR policies, filter the search:
```python
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={"category": "hr"}  # Only search HR docs
)
```
This dramatically improves relevance when you have diverse document types.
3. Re-rank Results
Vector similarity isn’t perfect. A cheap trick that helps: retrieve more results than you need (say, 10-15), then use the LLM itself to re-rank them by relevance before using the top 5 for generation. It adds latency but improves answer quality significantly.
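Here’s one way to structure that, with the scoring function left pluggable. In a real system, `score_fn` would prompt the LLM for a relevance rating; the toy word-overlap scorer below just stands in for demonstration:

```python
from typing import Callable

def rerank(query: str,
           contexts: list[dict],
           score_fn: Callable[[str, str], float],
           top_k: int = 5) -> list[dict]:
    """Re-order retrieved chunks by `score_fn(query, chunk_text)` and
    keep the `top_k` best. The scorer is pluggable: in a real system it
    would ask the LLM to rate each chunk's relevance."""
    scored = [(score_fn(query, ctx['content']), ctx) for ctx in contexts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ctx for _, ctx in scored[:top_k]]

def word_overlap(query: str, text: str) -> float:
    """Toy scorer: number of words the query and chunk share."""
    return len(set(query.lower().split()) & set(text.lower().split()))

contexts = [{'content': 'the office dress code is casual'},
            {'content': 'refunds are issued within 30 days'}]
best = rerank("how many days for refunds", contexts, word_overlap, top_k=1)
print(best[0]['content'])  # → refunds are issued within 30 days
```

The pattern stays the same whatever the scorer: retrieve 10-15 candidates, score them all, keep the top handful for generation.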
4. Cache Embeddings
If your documents don’t change often, cache the embeddings. Don’t re-generate them every time you restart. ChromaDB persists data to disk by default with our Docker setup, so your embeddings survive container restarts.
5. Consider Your Hardware
- CPU-only: Works fine. Expect 2-5 seconds for embedding, 10-30 seconds for generation with Llama 3 8B.
- NVIDIA GPU: Uncomment the GPU section in the Docker Compose file. Embedding becomes near-instant, generation drops to 2-5 seconds.
- Apple Silicon: Run Ollama natively (not in Docker) to leverage the Metal GPU. Huge performance boost.
If you’re on an M1/M2/M3 Mac, install Ollama directly via `brew install ollama` instead of Docker — you’ll get significantly better performance through Metal acceleration.
When RAG Beats Fine-Tuning
People often ask: “Should I fine-tune a model on my data or use RAG?” Here’s the decision framework:
Choose RAG when:
- Your data changes frequently (docs get updated, new content added)
- You need source attribution (showing where answers came from)
- You want to keep your base model general-purpose
- Your data is proprietary and you don’t want it baked into model weights
- You need to get something working this week, not this quarter
Choose fine-tuning when:
- You need the model to adopt a specific tone, style, or behavior
- Your task is highly specialized (medical coding, legal classification)
- Retrieval latency is unacceptable for your use case
- Your knowledge is static and well-defined
Choose both when:
- You want the model to understand your domain deeply (fine-tune) AND stay current with changing data (RAG)
For most teams building internal tools, RAG is the right choice. It’s faster to set up, easier to maintain, and your data stays separate from the model — which means you can swap models without re-training.
Common Pitfalls (Learn from My Mistakes)
1. Using different embedding models for ingestion and querying. The vectors have to live in the same space. If you embed docs with nomic-embed-text and queries with all-minilm, your results will be garbage. Always match your models.
2. Not handling document updates. When a document changes, you need to re-ingest it. Build a simple mechanism to detect changes (file modification timestamps, hashes) and re-index only what changed.
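A minimal sketch of hash-based change detection, assuming a small state file (the `.index_state.json` name is arbitrary) that records what was last indexed:

```python
import hashlib
import json
import os

def file_fingerprints(docs_dir: str) -> dict[str, str]:
    """Map each doc's relative path to a SHA-256 digest of its contents."""
    fingerprints = {}
    for root, _dirs, files in os.walk(docs_dir):
        for filename in files:
            if filename.endswith(('.txt', '.md')):
                filepath = os.path.join(root, filename)
                with open(filepath, 'rb') as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                fingerprints[os.path.relpath(filepath, docs_dir)] = digest
    return fingerprints

def changed_files(docs_dir: str, state_file: str = ".index_state.json") -> list[str]:
    """Compare current fingerprints to the last indexed state, record the
    new state, and return the files that need re-ingesting."""
    current = file_fingerprints(docs_dir)
    previous = {}
    if os.path.exists(state_file):
        with open(state_file) as f:
            previous = json.load(f)
    changed = [path for path, digest in current.items()
               if previous.get(path) != digest]
    with open(state_file, 'w') as f:
        json.dump(current, f)
    return changed
```

On the first run everything counts as changed; afterwards only edited or new files come back, and you can delete and re-add just those chunks in ChromaDB.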
3. Ignoring chunk boundaries. If a critical piece of information spans two chunks and neither chunk captures the full context, your retrieval will miss it. That’s what chunk_overlap is for. Don’t set it to zero.
4. Stuffing too much context into the prompt. More retrieved chunks isn’t always better. If you stuff 20 chunks into the prompt, the LLM has to parse through a wall of text and might get confused or miss the relevant part. 3-5 well-chosen chunks usually outperform 15 mediocre ones.
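A cheap guardrail is a character budget: keep adding the best-ranked chunks until you hit a cap. (The 6000-character default below is a guess; size it to your model's `num_ctx`.)

```python
def trim_contexts(contexts: list[dict], max_chars: int = 6000) -> list[dict]:
    """Keep the best-ranked chunks that fit in a rough character budget.
    Contexts are assumed sorted best-first, which is how the retrieval
    step returns them."""
    kept, used = [], 0
    for ctx in contexts:
        length = len(ctx['content'])
        if used + length > max_chars:
            break
        kept.append(ctx)
        used += length
    return kept
```

Call it between retrieval and generation; since the list is sorted best-first, the budget squeezes out the weakest matches, not the strongest.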
5. Skipping evaluation. Build a small test set of question-answer pairs. Run your RAG system against them periodically to measure quality. Without this, you’re tuning blind.
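Even a crude harness beats nothing. This sketch checks whether each answer mentions an expected keyword; the test cases are invented placeholders for your own:

```python
def evaluate(ask_fn, test_cases: list[dict]) -> float:
    """Crude smoke test: fraction of questions whose answer contains
    the expected keyword. `ask_fn` is your RAG entry point."""
    passed = 0
    for case in test_cases:
        answer = ask_fn(case['question']).lower()
        if case['expect_keyword'].lower() in answer:
            passed += 1
        else:
            print(f"FAIL: {case['question']!r} (expected {case['expect_keyword']!r})")
    score = passed / len(test_cases)
    print(f"{passed}/{len(test_cases)} passed ({score:.0%})")
    return score

# Placeholder cases; replace with real questions about your docs.
test_cases = [
    {'question': 'What is the refund window?', 'expect_keyword': '30 days'},
    {'question': 'How much PTO do new hires get?', 'expect_keyword': '15'},
]
# evaluate(ask, test_cases)  # run against your real `ask` function
```

Keyword matching is blunt (a correct paraphrase can still “fail”), but run it after every chunk-size or model change and regressions become obvious instead of invisible.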
Wrapping Up
You now have everything you need to build a local, private, free RAG system. The stack is simple: Ollama for embeddings and generation, ChromaDB for vector storage, and some Python glue to tie it all together.
Is it going to outperform a RAG system built on GPT-4 Turbo and Pinecone with a dedicated ML engineering team? No. But it’ll run on your laptop, cost you nothing, keep your data private, and be more than capable enough for internal docs, personal knowledge bases, and proof-of-concept projects.
The barrier to entry for AI-powered knowledge systems has essentially dropped to zero. The only thing standing between you and a custom AI that actually knows your stuff is about an hour of setup and a docker compose up.
Go build something.