You know what’s fun? Asking ChatGPT a question about your company’s internal docs and watching it confidently hallucinate an answer that sounds right but is completely made up. Real fun. Like trusting a confident stranger for directions in a city they’ve never visited.
RAG fixes this. And you don’t need a cloud subscription, a GPU cluster, or a second mortgage to build one.
In this guide, we’re going to build a fully local Retrieval-Augmented Generation system using Ollama (free, local LLMs) and ChromaDB (free, local vector database). Everything runs on your machine. No API keys. No metered billing. No sending your proprietary data to someone else’s servers.
Let’s get into it.
What Even Is RAG?
RAG stands for Retrieval-Augmented Generation. It’s a pattern where instead of relying on an LLM’s training data (which is frozen in time and might be wrong about your specific stuff), you first retrieve relevant documents from a knowledge base, then feed those documents to the LLM as context so it can generate an informed answer.
Think of it like this: imagine you hire a really smart intern. They’re brilliant, well-read, articulate — but they know absolutely nothing about your company. RAG is the equivalent of handing them a folder of relevant docs before each question. “Here, read these first, then answer.”
Without RAG, that intern just wings it. With RAG, they actually have receipts.
The RAG Pipeline in 30 Seconds
- Ingest: Take your documents, split them into chunks, generate embeddings (numerical representations) for each chunk, and store them in a vector database.
- Query: When a user asks a question, convert the question into an embedding, search the vector database for the most similar chunks, and retrieve the top results.
- Generate: Send those retrieved chunks along with the user’s question to the LLM, which generates an answer grounded in your actual data.
That’s it. Three steps. The magic is in the details, which we’re about to cover.
Why Ollama + ChromaDB?
There are approximately nine thousand ways to build a RAG system. Most tutorials point you at OpenAI’s API and Pinecone. Those work great — if you’re cool with paying per token and shipping your data to external servers.
Here’s why we’re going local:
- Ollama runs open-source LLMs (Llama 3, Mistral, Phi-3, Gemma, etc.) locally. It handles model management, serving, and — critically — embedding generation. Free. No API key.
- ChromaDB is an open-source vector database designed for AI applications. It’s lightweight, runs locally or in Docker, and has a dead-simple Python API.
- Privacy: Your data never leaves your machine. This matters for proprietary docs, medical records, legal files, or anything you’d rather not upload to the cloud.
- Cost: $0/month. Forever. The only cost is your hardware’s electricity bill.
The trade-off? Local models are smaller and less capable than GPT-4 or Claude. But for document Q&A over your own knowledge base? They’re more than good enough.
Setting Up the Stack with Docker Compose
Let’s get everything running. We’ll use Docker Compose to spin up both Ollama and ChromaDB so you don’t have to deal with installing dependencies on your host machine.
Create a `docker-compose.yml`:

```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Uncomment the following lines if you have an NVIDIA GPU
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  chromadb:
    image: chromadb/chroma:latest
    container_name: chromadb
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - ANONYMIZED_TELEMETRY=FALSE

volumes:
  ollama_data:
  chroma_data:
```
Fire it up:
```bash
docker compose up -d
```
Now pull the models you’ll need. We want a chat model and an embedding model:
```bash
# Chat model — llama3 is a solid all-rounder
docker exec ollama ollama pull llama3

# Embedding model — nomic-embed-text is great for RAG
docker exec ollama ollama pull nomic-embed-text
```
The embedding model is the unsung hero here. It converts text into vectors (arrays of numbers) that capture semantic meaning. `nomic-embed-text` produces 768-dimensional vectors and punches way above its weight for a model you can run on a laptop.
Give it a minute to download, then verify:
```bash
docker exec ollama ollama list
```
You should see both models listed. If you do, congratulations — you now have a local AI inference stack running. That was the hard part. (It wasn’t that hard.)
Python Project Setup
Create a project directory and set up a virtual environment:
```bash
mkdir rag-budget && cd rag-budget
python -m venv venv
source venv/bin/activate

pip install chromadb requests langchain langchain-community
```
We’re using `requests` to talk to Ollama’s REST API, `chromadb` for the vector store, and `langchain` for some helpful document processing utilities. You could do this without LangChain, but its text splitters save a lot of boilerplate.
Document Ingestion: Teaching Your System to Read
This is where we take your documents and prepare them for retrieval. The process has three phases: load, chunk, and embed.
Loading Documents
Let’s start simple with text files. Create a docs/ folder and drop some files in there:
```python
import os

def load_documents(docs_dir: str) -> list[dict]:
    """Load all text files from a directory."""
    documents = []
    for filename in os.listdir(docs_dir):
        if filename.endswith(('.txt', '.md')):
            filepath = os.path.join(docs_dir, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                content = f.read()
            documents.append({
                'content': content,
                'metadata': {
                    'source': filename,
                    'filepath': filepath
                }
            })
    print(f"Loaded {len(documents)} documents")
    return documents
```
Nothing fancy. For production, you’d want to handle PDFs, Word docs, HTML, etc. LangChain has loaders for all of those, but let’s keep it focused.
Chunking: The Art of Splitting Text
Here’s where a lot of RAG systems silently go wrong. You can’t just shove an entire 50-page document into a vector database and expect good results. You need to split it into chunks — but how you split matters enormously.
Why chunking matters: Vector search finds the chunks most similar to your query. If your chunks are too big, they contain too much irrelevant noise and the signal gets diluted. If they’re too small, they lack enough context to be useful. It’s a Goldilocks problem.
Here are the main strategies:
Fixed-Size Chunking (Simple but Effective)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents: list[dict],
                    chunk_size: int = 500,
                    chunk_overlap: int = 50) -> list[dict]:
    """Split documents into overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = []
    for doc in documents:
        splits = splitter.split_text(doc['content'])
        for i, split in enumerate(splits):
            chunks.append({
                'content': split,
                'metadata': {
                    **doc['metadata'],
                    'chunk_index': i,
                    'chunk_total': len(splits)
                }
            })
    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    return chunks
```
The `RecursiveCharacterTextSplitter` is smart about where it cuts. It tries to split at paragraph breaks first, then sentences, then words. The `chunk_overlap` parameter creates overlap between consecutive chunks so you don’t lose context at the boundaries.
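If the overlap mechanics feel abstract, here’s a stripped-down fixed-size chunker in plain Python (a toy stand-in for illustration, not LangChain’s actual algorithm):

```python
def naive_chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size chunking: each chunk starts `chunk_size - chunk_overlap`
    characters after the previous one, so consecutive chunks share
    `chunk_overlap` characters at the boundary."""
    assert 0 <= chunk_overlap < chunk_size
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Refunds are issued within 30 days of purchase to the original payment method."
for chunk in naive_chunk(text, chunk_size=40, chunk_overlap=10):
    print(repr(chunk))
```

Notice that the last 10 characters of each chunk reappear at the start of the next. Information sitting right on a boundary still lands intact in at least one chunk, which is exactly the insurance overlap buys you.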
Choosing Chunk Size
Here’s my rule of thumb:
| Chunk Size | Best For | Trade-off |
|---|---|---|
| 200-300 chars | Precise factual Q&A | May lack context |
| 500-800 chars | General document Q&A | Good balance (start here) |
| 1000-1500 chars | Summarization, complex topics | More noise per chunk |
Start with 500 characters and a 50-character overlap. Adjust based on your results. This is the single most impactful tuning knob in your entire RAG system, and most people barely touch it.
Generating Embeddings with Ollama
Now we convert each chunk into a vector using Ollama’s embedding API:
```python
import requests

OLLAMA_BASE_URL = "http://localhost:11434"

def get_embedding(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Generate embedding for a single text using Ollama."""
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/embeddings",
        json={"model": model, "prompt": text}
    )
    response.raise_for_status()
    return response.json()["embedding"]

def get_embeddings_batch(texts: list[str],
                         model: str = "nomic-embed-text") -> list[list[float]]:
    """Generate embeddings for multiple texts."""
    embeddings = []
    for i, text in enumerate(texts):
        embeddings.append(get_embedding(text, model))
        if (i + 1) % 50 == 0:
            print(f"  Embedded {i + 1}/{len(texts)} chunks...")
    return embeddings
```
Each call to the embedding endpoint takes a string and returns a 768-dimensional vector. On a decent CPU, expect around 10-30 chunks per second. With a GPU, it’s significantly faster.
Storing in ChromaDB
Now we stick those embeddings into ChromaDB:
```python
import chromadb

def create_collection(chunks: list[dict],
                      collection_name: str = "knowledge_base"):
    """Create a ChromaDB collection and add document chunks."""
    client = chromadb.HttpClient(host="localhost", port=8000)

    # Delete existing collection if it exists (for re-indexing)
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass

    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # Use cosine similarity
    )

    # Prepare data for batch insertion
    documents = [chunk['content'] for chunk in chunks]
    metadatas = [chunk['metadata'] for chunk in chunks]
    ids = [f"chunk_{i}" for i in range(len(chunks))]

    # Generate embeddings
    print("Generating embeddings...")
    embeddings = get_embeddings_batch(documents)

    # Add to collection in batches
    batch_size = 100
    for i in range(0, len(documents), batch_size):
        end = min(i + batch_size, len(documents))
        collection.add(
            documents=documents[i:end],
            embeddings=embeddings[i:end],
            metadatas=metadatas[i:end],
            ids=ids[i:end]
        )
    print(f"Added {len(documents)} chunks to collection '{collection_name}'")
    return collection
```
The `hnsw:space` setting of `cosine` tells ChromaDB to use cosine similarity for vector comparison. This is the standard choice for text embeddings — it measures the angle between vectors rather than the distance between them, which works better for comparing semantic meaning.
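If you want to see what “measures the angle” means concretely, cosine similarity is a few lines of plain Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = orthogonal (unrelated), -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A vector and a scaled copy point the same way, so similarity is 1.0
# even though their Euclidean distance is large.
print(round(cosine_similarity([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]), 6))  # → 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))               # → 0.0
```

Note that ChromaDB reports cosine *distance*, which is 1 minus this similarity; a distance near 0 means a very close match.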
Putting Ingestion Together
Here’s the complete ingestion pipeline:
```python
def ingest_documents(docs_dir: str = "docs",
                     collection_name: str = "knowledge_base"):
    """Full ingestion pipeline: load -> chunk -> embed -> store."""
    # Load
    documents = load_documents(docs_dir)
    if not documents:
        print("No documents found!")
        return None

    # Chunk
    chunks = chunk_documents(documents, chunk_size=500, chunk_overlap=50)

    # Embed and store
    collection = create_collection(chunks, collection_name)
    return collection

# Run it
collection = ingest_documents("docs")
```
Drop some text files in docs/, run this script, and your knowledge base is built. That’s the hardest part done.
Querying: Asking Questions
Now for the fun part. Let’s build the query pipeline that retrieves relevant chunks and generates answers.
Retrieval
```python
def retrieve_context(query: str,
                     collection_name: str = "knowledge_base",
                     n_results: int = 5) -> list[dict]:
    """Retrieve the most relevant chunks for a query."""
    client = chromadb.HttpClient(host="localhost", port=8000)
    collection = client.get_collection(collection_name)

    # Generate query embedding
    query_embedding = get_embedding(query)

    # Search
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )

    # Format results
    contexts = []
    for i in range(len(results['documents'][0])):
        contexts.append({
            'content': results['documents'][0][i],
            'metadata': results['metadatas'][0][i],
            'distance': results['distances'][0][i]
        })
    return contexts
```
This takes a question, embeds it with the same model used for the documents (this is important — always use the same embedding model for ingestion and querying), searches ChromaDB for the nearest vectors, and returns the top matches.
Generation
Now we feed those retrieved chunks to Ollama’s chat model:
```python
def generate_answer(query: str,
                    contexts: list[dict],
                    model: str = "llama3") -> str:
    """Generate an answer using retrieved context."""
    # Build context string
    context_text = "\n\n---\n\n".join([
        f"[Source: {ctx['metadata'].get('source', 'unknown')}]\n{ctx['content']}"
        for ctx in contexts
    ])

    # Build prompt
    prompt = f"""You are a helpful assistant. Answer the user's question based ONLY
on the provided context. If the context doesn't contain enough information to answer
the question, say so — do not make up information.

CONTEXT:
{context_text}

QUESTION: {query}

ANSWER:"""

    # Call Ollama
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.3,  # Lower = more factual
                "num_ctx": 4096      # Context window size
            }
        }
    )
    response.raise_for_status()
    return response.json()["response"]
```
A few key details in that prompt:
- “based ONLY on the provided context” — This is the anti-hallucination instruction. Without it, the model will happily fill gaps with made-up information.
- Temperature 0.3 — Lower temperature means more deterministic, factual responses. For creative writing, crank it up. For document Q&A, keep it low.
- Source attribution — We include the source file in the context so the model can reference where information came from.
The Complete Query Function
```python
def ask(query: str,
        collection_name: str = "knowledge_base",
        n_results: int = 5,
        model: str = "llama3") -> str:
    """Full RAG pipeline: retrieve context, then generate answer."""
    print(f"\nQuery: {query}")
    print("Retrieving relevant context...")
    contexts = retrieve_context(query, collection_name, n_results)

    print(f"Found {len(contexts)} relevant chunks:")
    for i, ctx in enumerate(contexts):
        source = ctx['metadata'].get('source', 'unknown')
        distance = ctx['distance']
        print(f"  [{i+1}] {source} (similarity: {1 - distance:.3f})")

    print("Generating answer...")
    answer = generate_answer(query, contexts, model)
    print(f"\nAnswer: {answer}")
    return answer

# Try it out
ask("What is our refund policy?")
ask("How do I configure the API authentication?")
```
That’s your complete RAG system. Load docs, chunk them, embed them, store them, retrieve them, generate answers. All local. All free.
Real-World Walkthrough: Internal Docs Chatbot
Let’s make this concrete. Say you’re building a chatbot that answers questions about your company’s internal documentation — onboarding guides, API docs, HR policies, engineering runbooks.
Step 1: Organize Your Docs
```
docs/
├── onboarding/
│   ├── getting-started.md
│   ├── dev-environment-setup.md
│   └── team-structure.md
├── engineering/
│   ├── api-reference.md
│   ├── deployment-guide.md
│   └── incident-response.md
└── hr/
    ├── pto-policy.md
    ├── expense-reports.md
    └── benefits-guide.md
```
Step 2: Enhance the Loader for Subdirectories
```python
import os

def load_documents_recursive(docs_dir: str) -> list[dict]:
    """Load documents from a nested directory structure."""
    documents = []
    for root, dirs, files in os.walk(docs_dir):
        for filename in files:
            if filename.endswith(('.txt', '.md')):
                filepath = os.path.join(root, filename)
                # Get relative path for better source tracking
                rel_path = os.path.relpath(filepath, docs_dir)
                category = os.path.dirname(rel_path) or "general"
                with open(filepath, 'r', encoding='utf-8') as f:
                    content = f.read()
                documents.append({
                    'content': content,
                    'metadata': {
                        'source': filename,
                        'filepath': rel_path,
                        'category': category
                    }
                })
    print(f"Loaded {len(documents)} documents from {docs_dir}")
    return documents
```
Step 3: Add a Simple Chat Loop
```python
def chat():
    """Interactive chat loop for your knowledge base."""
    print("=" * 50)
    print("  Internal Docs Assistant")
    print("  Type 'quit' to exit, 'reindex' to rebuild")
    print("=" * 50)

    while True:
        query = input("\nYou: ").strip()
        if not query:
            continue
        if query.lower() == 'quit':
            print("Goodbye!")
            break
        if query.lower() == 'reindex':
            ingest_documents("docs")
            print("Re-indexed!")
            continue
        answer = ask(query)
        print(f"\nAssistant: {answer}")

chat()
```
That’s it. You now have an internal docs chatbot. Your team can ask natural-language questions about your documentation and get answers grounded in your actual content.
Performance Tips
Now that you have a working system, here are some ways to make it better:
1. Tune Your Chunk Size Empirically
Don’t just guess. Try different chunk sizes (300, 500, 800, 1200) on the same set of test questions and compare answer quality. The “right” size depends entirely on your documents. Technical API docs with dense information often do better with smaller chunks. Narrative documents like policies and guides can handle larger ones.
2. Use Metadata Filtering
ChromaDB supports metadata filters. If your user asks a question and you know it’s about HR policies, filter the search:
```python
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={"category": "hr"}  # Only search HR docs
)
```
This dramatically improves relevance when you have diverse document types.
3. Re-rank Results
Vector similarity isn’t perfect. A cheap trick that helps: retrieve more results than you need (say, 10-15), then use the LLM itself to re-rank them by relevance before using the top 5 for generation. It adds latency but improves answer quality significantly.
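Here’s one way to structure that, with the scoring function left pluggable. In a real system, `score_fn` would prompt the LLM for a relevance rating; the toy word-overlap scorer below just stands in for demonstration:

```python
from typing import Callable

def rerank(query: str,
           contexts: list[dict],
           score_fn: Callable[[str, str], float],
           top_k: int = 5) -> list[dict]:
    """Re-order retrieved chunks by `score_fn(query, chunk_text)` and
    keep the `top_k` best. The scorer is pluggable: in a real system it
    would ask the LLM to rate each chunk's relevance."""
    scored = [(score_fn(query, ctx['content']), ctx) for ctx in contexts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ctx for _, ctx in scored[:top_k]]

def word_overlap(query: str, text: str) -> float:
    """Toy scorer: number of words the query and chunk share."""
    return len(set(query.lower().split()) & set(text.lower().split()))

contexts = [{'content': 'the office dress code is casual'},
            {'content': 'refunds are issued within 30 days'}]
best = rerank("how many days for refunds", contexts, word_overlap, top_k=1)
print(best[0]['content'])  # → refunds are issued within 30 days
```

The pattern stays the same whatever the scorer: retrieve 10-15 candidates, score them all, keep the top handful for generation.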
4. Cache Embeddings
If your documents don’t change often, cache the embeddings. Don’t re-generate them every time you restart. ChromaDB persists data to disk by default with our Docker setup, so your embeddings survive container restarts.
5. Consider Your Hardware
- CPU-only: Works fine. Expect 2-5 seconds for embedding, 10-30 seconds for generation with Llama 3 8B.
- NVIDIA GPU: Uncomment the GPU section in the Docker Compose file. Embedding becomes near-instant, generation drops to 2-5 seconds.
- Apple Silicon: Run Ollama natively (not in Docker) to leverage the Metal GPU. Huge performance boost.
If you’re on an M1/M2/M3 Mac, install Ollama directly via `brew install ollama` instead of Docker — you’ll get significantly better performance through Metal acceleration.
When RAG Beats Fine-Tuning
People often ask: “Should I fine-tune a model on my data or use RAG?” Here’s the decision framework:
Choose RAG when:
- Your data changes frequently (docs get updated, new content added)
- You need source attribution (showing where answers came from)
- You want to keep your base model general-purpose
- Your data is proprietary and you don’t want it baked into model weights
- You need to get something working this week, not this quarter
Choose fine-tuning when:
- You need the model to adopt a specific tone, style, or behavior
- Your task is highly specialized (medical coding, legal classification)
- Retrieval latency is unacceptable for your use case
- Your knowledge is static and well-defined
Choose both when:
- You want the model to understand your domain deeply (fine-tune) AND stay current with changing data (RAG)
For most teams building internal tools, RAG is the right choice. It’s faster to set up, easier to maintain, and your data stays separate from the model — which means you can swap models without re-training.
Common Pitfalls (Learn from My Mistakes)
1. Using different embedding models for ingestion and querying. The vectors have to live in the same space. If you embed docs with nomic-embed-text and queries with all-minilm, your results will be garbage. Always match your models.
2. Not handling document updates. When a document changes, you need to re-ingest it. Build a simple mechanism to detect changes (file modification timestamps, hashes) and re-index only what changed.
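A minimal sketch of hash-based change detection, assuming a small state file (the `.index_state.json` name is arbitrary) that records what was last indexed:

```python
import hashlib
import json
import os

def file_fingerprints(docs_dir: str) -> dict[str, str]:
    """Map each doc's relative path to a SHA-256 digest of its contents."""
    fingerprints = {}
    for root, _dirs, files in os.walk(docs_dir):
        for filename in files:
            if filename.endswith(('.txt', '.md')):
                filepath = os.path.join(root, filename)
                with open(filepath, 'rb') as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                fingerprints[os.path.relpath(filepath, docs_dir)] = digest
    return fingerprints

def changed_files(docs_dir: str, state_file: str = ".index_state.json") -> list[str]:
    """Compare current fingerprints to the last indexed state, record the
    new state, and return the files that need re-ingesting."""
    current = file_fingerprints(docs_dir)
    previous = {}
    if os.path.exists(state_file):
        with open(state_file) as f:
            previous = json.load(f)
    changed = [path for path, digest in current.items()
               if previous.get(path) != digest]
    with open(state_file, 'w') as f:
        json.dump(current, f)
    return changed
```

On the first run everything counts as changed; afterwards only edited or new files come back, and you can delete and re-add just those chunks in ChromaDB.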
3. Ignoring chunk boundaries. If a critical piece of information spans two chunks and neither chunk captures the full context, your retrieval will miss it. That’s what chunk_overlap is for. Don’t set it to zero.
4. Stuffing too much context into the prompt. More retrieved chunks isn’t always better. If you stuff 20 chunks into the prompt, the LLM has to parse through a wall of text and might get confused or miss the relevant part. 3-5 well-chosen chunks usually outperform 15 mediocre ones.
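A cheap guardrail is a character budget: keep adding the best-ranked chunks until you hit a cap. (The 6000-character default below is a guess; size it to your model's `num_ctx`.)

```python
def trim_contexts(contexts: list[dict], max_chars: int = 6000) -> list[dict]:
    """Keep the best-ranked chunks that fit in a rough character budget.
    Contexts are assumed sorted best-first, which is how the retrieval
    step returns them."""
    kept, used = [], 0
    for ctx in contexts:
        length = len(ctx['content'])
        if used + length > max_chars:
            break
        kept.append(ctx)
        used += length
    return kept
```

Call it between retrieval and generation; since the list is sorted best-first, the budget squeezes out the weakest matches, not the strongest.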
5. Skipping evaluation. Build a small test set of question-answer pairs. Run your RAG system against them periodically to measure quality. Without this, you’re tuning blind.
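Even a crude harness beats nothing. This sketch checks whether each answer mentions an expected keyword; the test cases are invented placeholders for your own:

```python
def evaluate(ask_fn, test_cases: list[dict]) -> float:
    """Crude smoke test: fraction of questions whose answer contains
    the expected keyword. `ask_fn` is your RAG entry point."""
    passed = 0
    for case in test_cases:
        answer = ask_fn(case['question']).lower()
        if case['expect_keyword'].lower() in answer:
            passed += 1
        else:
            print(f"FAIL: {case['question']!r} (expected {case['expect_keyword']!r})")
    score = passed / len(test_cases)
    print(f"{passed}/{len(test_cases)} passed ({score:.0%})")
    return score

# Placeholder cases; replace with real questions about your docs.
test_cases = [
    {'question': 'What is the refund window?', 'expect_keyword': '30 days'},
    {'question': 'How much PTO do new hires get?', 'expect_keyword': '15'},
]
# evaluate(ask, test_cases)  # run against your real `ask` function
```

Keyword matching is blunt (a correct paraphrase can still “fail”), but run it after every chunk-size or model change and regressions become obvious instead of invisible.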
Wrapping Up
You now have everything you need to build a local, private, free RAG system. The stack is simple: Ollama for embeddings and generation, ChromaDB for vector storage, and some Python glue to tie it all together.
Is it going to outperform a RAG system built on GPT-4 Turbo and Pinecone with a dedicated ML engineering team? No. But it’ll run on your laptop, cost you nothing, keep your data private, and be more than capable enough for internal docs, personal knowledge bases, and proof-of-concept projects.
The barrier to entry for AI-powered knowledge systems has essentially dropped to zero. The only thing standing between you and a custom AI that actually knows your stuff is about an hour of setup and a docker compose up.
Go build something.