The Default Assumption
You’re building a RAG system. You need embeddings. So you use OpenAI’s API:
```python
from openai import OpenAI

client = OpenAI(api_key="...")
embedding = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="text-embedding-3-small",
)
```

It works. Your text goes to OpenAI, gets embedded, and the vectors come back to you. It's convenient. But you're paying per token, and OpenAI sees all your data.
Most teams never consider alternatives. That’s a mistake.
What Embeddings Actually Are
Here’s the thing: embeddings are just numbers. Specifically, lists of numbers (vectors) that represent meaning. When you feed text into an embedding model, it converts that text into a fixed-length vector, usually 384 to 1024 floats. The model is trained so that similar text produces similar vectors. That’s it.
RAG systems use embeddings to find relevant documents. You embed your query, embed your documents, then use math (cosine similarity, usually) to find which documents are closest to your query in vector space. The better the embedding model, the more “meaning-aware” that similarity is.
Think of it like this: a bad embedding model might think “apple fruit” and “apple computer” are equally similar. A good one knows they’re different meanings in different contexts.
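That "closeness" is just cosine similarity. A minimal NumPy sketch with made-up 3-dimensional vectors (real models emit hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for illustration; an actual model would produce these
query = np.array([0.9, 0.1, 0.0])
doc_about_docker = np.array([0.8, 0.2, 0.1])
doc_about_cats = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(query, doc_about_docker))  # high, ~0.98
print(cosine_similarity(query, doc_about_cats))    # low, ~0.01
```

Retrieval is just computing this score between the query vector and every document vector, then sorting.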
The Three Embedding Categories
1. Closed-Source APIs (OpenAI, Cohere, etc.)
Pros:
- Well-tuned, high quality
- Constantly updated
- Industry standard (good for benchmarks)

Cons:
- Data leaves your infrastructure
- Costs per token
- Vendor lock-in

When to use: You have sensitive data going elsewhere anyway, or cost is irrelevant.
2. Open-Source Models (Sentence Transformers, etc.)
Pros:
- Runs locally (no data leakage)
- Free
- Customizable (can fine-tune)

Cons:
- Lower quality than OpenAI on some tasks
- You maintain the inference
- Need to pick the right model

When to use: Privacy matters, or you have lots of domain-specific data.
3. Hybrid (API with open source fallback)
Pros:
- Best of both worlds
- API for critical paths, local for cost
- Flexibility

Cons:
- More complex
- Inconsistent embeddings across sources

When to use: Large-scale systems where costs matter.
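As a sketch of what the hybrid pattern can look like in code (the `embed_api` and `embed_local` functions here are hypothetical stand-ins for whatever clients you actually use):

```python
def embed_api(text: str) -> list[float]:
    """Hypothetical API-backed embedder; raises when the service is unreachable."""
    raise ConnectionError("pretend the API is down")

def embed_local(text: str) -> list[float]:
    """Hypothetical local embedder; always available."""
    return [0.1, 0.2, 0.3]  # stand-in vector

def embed(text: str, critical: bool = False) -> list[float]:
    """Critical paths try the API first; everything else stays local."""
    if critical:
        try:
            return embed_api(text)
        except ConnectionError:
            pass  # fall back to local rather than fail the request
    return embed_local(text)

print(embed("billing question", critical=True))  # API down, so local fallback
```

One caveat worth repeating from the cons above: vectors from different models live in different spaces, so never mix them in the same index; fall back per request, not per document.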
Local Embeddings: The Practical Options
Here are the free options that actually work:
| Model | Dimension | Speed | Quality | Use Case |
|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | Fast | Excellent | Default choice, balanced |
| bge-m3 | 1024 | Moderate | Excellent | Multilingual, complex queries |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | CPU-only, simple docs |
| all-mpnet-base-v2 | 768 | Moderate | Good | Better than MiniLM, slower |
| mxbai-embed-large | 1024 | Moderate | Very Good | Dense queries, slower |
The quality difference is real but subtle for most use cases. nomic-embed-text is the sweet spot: 768 dimensions, fast, excellent MTEB scores, and works great in Ollama.
Why Model Choice Actually Matters for RAG
Embedding quality mostly shows up in ranking, not in whether anything is retrieved: a weak model still returns something, a strong one puts the right chunk first. Here’s the difference:
Bad embedding model:
- Query: "How do I use Docker?"
- Top result: "Docker is used by companies" ✓ Correct (but generic)
- Rank 5: "Why is my cat purple?" ✓ Obviously wrong, ranked low

Good embedding model:
- Query: "How do I use Docker?"
- Top result: "docker run -it image bash" ✓ Specific, actionable
- Rank 5: Still wrong, still ranked low

For RAG, this matters because:
- You typically include 3–5 retrieved chunks in your LLM’s context
- If embedding quality is poor, your best chunk might be rank 7, not rank 1
- The LLM sees less-relevant context first, and garbage in = garbage out
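The top-k cutoff is just an argsort over similarity scores, which makes the failure mode easy to see. A sketch with invented scores for 8 chunks, where chunk 0 is the truly relevant one:

```python
import numpy as np

top_k = 5

# Hypothetical similarity scores: a good model puts the relevant chunk (index 0) on top...
good = np.array([0.81, 0.62, 0.58, 0.55, 0.53, 0.52, 0.50, 0.40])
# ...while a weak model buries the same chunk at rank 6
weak = np.array([0.51, 0.62, 0.58, 0.55, 0.53, 0.52, 0.50, 0.40])

print(np.argsort(-good)[:top_k])  # chunk 0 makes the context window
print(np.argsort(-weak)[:top_k])  # chunk 0 is cut; the LLM never sees it
```

Everything below the cutoff is invisible to the LLM, no matter how relevant it was.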
On the MTEB (Massive Text Embedding Benchmark) leaderboard, nomic-embed-text and bge-m3 consistently rank in the top 10 among open-source models. That’s not random: these models genuinely capture meaning better.
Dimensions vs. Quality: The Tradeoff
More dimensions = more expressiveness, but also slower and more storage. For most RAG:
- 384 dimensions: CPU-friendly, still decent quality (MiniLM)
- 768 dimensions: Best balance (nomic, all-mpnet)
- 1024+ dimensions: Maximum quality, but slower (bge-m3, mxbai)
Unless you’re doing million-scale retrieval or running on a potato, 768 is fine.
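The storage side of that tradeoff is simple arithmetic: vectors × dimensions × 4 bytes for float32. A quick sketch (ignoring the extra overhead of index structures like HNSW):

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw float32 vector storage; real vector indexes add overhead on top."""
    return n_vectors * dims * bytes_per_float / 1e9

# Storage for 1M document chunks at each dimension tier
for dims in (384, 768, 1024):
    print(dims, round(index_size_gb(1_000_000, dims), 2), "GB")
```

At a million chunks, 768 dimensions costs roughly 3 GB of raw vectors, which is trivial for most deployments; the dimension choice only starts to bite at much larger scale.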
Practical Setup: Ollama + Local Embeddings
Pull and Embed with Ollama
```bash
# Download the embedding model (one-time, ~300MB)
ollama pull nomic-embed-text:v1.5

# Test it
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text:v1.5","prompt":"Docker containers are portable"}'
```

Python Integration
```python
import ollama

text = "Docker containers are lightweight and portable"
response = ollama.embeddings(
    model="nomic-embed-text:v1.5",
    prompt=text,
)
embedding = response["embedding"]
print(len(embedding))  # 768 dimensions
```

The first call loads the model into memory. Subsequent calls are fast.
Or Use Sentence Transformers Directly
```python
from sentence_transformers import SentenceTransformer

# nomic models ship custom modeling code, so trust_remote_code is required
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

texts = [
    "Docker is a containerization platform",
    "Kubernetes orchestrates containers",
    "My cat is sleeping",
]
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 768)
```

Test on Your Data
Don’t trust benchmarks. Benchmark on your actual documents.
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model_fast = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim
model_good = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)  # 768-dim

docs = [
    "Docker containers isolate applications",
    "Kubernetes automates container deployment",
    "AWS Lambda runs serverless functions",
    "Python is a programming language",
]
query = "How do I containerize my app?"

# Embed with both models
q1, d1 = model_fast.encode(query), model_fast.encode(docs)
q2, d2 = model_good.encode(query), model_good.encode(docs)

# Compare rankings
scores1 = cosine_similarity([q1], d1)[0]
scores2 = cosine_similarity([q2], d2)[0]
print("MiniLM ranking:", np.argsort(-scores1))
print("nomic ranking:", np.argsort(-scores2))

# Does nomic rank "Docker containers..." higher? Worth the extra latency?
```

The Cost/Speed Math
OpenAI text-embedding-3-small:
- Cost: $0.02 per 1M tokens
- Speed: network latency + API time (~100-200ms)
- Quality: excellent

Local (nomic-embed-text on GPU):
- Cost: free, once you have the hardware
- Speed: ~10-50ms per document
- Quality: excellent (MTEB top 10)

Embedding a 1000-page knowledge base (500 tokens/page, so 500K tokens):
- OpenAI: about a cent, one-time
- Local: free (your GPU does the work)

At these prices the API is cheap at small scale; local only wins on raw cost at very high volume. The stronger arguments for local are elsewhere: if you’re privacy-conscious or self-hosting, local is non-negotiable.
When to Upgrade vs. Stick
Stick with all-MiniLM if:
- Running on CPU only
- Documents are simple/straightforward
- You’re just prototyping
Upgrade to nomic-embed-text or bge-m3 if:
- You have a GPU or good CPU
- Your documents are technical or nuanced
- Query ranking quality matters (it does in production RAG)
- You’ve benchmarked and MiniLM ranks important docs poorly
Match embedding dimensions to LLM context:
- Small LLM (8K context): smaller embeddings (384) is fine
- Large LLM (128K+ context): bigger embeddings (768+) help more
- Reason: with more retrieved chunks in the context window, fine-grained relevance ranking matters more
The Real Decision
Use OpenAI if: Privacy isn’t a concern, budget is unlimited, you want SOTA quality.
Use local if: Privacy matters, you query frequently, you embed a large corpus once, you self-host.
Use hybrid if: Critical queries use API, bulk queries use local.
Most teams go local and never look back. For common tasks the quality gap is maybe 5–10%, rarely worth the cost and privacy tradeoff.
Test on your data, pick what works, commit to it.