The Embedding Model Choice Nobody Explains

By SumGuy 6 min read

The Default Assumption

You’re building a RAG system. You need embeddings. So you use OpenAI’s API:

```python
from openai import OpenAI

client = OpenAI(api_key="...")
embedding = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="text-embedding-3-small",
)
```

It works. Your text goes to OpenAI, gets embedded, and the vectors come back. It’s convenient. But you’re paying per token, and OpenAI sees all your data.

Most teams never consider alternatives. That’s a mistake.

What Embeddings Actually Are

Here’s the thing: embeddings are just numbers. Specifically, lists of numbers (vectors) that represent meaning. When you feed text into an embedding model, it converts that text into a fixed-length vector, usually 384 to 1024 floats. The model is trained so that similar text produces similar vectors. That’s it.

RAG systems use embeddings to find relevant documents. You embed your query, embed your documents, then use math (cosine similarity, usually) to find which documents are closest to your query in vector space. The better the embedding model, the more “meaning-aware” that similarity is.

Think of it like this: a bad embedding model might think “apple fruit” and “apple computer” are equally similar. A good one knows they’re different meanings in different contexts.
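To make “closest in vector space” concrete, here’s the similarity math on toy three-dimensional vectors. The numbers are made up purely for illustration; real models output hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim "embeddings" (real models output 384-1024 floats)
apple_fruit    = np.array([0.9, 0.1, 0.0])
apple_computer = np.array([0.1, 0.9, 0.2])
banana         = np.array([0.8, 0.2, 0.1])

print(cosine_similarity(apple_fruit, banana))          # high: related meanings
print(cosine_similarity(apple_fruit, apple_computer))  # low: different meanings
```

A good embedding model does the hard part: placing “apple fruit” near “banana” and far from “apple computer” in the first place. The similarity math itself is this simple.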

The Three Embedding Categories

1. Closed-Source APIs (OpenAI, Cohere, etc.)

Pros:
- Well-tuned, high quality
- Constantly updated
- Industry standard (good for benchmarks)
Cons:
- Data leaves your infrastructure
- Costs per token
- Vendor lock-in

When to use: You have sensitive data going elsewhere anyway, or cost is irrelevant.

2. Open-Source Models (Sentence Transformers, etc.)

Pros:
- Runs locally (no data leakage)
- Free
- Customizable (can fine-tune)
Cons:
- Lower quality than OpenAI on some tasks
- You maintain the inference
- Need to pick the right model

When to use: Privacy matters, or you have lots of domain-specific data.

3. Hybrid (API with open source fallback)

Pros:
- Best of both worlds
- API for critical paths, local for cost
- Flexibility
Cons:
- More complex
- Inconsistent embeddings across sources

When to use: Large-scale systems where costs matter.
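The hybrid pattern is easy to sketch. This is a minimal illustration, not a library API: `HybridEmbedder` and its arguments are names I made up, and it assumes clients shaped like the `openai` and `sentence-transformers` ones used elsewhere in this post. The real constraint is the “inconsistent embeddings” con above: the two models produce incompatible vector spaces, so any given index must be built and queried with the same backend.

```python
class HybridEmbedder:
    """Route critical queries to the API, bulk work to a local model.

    Caveat: the two backends produce incompatible vector spaces, so a
    given index must be built and queried with the same backend.
    """

    def __init__(self, api_client, local_model):
        self.api_client = api_client    # e.g. OpenAI() from the openai package
        self.local_model = local_model  # e.g. a SentenceTransformer instance

    def embed(self, text: str, critical: bool = False) -> list[float]:
        if critical:
            resp = self.api_client.embeddings.create(
                input=text, model="text-embedding-3-small"
            )
            return resp.data[0].embedding
        # Local path: free, private, fine for bulk jobs
        return self.local_model.encode(text).tolist()
```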

Local Embeddings: The Practical Options

Here are the free options that actually work:

| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | Fast | Excellent | Default choice, balanced |
| bge-m3 | 1024 | Moderate | Excellent | Multilingual, complex queries |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | CPU-only, simple docs |
| all-mpnet-base-v2 | 768 | Moderate | Good | Better than MiniLM, slower |
| mxbai-embed-large | 1024 | Moderate | Very Good | Dense queries, slower |

The quality difference is real but subtle for most use cases. nomic-embed-text is the sweet spot: 768 dimensions, fast, excellent MTEB scores, and works great in Ollama.

Why Model Choice Actually Matters for RAG

Embedding quality mostly shows up in how results are ranked, not in whether relevant documents are retrieved at all. Here’s the difference:

Bad embedding model:
- Query: "How do I use Docker?"
- Top result: "Docker is used by companies" ✓ Correct (but generic)
- Rank 5: "Why is my cat purple?" ✓ Obviously wrong, ranked low
Good embedding model:
- Query: "How do I use Docker?"
- Top result: "docker run -it image bash" ✓ Specific, actionable
- Rank 5: Still wrong, still ranked low

For RAG, this matters because the top-ranked chunks are the ones that actually reach the LLM: a generic match crowding out a specific one degrades every answer downstream.

On the MTEB (Massive Text Embedding Benchmark) leaderboard, you’ll see models ranked. nomic-embed-text and bge-m3 consistently rank in the top 10 for open-source. That’s not random—these models genuinely understand meaning better.

Dimensions vs. Quality: The Tradeoff

More dimensions mean more expressive vectors, but also slower search and more storage. For most RAG workloads, 768 is the sweet spot: unless you’re doing million-scale retrieval or running on a potato, 768 is fine.
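To put numbers on the storage side of that tradeoff, here’s my own back-of-envelope helper, assuming float32 vectors and ignoring index overhead:

```python
def index_size_mb(num_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw float32 vector storage in MB, ignoring index overhead."""
    return num_vectors * dims * bytes_per_float / 1_000_000

# One million documents at each common dimensionality:
for dims in (384, 768, 1024):
    print(f"{dims} dims: {index_size_mb(1_000_000, dims):,.0f} MB")
# 384 is ~1.5 GB, 768 is ~3 GB, 1024 is ~4 GB
```

At thousands of documents the difference is noise; at millions, it’s the gap between fitting in RAM and not.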

Practical Setup: Ollama + Local Embeddings

Pull and Embed with Ollama

```bash
# Download the embedding model (one-time, ~300MB)
ollama pull nomic-embed-text:v1.5

# Test it
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text:v1.5","prompt":"Docker containers are portable"}'
```

Python Integration

```python
import ollama

text = "Docker containers are lightweight and portable"
response = ollama.embeddings(
    model="nomic-embed-text:v1.5",
    prompt=text,
)
embedding = response["embedding"]
print(len(embedding))  # 768 dimensions
```

If you skipped the pull, the first call downloads the model; subsequent calls are fast.

Or Use Sentence Transformers Directly

```python
from sentence_transformers import SentenceTransformer

# The nomic model ships custom code, so trust_remote_code is required
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
texts = [
    "Docker is a containerization platform",
    "Kubernetes orchestrates containers",
    "My cat is sleeping",
]
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 768)
```

Test on Your Data

Don’t trust benchmarks. Benchmark on your actual documents.

```python
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

model_fast = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim
model_good = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True
)  # 768-dim

docs = [
    "Docker containers isolate applications",
    "Kubernetes automates container deployment",
    "AWS Lambda runs serverless functions",
    "Python is a programming language",
]
query = "How do I containerize my app?"

# Embed with both models
q1, d1 = model_fast.encode(query), model_fast.encode(docs)
q2, d2 = model_good.encode(query), model_good.encode(docs)

# Compare rankings
scores1 = cosine_similarity([q1], d1)[0]
scores2 = cosine_similarity([q2], d2)[0]
print("MiniLM ranking:", np.argsort(-scores1))
print("nomic ranking:", np.argsort(-scores2))
# Does nomic rank "Docker containers..." higher? Worth the extra latency?
```

The Cost/Speed Math

OpenAI text-embedding-3-small:
- Cost: $0.02 per 1M tokens
- Speed: network latency + API time (~100–200ms)
- Quality: Excellent

Local (nomic-embed-text on GPU):
- Cost: free
- Speed: ~10–50ms per document
- Quality: Excellent (MTEB top 10)

Embedding a 1000-page knowledge base (500 tokens/page, so ~500K tokens):
- OpenAI: ~$0.01 one-time
- Local: free (your GPU does the work)

10K queries/month (~100 tokens each, so ~1M tokens):
- OpenAI: ~$0.02/month
- Local: free
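At $0.02 per million tokens, the arithmetic is worth writing out. The tokens-per-page and tokens-per-query figures are rough assumptions:

```python
def embedding_cost_usd(tokens: int, usd_per_million_tokens: float = 0.02) -> float:
    """Cost at text-embedding-3-small pricing ($0.02 per 1M tokens)."""
    return tokens / 1_000_000 * usd_per_million_tokens

corpus_tokens = 1000 * 500  # 1000 pages x 500 tokens/page
print(embedding_cost_usd(corpus_tokens))  # 0.01 (one cent, one-time)

monthly_query_tokens = 10_000 * 100  # 10K queries x ~100 tokens each
print(embedding_cost_usd(monthly_query_tokens))  # 0.02 per month
```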

The absolute numbers are small, but API costs scale with volume while local stays free. If you’re querying frequently, local wins on cost. If you’re privacy-conscious or self-hosting, local is non-negotiable.

When to Upgrade vs. Stick

Stick with all-MiniLM if:
- You’re CPU-only and latency matters
- Your documents are short and simple
- A small 384-dimension index is a priority

Upgrade to nomic-embed-text or bge-m3 if:
- Queries are complex or multilingual
- Retrieval quality noticeably affects answer quality
- You have a GPU, or can tolerate moderate latency

One thing you don’t need to match: embedding dimensions are independent of your LLM’s context window. The LLM sees the retrieved text, not the vectors.

The Real Decision

Use OpenAI if: Privacy isn’t a concern, budget is unlimited, you want SOTA quality.

Use local if: Privacy matters, you query frequently, you embed a large corpus once, you self-host.

Use hybrid if: Critical queries use API, bulk queries use local.

Most teams go local and never look back. The quality gap is maybe 5–10% on common tasks, which rarely justifies the extra cost and the privacy tradeoff.

Test on your data, pick what works, commit to it.

