The Default Assumption
You’re building a RAG system. You need embeddings. So you use OpenAI’s API:
```python
from openai import OpenAI

client = OpenAI(api_key="...")
embedding = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="text-embedding-3-small",
)
```

It works. Your text goes to OpenAI, gets embedded, and the vectors come back to you. It's convenient. But you're paying per token, and OpenAI sees all your data.
Most teams never consider alternatives. That’s a mistake.
What Embeddings Actually Are
Here’s the thing: embeddings are just numbers. Specifically, lists of numbers (vectors) that represent meaning. When you feed text into an embedding model, it converts that text into a fixed-length vector, usually 384 to 1024 floats. The model is trained so that similar text produces similar vectors. That’s it.
RAG systems use embeddings to find relevant documents. You embed your query, embed your documents, then use math (cosine similarity, usually) to find which documents are closest to your query in vector space. The better the embedding model, the more “meaning-aware” that similarity is.
Think of it like this: a bad embedding model might think “apple fruit” and “apple computer” are equally similar. A good one knows they’re different meanings in different contexts.
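That "closeness" is just cosine similarity. A minimal NumPy sketch with made-up 3-dimensional vectors (real models emit hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for illustration; an actual model would produce these
query = np.array([0.9, 0.1, 0.0])
doc_about_docker = np.array([0.8, 0.2, 0.1])
doc_about_cats = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(query, doc_about_docker))  # high, ~0.98
print(cosine_similarity(query, doc_about_cats))    # low, ~0.01
```

Retrieval is just computing this score between the query vector and every document vector, then sorting.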
The Three Embedding Categories
1. Closed-Source APIs (OpenAI, Cohere, etc.)
Pros:
- Well-tuned, high quality
- Constantly updated
- Industry standard (good for benchmarks)

Cons:
- Data leaves your infrastructure
- Costs per token
- Vendor lock-in

When to use: You have sensitive data going elsewhere anyway, or cost is irrelevant.
2. Open-Source Models (Sentence Transformers, etc.)
Pros:
- Runs locally (no data leakage)
- Free
- Customizable (can fine-tune)

Cons:
- Lower quality than OpenAI on some tasks
- You maintain the inference
- Need to pick the right model

When to use: Privacy matters, or you have lots of domain-specific data.
3. Hybrid (API with open source fallback)
Pros:
- Best of both worlds
- API for critical paths, local for cost
- Flexibility

Cons:
- More complex
- Inconsistent embeddings across sources

When to use: Large-scale systems where costs matter.
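As a sketch of what the hybrid pattern can look like in code (the `embed_api` and `embed_local` functions here are hypothetical stand-ins for whatever clients you actually use):

```python
def embed_api(text: str) -> list[float]:
    """Hypothetical API-backed embedder; raises when the service is unreachable."""
    raise ConnectionError("pretend the API is down")

def embed_local(text: str) -> list[float]:
    """Hypothetical local embedder; always available."""
    return [0.1, 0.2, 0.3]  # stand-in vector

def embed(text: str, critical: bool = False) -> list[float]:
    """Critical paths try the API first; everything else stays local."""
    if critical:
        try:
            return embed_api(text)
        except ConnectionError:
            pass  # fall back to local rather than fail the request
    return embed_local(text)

print(embed("billing question", critical=True))  # API down, so local fallback
```

One caveat worth repeating from the cons above: vectors from different models live in different spaces, so never mix them in the same index; fall back per request, not per document.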
Local Embeddings: The Practical Options
Here are the free options that actually work:
| Model | Dimension | Speed | Quality | Use Case |
|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | Fast | Excellent | Default choice, balanced |
| bge-m3 | 1024 | Moderate | Excellent | Multilingual, complex queries |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | CPU-only, simple docs |
| all-mpnet-base-v2 | 768 | Moderate | Good | Better than MiniLM, slower |
| mxbai-embed-large | 1024 | Moderate | Very Good | Dense queries, slower |
The quality difference is real but subtle for most use cases. nomic-embed-text is the sweet spot: 768 dimensions, fast, excellent MTEB scores, and works great in Ollama.
Why Model Choice Actually Matters for RAG
Embedding quality mostly shows up in ranking, not in whether anything is retrieved: a weak model still returns something, a strong one puts the right chunk first. Here’s the difference:
Bad embedding model:
- Query: "How do I use Docker?"
- Top result: "Docker is used by companies" ✓ Correct (but generic)
- Rank 5: "Why is my cat purple?" ✓ Obviously wrong, ranked low

Good embedding model:
- Query: "How do I use Docker?"
- Top result: "docker run -it image bash" ✓ Specific, actionable
- Rank 5: Still wrong, still ranked low

For RAG, this matters because:
- You typically include 3–5 retrieved chunks in your LLM’s context
- If embedding quality is poor, your best chunk might be rank 7, not rank 1
- The LLM sees less-relevant context first, and garbage in = garbage out
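The top-k cutoff is just an argsort over similarity scores, which makes the failure mode easy to see. A sketch with invented scores for 8 chunks, where chunk 0 is the truly relevant one:

```python
import numpy as np

top_k = 5

# Hypothetical similarity scores: a good model puts the relevant chunk (index 0) on top...
good = np.array([0.81, 0.62, 0.58, 0.55, 0.53, 0.52, 0.50, 0.40])
# ...while a weak model buries the same chunk at rank 6
weak = np.array([0.51, 0.62, 0.58, 0.55, 0.53, 0.52, 0.50, 0.40])

print(np.argsort(-good)[:top_k])  # chunk 0 makes the context window
print(np.argsort(-weak)[:top_k])  # chunk 0 is cut; the LLM never sees it
```

Everything below the cutoff is invisible to the LLM, no matter how relevant it was.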
On the MTEB (Massive Text Embedding Benchmark) leaderboard, nomic-embed-text and bge-m3 consistently rank in the top 10 among open-source models. That’s not random: these models genuinely capture meaning better.
Dimensions vs. Quality: The Tradeoff
More dimensions = more expressiveness, but also slower and more storage. For most RAG:
- 384 dimensions: CPU-friendly, still decent quality (MiniLM)
- 768 dimensions: Best balance (nomic, all-mpnet)
- 1024+ dimensions: Maximum quality, but slower (bge-m3, mxbai)
Unless you’re doing million-scale retrieval or running on a potato, 768 is fine.
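The storage side of that tradeoff is simple arithmetic: vectors × dimensions × 4 bytes for float32. A quick sketch (ignoring the extra overhead of index structures like HNSW):

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw float32 vector storage; real vector indexes add overhead on top."""
    return n_vectors * dims * bytes_per_float / 1e9

# Storage for 1M document chunks at each dimension tier
for dims in (384, 768, 1024):
    print(dims, round(index_size_gb(1_000_000, dims), 2), "GB")
```

At a million chunks, 768 dimensions costs roughly 3 GB of raw vectors, which is trivial for most deployments; the dimension choice only starts to bite at much larger scale.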
Practical Setup: Ollama + Local Embeddings
Pull and Embed with Ollama
```bash
# Download the embedding model (one-time, ~300MB)
ollama pull nomic-embed-text:v1.5

# Test it
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text:v1.5","prompt":"Docker containers are portable"}'
```

Python Integration
```python
import ollama

text = "Docker containers are lightweight and portable"
response = ollama.embeddings(
    model="nomic-embed-text:v1.5",
    prompt=text,
)
embedding = response["embedding"]
print(len(embedding))  # 768 dimensions
```

The first call loads the model into memory. Subsequent calls are fast.
Or Use Sentence Transformers Directly
```python
from sentence_transformers import SentenceTransformer

# nomic models ship custom modeling code, so trust_remote_code is required
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

texts = [
    "Docker is a containerization platform",
    "Kubernetes orchestrates containers",
    "My cat is sleeping",
]
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 768)
```

Test on Your Data
Don’t trust benchmarks. Benchmark on your actual documents.
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model_fast = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim
model_good = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)  # 768-dim

docs = [
    "Docker containers isolate applications",
    "Kubernetes automates container deployment",
    "AWS Lambda runs serverless functions",
    "Python is a programming language",
]
query = "How do I containerize my app?"

# Embed with both models
q1, d1 = model_fast.encode(query), model_fast.encode(docs)
q2, d2 = model_good.encode(query), model_good.encode(docs)

# Compare rankings
scores1 = cosine_similarity([q1], d1)[0]
scores2 = cosine_similarity([q2], d2)[0]
print("MiniLM ranking:", np.argsort(-scores1))
print("nomic ranking:", np.argsort(-scores2))

# Does nomic rank "Docker containers..." higher? Worth the extra latency?
```

The Cost/Speed Math
OpenAI text-embedding-3-small:
- Cost: $0.02 per 1M tokens
- Speed: network latency + API time (~100-200ms)
- Quality: excellent

Local (nomic-embed-text on GPU):
- Cost: free, once you have the hardware
- Speed: ~10-50ms per document
- Quality: excellent (MTEB top 10)

Embedding a 1000-page knowledge base (500 tokens/page, so 500K tokens):
- OpenAI: about a cent, one-time
- Local: free (your GPU does the work)

At these prices the API is cheap at small scale; local only wins on raw cost at very high volume. The stronger arguments for local are elsewhere: if you’re privacy-conscious or self-hosting, local is non-negotiable.
When to Upgrade vs. Stick
Stick with all-MiniLM if:
- Running on CPU only
- Documents are simple/straightforward
- You’re just prototyping
Upgrade to nomic-embed-text or bge-m3 if:
- You have a GPU or good CPU
- Your documents are technical or nuanced
- Query ranking quality matters (it does in production RAG)
- You’ve benchmarked and MiniLM ranks important docs poorly
Match embedding dimensions to LLM context:
- Small LLM (8K context): smaller embeddings (384) is fine
- Large LLM (128K+ context): bigger embeddings (768+) help more
- Reason: with more retrieved chunks in the context window, fine-grained relevance ranking matters more
The Real Decision
Use OpenAI if: Privacy isn’t a concern, budget is unlimited, you want SOTA quality.
Use local if: Privacy matters, you query frequently, you embed a large corpus once, you self-host.
Use hybrid if: Critical queries use API, bulk queries use local.
Most teams go local and never look back. For common tasks the quality gap is maybe 5–10%, rarely worth the cost and privacy tradeoff.
Test on your data, pick what works, commit to it.