Choosing Embedding Models
Guide to selecting the right embedding model for your RAG use case
Overview
Selecting the right embedding model is crucial for RAG performance. The "best" model depends on your specific requirements: accuracy, speed, cost, language support, and domain specialization.
Decision Framework
1. Define Your Requirements
Ask these questions:
Performance Requirements:
- What's your acceptable latency? (< 50ms, < 200ms, < 1s)
- How many queries per second?
- What's your accuracy threshold?
Resource Constraints:
- Running locally or via API?
- GPU available?
- Storage limitations?
Domain Specifics:
- General knowledge or specialized domain?
- Single language or multilingual?
- Short queries or long documents?
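The storage question can be answered with quick arithmetic before any model is downloaded. The sketch below estimates raw vector storage, assuming float32 values (4 bytes each) and ignoring index overhead and metadata. The function name and the one-million-chunk corpus are illustrative assumptions.

```python
def estimate_vector_storage_bytes(num_vectors: int, dimensions: int,
                                  bytes_per_value: int = 4) -> int:
    """Raw size of the embedding matrix, excluding index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value


# Illustrative corpus of 1M chunks at the dimensions covered in this guide
for dims in (384, 768, 1024, 1536, 3072):
    size_gb = estimate_vector_storage_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>4} dims: {size_gb:.2f} GB")
```

Real vector databases add index structures and payload data on top of this figure, so treat it as a lower bound.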
Model Categories
Open-Source Models (Self-Hosted)
Small & Fast (384 dimensions)
all-MiniLM-L6-v2
- Speed: ~1000 docs/sec on CPU (indicative; actual throughput depends on hardware, batch size, and text length)
- Quality: Good for general use
- Size: 80MB
- Best for: High-throughput, resource-constrained
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("your text") # 384 dimensions
Pros:
- Very fast inference
- Small model size
- Low memory footprint
Cons:
- Less nuanced semantic understanding
- Struggles with complex queries
Balanced (768 dimensions)
all-mpnet-base-v2
- Speed: ~400 docs/sec on CPU
- Quality: Better semantic understanding
- Size: 420MB
- Best for: Most production RAG systems
model = SentenceTransformer('all-mpnet-base-v2')
embedding = model.encode("your text") # 768 dimensions
Pros:
- Good accuracy/speed tradeoff
- Handles complex queries well
- Widely tested and reliable
Cons:
- Slower than MiniLM
- Larger model size
High Quality (768-1024 dimensions)
BAAI/bge-large-en-v1.5
- Speed: ~200 docs/sec on CPU
- Quality: Top-tier English retrieval quality (see the MTEB leaderboard for current rankings)
- Size: 1.3GB
- Best for: Accuracy-critical applications
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embedding = model.encode("your text") # 1024 dimensions
Pros:
- Excellent retrieval quality
- Strong performance on benchmarks
- Good for complex domains
Cons:
- Slower inference
- Requires more resources
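The throughput figures above translate directly into indexing time. A rough sketch, treating the quoted docs/sec values as indicative and assuming throughput stays constant over the whole corpus (the function name and the corpus size are illustrative):

```python
def embedding_time_minutes(num_docs: int, docs_per_sec: float) -> float:
    """Minutes needed to embed num_docs at a constant throughput."""
    if docs_per_sec <= 0:
        raise ValueError("docs_per_sec must be positive")
    return num_docs / docs_per_sec / 60


# Throughput values are the approximate CPU figures quoted in this section
cpu_throughput = {
    "all-MiniLM-L6-v2": 1000,
    "all-mpnet-base-v2": 400,
    "BAAI/bge-large-en-v1.5": 200,
}

for name, rate in cpu_throughput.items():
    minutes = embedding_time_minutes(1_000_000, rate)
    print(f"{name}: ~{minutes:.0f} min for 1M docs")
```

Measuring throughput on your own hardware and substituting the result gives a more trustworthy estimate than the quoted figures.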
API-Based Models
OpenAI Embeddings
text-embedding-3-small
- Dimensions: 1536
- Cost: $0.02 / 1M tokens
- Best for: Quick prototyping
text-embedding-3-large
- Dimensions: 3072
- Cost: $0.13 / 1M tokens
- Best for: Maximum quality
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="your text"
)
embedding = response.data[0].embedding  # 1536 dimensions
Pros:
- No infrastructure management
- Consistent quality
- Regular updates
Cons:
- Ongoing costs
- API dependency
- Data privacy concerns
Cohere Embeddings
embed-english-v3.0
- Dimensions: 1024
- Cost: $0.10 / 1M tokens
- Best for: English-only applications
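import cohere
co = cohere.Client('your-api-key')
response = co.embed(
    texts=["your text"],
    model="embed-english-v3.0",
    # v3 models require input_type: "search_document" when indexing,
    # "search_query" when embedding user queries
    input_type="search_document"
)
embedding = response.embeddings[0]  # 1024 dimensions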
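Both APIs accept a list of texts per request, so large corpora are usually sent in batches rather than one call per text. A minimal batching helper follows, assuming a batch size of 96 (an illustrative value; check each provider's current per-request limits) and a hypothetical `embed_batch` callable that wraps whichever client is in use:

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_all(texts: Sequence[str],
              embed_batch: Callable[[List[str]], List[List[float]]],
              batch_size: int = 96) -> List[List[float]]:
    """Embed texts batch by batch, preserving input order.

    embed_batch is a hypothetical wrapper around an API client that takes a
    list of texts and returns one vector per text.
    """
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

Keeping the client call behind `embed_batch` also gives one place to add retries or rate limiting later.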
Selection Matrix
| Use Case | Recommended Model | Reasoning |
|---|---|---|
| Prototype (self-hosted) | all-MiniLM-L6-v2 | Fast iteration, low cost |
| Production (General) | all-mpnet-base-v2 | Best balance |
| High Accuracy | BAAI/bge-large-en-v1.5 | Maximum quality |
| Multilingual | paraphrase-multilingual-mpnet-base-v2 | 50+ languages |
| Code Search | microsoft/codebert-base | Code-specific (fine-tune for retrieval) |
| Legal/Medical | Fine-tuned domain model | Specialized terminology |
| Prototype (API) | text-embedding-3-small | No setup needed |
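The matrix can be encoded as a small rule-based helper for use in setup scripts. This is a sketch: the `Requirements` fields and the priority order of the rules are assumptions, not part of the matrix, and the Legal/Medical row is omitted because it points to a fine-tuned model rather than a named one.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    multilingual: bool = False
    code_search: bool = False
    accuracy_critical: bool = False
    self_hosted: bool = True
    prototype: bool = False


def recommend_model(req: Requirements) -> str:
    """Return a starting-point model name based on the selection matrix."""
    # Specialized needs take priority over general-purpose choices (assumed order)
    if req.code_search:
        return "microsoft/codebert-base"
    if req.multilingual:
        return "paraphrase-multilingual-mpnet-base-v2"
    if not req.self_hosted:
        return "text-embedding-3-small"
    if req.accuracy_critical:
        return "BAAI/bge-large-en-v1.5"
    if req.prototype:
        return "all-MiniLM-L6-v2"
    return "all-mpnet-base-v2"


print(recommend_model(Requirements(accuracy_critical=True)))
```

The result is only a first candidate; the evaluation steps in the next section decide whether it holds up on real queries.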
Evaluation Methodology
1. Create Test Dataset
test_queries = [
    {
        "query": "How do I reset my password?",
        "expected_doc_id": "doc_123"
    },
    {
        "query": "Refund policy",
        "expected_doc_id": "doc_456"
    }
    # ... 100+ examples
]
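Before benchmarking, it helps to check that the test set is internally consistent, since a mislabeled `expected_doc_id` silently lowers every model's score. A minimal validation sketch follows; the function name and the `known_doc_ids` argument (the IDs present in your index) are assumptions.

```python
from typing import Dict, Iterable, List


def validate_test_queries(test_queries: List[Dict[str, str]],
                          known_doc_ids: Iterable[str]) -> List[str]:
    """Return human-readable problems; an empty list means the set is usable."""
    known = set(known_doc_ids)
    problems: List[str] = []
    seen_queries = set()
    for i, item in enumerate(test_queries):
        query = item.get("query", "").strip()
        doc_id = item.get("expected_doc_id")
        if not query:
            problems.append(f"item {i}: empty query")
        elif query.lower() in seen_queries:
            problems.append(f"item {i}: duplicate query {query!r}")
        seen_queries.add(query.lower())
        if doc_id not in known:
            problems.append(f"item {i}: unknown expected_doc_id {doc_id!r}")
    return problems
```

Running it against the two example queries with `known_doc_ids=["doc_123", "doc_456"]` returns an empty list.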
2. Benchmark Models
from sentence_transformers import SentenceTransformer
import time

def benchmark_model(model_name, test_queries):
    model = SentenceTransformer(model_name)

    # Measure query-encoding speed (queries/sec)
    start = time.perf_counter()
    embeddings = model.encode([q['query'] for q in test_queries])
    speed = len(test_queries) / (time.perf_counter() - start)

    # Measure accuracy (recall@5)
    # search() is a placeholder for your vector store lookup; it must return the
    # top-k document IDs from an index built with this same model
    correct = 0
    for i, query in enumerate(test_queries):
        results = search(embeddings[i], k=5)
        if query['expected_doc_id'] in results:
            correct += 1
    accuracy = correct / len(test_queries)

    return {
        'model': model_name,
        'speed': speed,
        'accuracy': accuracy
    }

# Compare models
models = [
    'all-MiniLM-L6-v2',
    'all-mpnet-base-v2',
    'BAAI/bge-large-en-v1.5'
]
results = [benchmark_model(m, test_queries) for m in models]
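The benchmark calls a `search` function that is not defined in this guide. For small evaluation sets, a brute-force cosine-similarity search is enough to stand in for a vector database. The sketch below is one such stand-in, written with the standard library only. The names `cosine_similarity`, `brute_force_search`, and `recall_at_k`, and the explicit `doc_vectors` argument, are illustrative rather than part of any library.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec: Sequence[float],
                       doc_vectors: Dict[str, Sequence[float]],
                       k: int = 5) -> List[str]:
    """Return the IDs of the k documents most similar to query_vec."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(retrieved: List[List[str]], expected: List[str]) -> float:
    """Fraction of queries whose expected document appears in the retrieved list."""
    hits = sum(1 for ids, target in zip(retrieved, expected) if target in ids)
    return hits / len(expected)
```

The document vectors must come from the same model as the query vectors, so the index has to be rebuilt for each model being compared.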
3. Cost Analysis
def calculate_monthly_cost(
    queries_per_day,
    avg_query_tokens,
    model_type='api',  # 'api' or 'self-hosted'
    api_cost_per_1m=0.02  # OpenAI text-embedding-3-small
):
    """Estimate the monthly cost (USD) of embedding queries"""
    if model_type == 'api':
        monthly_tokens = queries_per_day * avg_query_tokens * 30
        monthly_cost = (monthly_tokens / 1_000_000) * api_cost_per_1m
        return monthly_cost
    else:
        # Self-hosted: flat instance cost, independent of query volume
        return 100  # Example: $100/month for GPU instance

# Example
api_cost = calculate_monthly_cost(
    queries_per_day=10000,
    avg_query_tokens=50,
    model_type='api'
)
print(f"API cost: ${api_cost:.2f}/month")  # 15M tokens/month

self_hosted_cost = calculate_monthly_cost(
    queries_per_day=10000,
    avg_query_tokens=50,  # required argument; not used for self-hosted
    model_type='self-hosted'
)
print(f"Self-hosted cost: ${self_hosted_cost}/month")
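The comparison above implies a break-even point: the daily query volume at which API spend matches the fixed self-hosted cost. A sketch of that calculation follows, reusing the $100/month and $0.02 per 1M tokens figures from the example. It covers query-time embedding only and ignores the one-off cost of embedding the document corpus and any operational overhead; the function name is illustrative.

```python
def break_even_queries_per_day(self_hosted_monthly_cost: float,
                               avg_query_tokens: int,
                               api_cost_per_1m: float = 0.02,
                               days_per_month: int = 30) -> float:
    """Daily query volume at which API spend equals the fixed self-hosted cost."""
    cost_per_query = avg_query_tokens / 1_000_000 * api_cost_per_1m
    return self_hosted_monthly_cost / (cost_per_query * days_per_month)


volume = break_even_queries_per_day(100, avg_query_tokens=50)
print(f"Break-even: ~{volume:,.0f} queries/day")
```

Below the break-even volume the API is cheaper on these assumptions; above it, the fixed-cost instance wins, provided it can sustain the load.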
Domain-Specific Considerations
Code Search
Use code-specific models:
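from sentence_transformers import SentenceTransformer, InputExample

# For code similarity
# Note: codebert-base is not published as a sentence-transformers model, so it is
# loaded with default mean pooling; fine-tune it before relying on its similarity scores
model = SentenceTransformer('microsoft/codebert-base')

# Or fine-tune on your codebase
train_examples = [
    InputExample(texts=['def hello():', 'function hello() {}']),
    # ... more code pairs
]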
E-commerce
Prioritize product attribute understanding:
# Fine-tune on product descriptions
model = SentenceTransformer('all-mpnet-base-v2')

# Train with product-specific data
train_data = [
    ("red cotton t-shirt", "scarlet cotton tee"),
    ("wireless headphones", "bluetooth earphones"),
]
Legal/Medical
Critical: Use domain-specific or fine-tuned models:
# Medical
model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')
# Legal
# Fine-tune on legal corpus (case law, contracts)
Migration Strategy
Switching Models
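Embeddings from different models are not comparable, so switching models means re-embedding every document with the new model:
from sentence_transformers import SentenceTransformer

def migrate_embeddings(old_model, new_model, documents):
    """Migrate from one model to another"""
    # old_model is not needed for re-embedding; keep it for logging or rollback

    # Load new model
    new_encoder = SentenceTransformer(new_model)

    # Re-embed all documents
    new_embeddings = new_encoder.encode(
        documents,
        batch_size=32,
        show_progress_bar=True
    )

    # Update vector database (update_vector_db is a placeholder for your own code).
    # The vector dimension may change (384 -> 768 here), so rebuild the index
    # instead of overwriting vectors in place
    update_vector_db(new_embeddings)
    return new_embeddings

# Usage (all_docs is your list of document texts)
migrate_embeddings(
    old_model='all-MiniLM-L6-v2',
    new_model='all-mpnet-base-v2',
    documents=all_docs
)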
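One way to avoid downtime during a migration is to build the new index alongside the old one and switch traffic only once it is complete. The sketch below illustrates that pattern with a toy in-memory store. `InMemoryVectorStore`, `migrate_with_swap`, and the collection names are hypothetical; a real vector database would provide its own collection or alias mechanism.

```python
from typing import Callable, Dict, List, Sequence


class InMemoryVectorStore:
    """Toy stand-in for a vector database with named collections and one live pointer."""

    def __init__(self) -> None:
        self.collections: Dict[str, List[Sequence[float]]] = {}
        self.live: str = ""

    def create_collection(self, name: str, vectors: List[Sequence[float]]) -> None:
        self.collections[name] = vectors

    def switch_live(self, name: str) -> None:
        if name not in self.collections:
            raise KeyError(f"unknown collection: {name}")
        self.live = name


def migrate_with_swap(store: InMemoryVectorStore,
                      documents: List[str],
                      encode: Callable[[List[str]], List[Sequence[float]]],
                      new_collection: str) -> str:
    """Build the new collection next to the old one, then switch traffic over.

    Returns the name of the previous live collection so it can be kept for rollback.
    """
    previous = store.live
    vectors = encode(documents)
    if len(vectors) != len(documents):
        raise ValueError("encoder returned a different number of vectors than documents")
    store.create_collection(new_collection, vectors)
    store.switch_live(new_collection)
    return previous
```

Because the old collection is left untouched, rolling back is a single `switch_live` call with the returned name. Queries must be embedded with the model that matches whichever collection is live.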
A/B Testing
Test new models before full migration:
import random

def ab_test_models(query, model_a, model_b):
    """Route a query to one of two models and log the outcome"""
    # Route 50% to each model
    model = model_a if random.random() < 0.5 else model_b
    results = search_with_model(query, model)

    # Log for analysis (search_with_model and log_search are placeholders for
    # your own code); attach user satisfaction later, once feedback arrives
    log_search(query, model, results)
    return results
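Random per-query routing means the same user can see both models within one session, which makes satisfaction signals harder to attribute. A common alternative is deterministic assignment by hashing a user ID. The sketch below assumes a stable `user_id` string is available; the function name and experiment label are illustrative.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "embedding-model-test",
                   share_b: float = 0.5) -> str:
    """Deterministically map a user to variant 'a' or 'b'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale to [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "b" if bucket < share_b else "a"
```

Including the experiment name in the hash keeps assignments independent across experiments, and lowering `share_b` allows a gradual rollout of the new model.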
Performance Optimization
Model Quantization
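Export the model to ONNX for faster inference. Quantizing the exported model (for example, to int8) is a separate follow-up step that also reduces its size:
from optimum.onnxruntime import ORTModelForFeatureExtraction

# Convert to ONNX (export=True converts the PyTorch checkpoint on load)
model = ORTModelForFeatureExtraction.from_pretrained(
    'sentence-transformers/all-mpnet-base-v2',  # full Hugging Face Hub ID is required here
    export=True
)
# The ONNX model returns token embeddings; apply the same mean pooling as the
# original model to get sentence embeddings.
# The 2-3x speedup is indicative; measure speed and accuracy on your own workload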
Caching Strategy
Cache embeddings for common queries:
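from functools import lru_cache

@lru_cache(maxsize=1000)
def get_query_embedding(query):
    # model is a SentenceTransformer loaded elsewhere
    return model.encode(query)

# Repeated calls with the same query string skip the model and return the cached
# array; treat it as read-only, since every caller receives the same object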
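An exact-match cache misses queries that differ only in case or spacing. The sketch below wraps any encoder in a cache keyed on a normalized query string and returns immutable tuples. `make_cached_encoder` is an illustrative name, and lowercasing the query is an assumption that may not suit case-sensitive models.

```python
from functools import lru_cache
from typing import Callable, Sequence, Tuple


def make_cached_encoder(encode: Callable[[str], Sequence[float]],
                        maxsize: int = 1000) -> Callable[[str], Tuple[float, ...]]:
    """Wrap encode with an LRU cache keyed on a normalized query string."""

    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str) -> Tuple[float, ...]:
        # Tuples are immutable, so callers cannot corrupt the cached value
        return tuple(float(x) for x in encode(normalized))

    def get_embedding(query: str) -> Tuple[float, ...]:
        # Collapse whitespace and case so trivially different queries share an entry
        normalized = " ".join(query.lower().split())
        return _cached(normalized)

    get_embedding.cache_info = _cached.cache_info  # expose hit/miss counters
    return get_embedding
```

With a loaded model, `make_cached_encoder(model.encode)` would be the typical call, and `cache_info()` reports the hit rate for tuning `maxsize`.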
Decision Checklist
- Defined latency requirements
- Measured query volume
- Calculated cost (API vs. self-hosted)
- Created evaluation dataset
- Benchmarked 2-3 candidate models
- Tested on real user queries
- Considered domain-specific needs
- Planned migration strategy
Next Steps
- Embedding Fundamentals - Understand how embeddings work
- Fine-Tuning Embeddings - Improve model for your domain
- Multilingual Embeddings - Handle multiple languages
Additional Resources
- MTEB Leaderboard - Model benchmarks
- Sentence Transformers Models
- OpenAI Embeddings Guide