Choosing Embedding Models

Guide to selecting the right embedding model for your RAG use case

Overview

Selecting the right embedding model is crucial for RAG performance. The "best" model depends on your specific requirements: accuracy, speed, cost, language support, and domain specialization.

Decision Framework

1. Define Your Requirements

Ask these questions:

Performance Requirements:

  • What's your acceptable latency? (< 50ms, < 200ms, < 1s)
  • How many queries per second?
  • What's your accuracy threshold?

Resource Constraints:

  • Running locally or via API?
  • GPU available?
  • Storage limitations?

Domain Specifics:

  • General knowledge or specialized domain?
  • Single language or multilingual?
  • Short queries or long documents?
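
One way to make the framework concrete is to encode the answers and derive a shortlist. A minimal sketch, where the Requirements fields and threshold rules are illustrative assumptions rather than fixed guidance:

from dataclasses import dataclass

@dataclass
class Requirements:
    max_latency_ms: float   # acceptable per-query latency
    gpu_available: bool
    multilingual: bool

def shortlist(req: Requirements) -> list:
    """Map coarse requirements to candidate models (illustrative rules)."""
    if req.multilingual:
        return ['paraphrase-multilingual-mpnet-base-v2']
    if req.max_latency_ms < 50 and not req.gpu_available:
        return ['all-MiniLM-L6-v2']
    return ['all-mpnet-base-v2', 'BAAI/bge-large-en-v1.5']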

Model Categories

Open-Source Models (Self-Hosted)

Small & Fast (384 dimensions)

all-MiniLM-L6-v2

  • Speed: ~1000 docs/sec on CPU
  • Quality: Good for general use
  • Size: 80MB
  • Best for: High-throughput or resource-constrained deployments

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("your text")  # 384 dimensions
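
If you retrieve by cosine (or dot-product) similarity, sentence-transformers can return unit-length vectors directly:

# batch-encode with normalized vectors, so dot product == cosine similarity
embeddings = model.encode(
    ["first document", "second document"],
    batch_size=64,
    normalize_embeddings=True
)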

Pros:

  • Very fast inference
  • Small model size
  • Low memory footprint

Cons:

  • Less nuanced semantic understanding
  • Struggles with complex queries

Balanced (768 dimensions)

all-mpnet-base-v2

  • Speed: ~400 docs/sec on CPU
  • Quality: Better semantic understanding
  • Size: 420MB
  • Best for: Most production RAG systems

model = SentenceTransformer('all-mpnet-base-v2')
embedding = model.encode("your text")  # 768 dimensions

Pros:

  • Good accuracy/speed tradeoff
  • Handles complex queries well
  • Widely tested and reliable

Cons:

  • Slower than MiniLM
  • Larger model size

High Quality (768-1024 dimensions)

BAAI/bge-large-en-v1.5

  • Speed: ~200 docs/sec on CPU
  • Quality: State-of-the-art for English
  • Size: 1.3GB
  • Best for: Accuracy-critical applications

model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embedding = model.encode("your text")  # 1024 dimensions
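
One bge-specific detail: the model card recommends prefixing short retrieval queries (not the documents) with an instruction string, which can improve query-to-passage matching:

# query-side instruction recommended by the bge-en v1.5 model card
instruction = "Represent this sentence for searching relevant passages: "
query_embedding = model.encode(instruction + "how do I reset my password?")
doc_embedding = model.encode("To reset your password, go to ...")  # no prefix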

Pros:

  • Excellent retrieval quality
  • Strong performance on benchmarks
  • Good for complex domains

Cons:

  • Slower inference
  • Requires more resources

API-Based Models

OpenAI Embeddings

text-embedding-3-small

  • Dimensions: 1536
  • Cost: $0.02 / 1M tokens
  • Best for: Quick prototyping

text-embedding-3-large

  • Dimensions: 3072
  • Cost: $0.13 / 1M tokens
  • Best for: Maximum quality

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="your text"
)

embedding = response.data[0].embedding
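
Both text-embedding-3 models also accept a dimensions parameter that returns truncated, re-normalized vectors; this trades a little quality for smaller, cheaper indexes:

# request shortened vectors to cut index size and storage cost
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="your text",
    dimensions=512
)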

Pros:

  • No infrastructure management
  • Consistent quality
  • Regular updates

Cons:

  • Ongoing costs
  • API dependency
  • Data privacy concerns

Cohere Embeddings

embed-english-v3.0

  • Dimensions: 1024
  • Cost: $0.10 / 1M tokens
  • Best for: English-only applications

import cohere

co = cohere.Client('your-api-key')

# v3 models require an input_type; use "search_query" for queries
response = co.embed(
    texts=["your text"],
    model="embed-english-v3.0",
    input_type="search_query"
)

embedding = response.embeddings[0]
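
The v3 models are asymmetric: embed documents with input_type="search_document" at index time and queries with input_type="search_query" at query time:

# index-time call; pair it with input_type="search_query" at query time
doc_response = co.embed(
    texts=["To reset your password, go to ..."],
    model="embed-english-v3.0",
    input_type="search_document"
)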

Selection Matrix

| Use Case | Recommended Model | Reasoning |
| --- | --- | --- |
| Startup/Prototype | all-MiniLM-L6-v2 | Fast iteration, low cost |
| Production (General) | all-mpnet-base-v2 | Best balance |
| High Accuracy | BAAI/bge-large-en-v1.5 | Maximum quality |
| Multilingual | paraphrase-multilingual-mpnet-base-v2 | 50+ languages |
| Code Search | microsoft/codebert-base | Code-specific |
| Legal/Medical | Fine-tuned domain model | Specialized terminology |
| Quick Prototype | text-embedding-3-small | No setup needed |

Evaluation Methodology

1. Create Test Dataset

test_queries = [
    {
        "query": "How do I reset my password?",
        "expected_doc_id": "doc_123"
    },
    {
        "query": "Refund policy",
        "expected_doc_id": "doc_456"
    }
    # ... 100+ examples
]

2. Benchmark Models

from sentence_transformers import SentenceTransformer
import time

def benchmark_model(model_name, test_queries):
    model = SentenceTransformer(model_name)

    # Measure speed
    start = time.time()
    embeddings = model.encode([q['query'] for q in test_queries])
    speed = len(test_queries) / (time.time() - start)

    # Measure accuracy (recall@5); `search` is your retrieval function
    # and must run against documents embedded with this same model
    correct = 0
    for i, query in enumerate(test_queries):
        results = search(embeddings[i], k=5)
        if query['expected_doc_id'] in results:
            correct += 1

    accuracy = correct / len(test_queries)

    return {
        'model': model_name,
        'speed': speed,         # queries/sec
        'accuracy': accuracy    # recall@5
    }

# Compare models
models = [
    'all-MiniLM-L6-v2',
    'all-mpnet-base-v2',
    'BAAI/bge-large-en-v1.5'
]

results = [benchmark_model(m, test_queries) for m in models]
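
The search helper above is left unspecified; a minimal in-memory version with numpy (assuming doc_embeddings and doc_ids were built with the same model under test) could look like:

import numpy as np

def make_search(doc_embeddings, doc_ids):
    """Build a search(query_vec, k) function over a small corpus."""
    docs = np.asarray(doc_embeddings, dtype=float)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    def search(query_vec, k=5):
        q = np.asarray(query_vec, dtype=float)
        q = q / np.linalg.norm(q)
        scores = docs @ q                  # cosine similarity per doc
        top = np.argsort(-scores)[:k]      # indices of the top-k docs
        return [doc_ids[i] for i in top]

    return search

Bind it once per model, e.g. search = make_search(doc_embeddings, doc_ids), after re-embedding the documents with the model under test.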

3. Cost Analysis

def calculate_monthly_cost(
    queries_per_day,
    avg_query_tokens=50,
    model_type='api',          # 'api' or 'self-hosted'
    api_cost_per_1m=0.02,      # OpenAI text-embedding-3-small
    gpu_cost_per_month=100     # example flat rate for a GPU instance
):
    if model_type == 'api':
        monthly_tokens = queries_per_day * avg_query_tokens * 30
        return (monthly_tokens / 1_000_000) * api_cost_per_1m
    else:
        # Self-hosted: roughly a flat infrastructure cost,
        # independent of query volume
        return gpu_cost_per_month

# Example
api_cost = calculate_monthly_cost(
    queries_per_day=10000,
    avg_query_tokens=50,
    model_type='api'
)
print(f"API cost: ${api_cost}/month")

self_hosted_cost = calculate_monthly_cost(
    queries_per_day=10000,
    model_type='self-hosted'
)
print(f"Self-hosted cost: ${self_hosted_cost}/month")

Domain-Specific Considerations

Code Search

Use code-specific models:

# For code similarity (note: codebert-base loads with default mean
# pooling; it is not trained as a sentence encoder out of the box)
model = SentenceTransformer('microsoft/codebert-base')

# Or fine-tune on your codebase with paired examples
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

train_examples = [
    InputExample(texts=['def hello():', 'function hello() {}']),
    # ... more code pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)

E-commerce

Prioritize product attribute understanding:

# Fine-tune on product descriptions
model = SentenceTransformer('all-mpnet-base-v2')

# Train with product-specific data
train_data = [
    ("red cotton t-shirt", "scarlet cotton tee"),
    ("wireless headphones", "bluetooth earphones"),
]
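
These pairs feed the same fine-tuning recipe shown under Code Search; the only extra step is wrapping them as InputExample pairs:

# wrap (query phrasing, catalog phrasing) pairs for the
# MultipleNegativesRankingLoss recipe shown above
train_examples = [InputExample(texts=[a, b]) for a, b in train_data]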

Legal/Medical

Critical: Use domain-specific or fine-tuned models:

# Medical
model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')

# Legal
# Fine-tune on legal corpus (case law, contracts)

Migration Strategy

Switching Models

When upgrading models, re-embed every document: vectors produced by different models occupy incompatible embedding spaces and cannot be mixed in one index:

def migrate_embeddings(old_model, new_model, documents):
    """Migrate from one model to another.

    old_model is kept only for logging/rollback; its vectors
    cannot be reused, so the whole corpus is re-encoded.
    """

    # Load new model
    new_encoder = SentenceTransformer(new_model)

    # Re-embed all documents
    new_embeddings = new_encoder.encode(
        documents,
        batch_size=32,
        show_progress_bar=True
    )

    # Update vector database (placeholder for your store's upsert/reindex API)
    update_vector_db(new_embeddings)

    return new_embeddings

# Usage
migrate_embeddings(
    old_model='all-MiniLM-L6-v2',
    new_model='all-mpnet-base-v2',
    documents=all_docs
)

A/B Testing

Test new models before full migration:

import random

def ab_test_models(query, model_a, model_b):
    """Compare two models on the same live traffic."""

    # Route 50% of queries to each model
    model = model_a if random.random() < 0.5 else model_b

    results = search_with_model(query, model)  # your retrieval function

    # Log the assignment and results; join with feedback signals
    # (clicks, ratings) later for offline analysis
    log_search(query, model, results)

    return results
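
Per-query random routing can show the same user different models within one session; a common refinement is deterministic bucketing, sketched here with a hypothetical user_id:

import hashlib

def assign_model(user_id, model_a, model_b):
    """Stable 50/50 split: a given user always sees the same model."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return model_a if int(digest, 16) % 2 == 0 else model_b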

Performance Optimization

ONNX Export & Quantization

Speed up inference, and optionally shrink the model, with minimal quality loss:

from optimum.onnxruntime import ORTModelForFeatureExtraction

# Convert to ONNX for faster CPU inference
model = ORTModelForFeatureExtraction.from_pretrained(
    'sentence-transformers/all-mpnet-base-v2',
    export=True
)

# Typically 2-3x faster inference with minimal accuracy loss
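
For true int8 quantization on top of the ONNX export, optimum provides ORTQuantizer. A minimal sketch, assuming a CPU with AVX512-VNNI support (other AutoQuantizationConfig presets target other hardware):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# dynamic int8 quantization of the exported model above
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="mpnet-onnx-int8", quantization_config=qconfig)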

Caching Strategy

Cache embeddings for common queries:

from functools import lru_cache

@lru_cache(maxsize=1000)
def get_query_embedding(query):
    # strings are hashable, so raw query text works as the cache key;
    # encode() returns a numpy array: treat cached results as read-only,
    # since mutating one would corrupt every later cache hit
    return model.encode(query)

# Subsequent calls with the same query string are served from the cache

Decision Checklist

  • Defined latency requirements
  • Measured query volume
  • Calculated cost (API vs. self-hosted)
  • Created evaluation dataset
  • Benchmarked 2-3 candidate models
  • Tested on real user queries
  • Considered domain-specific needs
  • Planned migration strategy
