Choosing Embedding Models

Guide to selecting the right embedding model for your RAG use case

Overview

Selecting the right embedding model is crucial for RAG performance. The "best" model depends on your specific requirements: accuracy, speed, cost, language support, and domain specialization.

Decision Framework

1. Define Your Requirements

Ask these questions:

Performance Requirements:

  • What's your acceptable latency? (< 50ms, < 200ms, < 1s)
  • How many queries per second?
  • What's your accuracy threshold?

Resource Constraints:

  • Running locally or via API?
  • GPU available?
  • Storage limitations?

Domain Specifics:

  • General knowledge or specialized domain?
  • Single language or multilingual?
  • Short queries or long documents?
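
One way to make the framework concrete is to encode the answers and derive a shortlist. A minimal sketch, where the Requirements fields and threshold rules are illustrative assumptions rather than fixed guidance:

from dataclasses import dataclass

@dataclass
class Requirements:
    max_latency_ms: float   # acceptable per-query latency
    gpu_available: bool
    multilingual: bool

def shortlist(req: Requirements) -> list:
    """Map coarse requirements to candidate models (illustrative rules)."""
    if req.multilingual:
        return ['paraphrase-multilingual-mpnet-base-v2']
    if req.max_latency_ms < 50 and not req.gpu_available:
        return ['all-MiniLM-L6-v2']
    return ['all-mpnet-base-v2', 'BAAI/bge-large-en-v1.5']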

Model Categories

Open-Source Models (Self-Hosted)

Small & Fast (384 dimensions)

all-MiniLM-L6-v2

  • Speed: ~1000 docs/sec on CPU
  • Quality: Good for general use
  • Size: 80MB
  • Best for: High-throughput or resource-constrained deployments

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("your text")  # 384 dimensions
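
If you retrieve by cosine (or dot-product) similarity, sentence-transformers can return unit-length vectors directly:

# batch-encode with normalized vectors, so dot product == cosine similarity
embeddings = model.encode(
    ["first document", "second document"],
    batch_size=64,
    normalize_embeddings=True
)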

Pros:

  • Very fast inference
  • Small model size
  • Low memory footprint

Cons:

  • Less nuanced semantic understanding
  • Struggles with complex queries

Balanced (768 dimensions)

all-mpnet-base-v2

  • Speed: ~400 docs/sec on CPU
  • Quality: Better semantic understanding
  • Size: 420MB
  • Best for: Most production RAG systems

model = SentenceTransformer('all-mpnet-base-v2')
embedding = model.encode("your text")  # 768 dimensions

Pros:

  • Good accuracy/speed tradeoff
  • Handles complex queries well
  • Widely tested and reliable

Cons:

  • Slower than MiniLM
  • Larger model size

High Quality (768-1024 dimensions)

BAAI/bge-large-en-v1.5

  • Speed: ~200 docs/sec on CPU
  • Quality: State-of-the-art for English
  • Size: 1.3GB
  • Best for: Accuracy-critical applications

model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embedding = model.encode("your text")  # 1024 dimensions
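
One bge-specific detail: the model card recommends prefixing short retrieval queries (not the documents) with an instruction string, which can improve query-to-passage matching:

# query-side instruction recommended by the bge-en v1.5 model card
instruction = "Represent this sentence for searching relevant passages: "
query_embedding = model.encode(instruction + "how do I reset my password?")
doc_embedding = model.encode("To reset your password, go to ...")  # no prefix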

Pros:

  • Excellent retrieval quality
  • Strong performance on benchmarks
  • Good for complex domains

Cons:

  • Slower inference
  • Requires more resources

API-Based Models

OpenAI Embeddings

text-embedding-3-small

  • Dimensions: 1536
  • Cost: $0.02 / 1M tokens
  • Best for: Quick prototyping

text-embedding-3-large

  • Dimensions: 3072
  • Cost: $0.13 / 1M tokens
  • Best for: Maximum quality

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="your text"
)

embedding = response.data[0].embedding
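
Both text-embedding-3 models also accept a dimensions parameter that returns truncated, re-normalized vectors; this trades a little quality for smaller, cheaper indexes:

# request shortened vectors to cut index size and storage cost
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="your text",
    dimensions=512
)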

Pros:

  • No infrastructure management
  • Consistent quality
  • Regular updates

Cons:

  • Ongoing costs
  • API dependency
  • Data privacy concerns

Cohere Embeddings

embed-english-v3.0

  • Dimensions: 1024
  • Cost: $0.10 / 1M tokens
  • Best for: English-only applications

import cohere

co = cohere.Client('your-api-key')

# v3 models require an input_type; use "search_query" for queries
response = co.embed(
    texts=["your text"],
    model="embed-english-v3.0",
    input_type="search_query"
)

embedding = response.embeddings[0]
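
The v3 models are asymmetric: embed documents with input_type="search_document" at index time and queries with input_type="search_query" at query time:

# index-time call; pair it with input_type="search_query" at query time
doc_response = co.embed(
    texts=["To reset your password, go to ..."],
    model="embed-english-v3.0",
    input_type="search_document"
)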

Selection Matrix

| Use Case | Recommended Model | Reasoning |
| --- | --- | --- |
| Startup/Prototype | all-MiniLM-L6-v2 | Fast iteration, low cost |
| Production (General) | all-mpnet-base-v2 | Best balance |
| High Accuracy | BAAI/bge-large-en-v1.5 | Maximum quality |
| Multilingual | paraphrase-multilingual-mpnet-base-v2 | 50+ languages |
| Code Search | microsoft/codebert-base | Code-specific |
| Legal/Medical | Fine-tuned domain model | Specialized terminology |
| Quick Prototype | text-embedding-3-small | No setup needed |

Evaluation Methodology

1. Create Test Dataset

test_queries = [
    {
        "query": "How do I reset my password?",
        "expected_doc_id": "doc_123"
    },
    {
        "query": "Refund policy",
        "expected_doc_id": "doc_456"
    }
    # ... 100+ examples
]

2. Benchmark Models

from sentence_transformers import SentenceTransformer
import time

def benchmark_model(model_name, test_queries):
    model = SentenceTransformer(model_name)

    # Measure speed
    start = time.time()
    embeddings = model.encode([q['query'] for q in test_queries])
    speed = len(test_queries) / (time.time() - start)

    # Measure accuracy (recall@5); `search` is your retrieval function
    # and must run against documents embedded with this same model
    correct = 0
    for i, query in enumerate(test_queries):
        results = search(embeddings[i], k=5)
        if query['expected_doc_id'] in results:
            correct += 1

    accuracy = correct / len(test_queries)

    return {
        'model': model_name,
        'speed': speed,         # queries/sec
        'accuracy': accuracy    # recall@5
    }

# Compare models
models = [
    'all-MiniLM-L6-v2',
    'all-mpnet-base-v2',
    'BAAI/bge-large-en-v1.5'
]

results = [benchmark_model(m, test_queries) for m in models]
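
The search helper above is left unspecified; a minimal in-memory version with numpy (assuming doc_embeddings and doc_ids were built with the same model under test) could look like:

import numpy as np

def make_search(doc_embeddings, doc_ids):
    """Build a search(query_vec, k) function over a small corpus."""
    docs = np.asarray(doc_embeddings, dtype=float)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    def search(query_vec, k=5):
        q = np.asarray(query_vec, dtype=float)
        q = q / np.linalg.norm(q)
        scores = docs @ q                  # cosine similarity per doc
        top = np.argsort(-scores)[:k]      # indices of the top-k docs
        return [doc_ids[i] for i in top]

    return search

Bind it once per model, e.g. search = make_search(doc_embeddings, doc_ids), after re-embedding the documents with the model under test.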

3. Cost Analysis

def calculate_monthly_cost(
    queries_per_day,
    avg_query_tokens=50,
    model_type='api',          # 'api' or 'self-hosted'
    api_cost_per_1m=0.02,      # OpenAI text-embedding-3-small
    gpu_cost_per_month=100     # example flat rate for a GPU instance
):
    if model_type == 'api':
        monthly_tokens = queries_per_day * avg_query_tokens * 30
        return (monthly_tokens / 1_000_000) * api_cost_per_1m
    else:
        # Self-hosted: roughly a flat infrastructure cost,
        # independent of query volume
        return gpu_cost_per_month

# Example
api_cost = calculate_monthly_cost(
    queries_per_day=10000,
    avg_query_tokens=50,
    model_type='api'
)
print(f"API cost: ${api_cost}/month")

self_hosted_cost = calculate_monthly_cost(
    queries_per_day=10000,
    model_type='self-hosted'
)
print(f"Self-hosted cost: ${self_hosted_cost}/month")

Domain-Specific Considerations

Code Search

Use code-specific models:

# For code similarity (note: codebert-base loads with default mean
# pooling; it is not trained as a sentence encoder out of the box)
model = SentenceTransformer('microsoft/codebert-base')

# Or fine-tune on your codebase with paired examples
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

train_examples = [
    InputExample(texts=['def hello():', 'function hello() {}']),
    # ... more code pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)

E-commerce

Prioritize product attribute understanding:

# Fine-tune on product descriptions
model = SentenceTransformer('all-mpnet-base-v2')

# Train with product-specific data
train_data = [
    ("red cotton t-shirt", "scarlet cotton tee"),
    ("wireless headphones", "bluetooth earphones"),
]
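
These pairs feed the same fine-tuning recipe shown under Code Search; the only extra step is wrapping them as InputExample pairs:

# wrap (query phrasing, catalog phrasing) pairs for the
# MultipleNegativesRankingLoss recipe shown above
train_examples = [InputExample(texts=[a, b]) for a, b in train_data]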

Legal/Medical

Critical: Use domain-specific or fine-tuned models:

# Medical
model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')

# Legal
# Fine-tune on legal corpus (case law, contracts)

Migration Strategy

Switching Models

When upgrading models, re-embed every document: vectors produced by different models occupy incompatible embedding spaces and cannot be mixed in one index:

def migrate_embeddings(old_model, new_model, documents):
    """Migrate from one model to another.

    old_model is kept only for logging/rollback; its vectors
    cannot be reused, so the whole corpus is re-encoded.
    """

    # Load new model
    new_encoder = SentenceTransformer(new_model)

    # Re-embed all documents
    new_embeddings = new_encoder.encode(
        documents,
        batch_size=32,
        show_progress_bar=True
    )

    # Update vector database (placeholder for your store's upsert/reindex API)
    update_vector_db(new_embeddings)

    return new_embeddings

# Usage
migrate_embeddings(
    old_model='all-MiniLM-L6-v2',
    new_model='all-mpnet-base-v2',
    documents=all_docs
)

A/B Testing

Test new models before full migration:

import random

def ab_test_models(query, model_a, model_b):
    """Compare two models on the same live traffic."""

    # Route 50% of queries to each model
    model = model_a if random.random() < 0.5 else model_b

    results = search_with_model(query, model)  # your retrieval function

    # Log the assignment and results; join with feedback signals
    # (clicks, ratings) later for offline analysis
    log_search(query, model, results)

    return results
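
Per-query random routing can show the same user different models within one session; a common refinement is deterministic bucketing, sketched here with a hypothetical user_id:

import hashlib

def assign_model(user_id, model_a, model_b):
    """Stable 50/50 split: a given user always sees the same model."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return model_a if int(digest, 16) % 2 == 0 else model_b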

Performance Optimization

ONNX Export & Quantization

Speed up inference, and optionally shrink the model, with minimal quality loss:

from optimum.onnxruntime import ORTModelForFeatureExtraction

# Convert to ONNX for faster CPU inference
model = ORTModelForFeatureExtraction.from_pretrained(
    'sentence-transformers/all-mpnet-base-v2',
    export=True
)

# Typically 2-3x faster inference with minimal accuracy loss
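
For true int8 quantization on top of the ONNX export, optimum provides ORTQuantizer. A minimal sketch, assuming a CPU with AVX512-VNNI support (other AutoQuantizationConfig presets target other hardware):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# dynamic int8 quantization of the exported model above
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="mpnet-onnx-int8", quantization_config=qconfig)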

Caching Strategy

Cache embeddings for common queries:

from functools import lru_cache

@lru_cache(maxsize=1000)
def get_query_embedding(query):
    # strings are hashable, so raw query text works as the cache key;
    # encode() returns a numpy array: treat cached results as read-only,
    # since mutating one would corrupt every later cache hit
    return model.encode(query)

# Subsequent calls with the same query string are served from the cache

Decision Checklist

  • Defined latency requirements
  • Measured query volume
  • Calculated cost (API vs. self-hosted)
  • Created evaluation dataset
  • Benchmarked 2-3 candidate models
  • Tested on real user queries
  • Considered domain-specific needs
  • Planned migration strategy
