Vendor & Technology Evaluation
Framework for evaluating embedding models, vector databases, and LLM providers for RAG systems.
Overview
Choosing the right vendors and technologies is critical for RAG success. This guide provides evaluation criteria for the key components of your RAG stack.
Build vs Buy Decision
When to Build
- Unique requirements not met by existing solutions
- Cost at scale makes self-hosting cheaper (>10M queries/month)
- Data sensitivity requires on-premise deployment
- Deep customization needed for competitive advantage
When to Buy
- Speed to market is critical (<3 months)
- Limited ML expertise on the team
- Standard use cases well-served by existing tools
- Managed services reduce operational burden
Embedding Model Evaluation
Key Criteria
| Criterion | Weight | Evaluation Method |
|---|---|---|
| Retrieval Accuracy | 40% | Benchmark on your data (Recall@5, MRR) |
| Cost | 25% | Price per 1M tokens × expected volume |
| Latency | 20% | P95 latency for batch and real-time |
| Language Support | 10% | Coverage of required languages |
| Licensing | 5% | Commercial use restrictions |
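To turn these criteria into a single comparable number, normalize each candidate's raw metrics to a 0-1 scale and combine them with the weights above. A minimal sketch (the weights are the ones in this table; the normalization scheme is up to you):

```python
# Weighted scoring for embedding-model candidates. Each metric must already be
# normalized to 0-1 with higher = better (e.g. invert cost and latency first).
WEIGHTS = {
    "retrieval_accuracy": 0.40,
    "cost": 0.25,
    "latency": 0.20,
    "language_support": 0.10,
    "licensing": 0.05,
}

def weighted_score(normalized: dict) -> float:
    return sum(WEIGHTS[criterion] * normalized[criterion] for criterion in WEIGHTS)

# Example: one candidate scored 0-1 on each criterion
print(weighted_score({
    "retrieval_accuracy": 0.82, "cost": 0.60, "latency": 0.75,
    "language_support": 1.0, "licensing": 1.0,
}))
```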
Top Options (2024)
Managed APIs:
- OpenAI text-embedding-3-large: Best accuracy, $0.13/1M tokens
- Cohere Embed v3: Multilingual, $0.10/1M tokens
- Voyage AI: Domain-specific models, $0.12/1M tokens
Self-Hosted:
- BGE-large-en-v1.5: Free, good accuracy, requires GPU
- E5-mistral-7b-instruct: Excellent for long documents
- Multilingual-e5-large: Best for non-English
Evaluation Process
```python
# 1. Create test set from your domain
test_queries = [
    ("How do I reset my password?", "password_reset_doc.txt"),
    # ... 50-100 query-document pairs
]

# 2. Benchmark each model. evaluate_retrieval, measure_latency, and calculate_cost
#    are your own helpers; see the Recall@5/MRR sketch below for the first one.
models = ["openai", "cohere", "bge-large"]
expected_volume = 1_000_000  # expected monthly embedding volume (tokens); adjust to your traffic

for model in models:
    recall_at_5 = evaluate_retrieval(model, test_queries)
    latency_p95 = measure_latency(model)
    cost_per_1m = calculate_cost(model, expected_volume)
    print(f"{model}: Recall@5={recall_at_5}, P95={latency_p95}ms, Cost=${cost_per_1m}")
```
Vector Database Evaluation
Key Criteria
| Criterion | Weight | Evaluation Method |
|---|---|---|
| Query Performance | 30% | QPS at target scale with your data |
| Cost | 25% | Total cost (storage + compute + egress) |
| Scalability | 20% | Max vectors, horizontal scaling |
| Features | 15% | Metadata filtering, hybrid search |
| Reliability | 10% | SLA, uptime history |
Top Options
Managed Services:
- Pinecone: Easiest to use, $70/month starter, auto-scaling
- Weaviate Cloud: Hybrid search, $25/month starter
- Qdrant Cloud: Fast, $95/month starter, good for high-throughput
Self-Hosted:
- Qdrant: Fast, Docker-friendly, good for on-prem
- Milvus: Highly scalable, complex setup
- pgvector (Postgres): Simple, good for <1M vectors
Decision Matrix
```
Volume < 1M vectors?
├─ YES → pgvector (simplest, cheapest)
└─ NO → Continue
    │
Need hybrid search?
├─ YES → Weaviate or Qdrant
└─ NO → Continue
    │
Budget > $500/month?
├─ YES → Pinecone (best DX)
└─ NO → Self-hosted Qdrant
```
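The same flow can be encoded as a small helper so the decision is explicit in code. This is just a sketch of the matrix above, not a hard rule; adjust the thresholds to your own constraints:

```python
def choose_vector_db(num_vectors: int, need_hybrid_search: bool, monthly_budget_usd: float) -> str:
    # Encodes the decision matrix above
    if num_vectors < 1_000_000:
        return "pgvector"
    if need_hybrid_search:
        return "Weaviate or Qdrant"
    if monthly_budget_usd > 500:
        return "Pinecone"
    return "Self-hosted Qdrant"

print(choose_vector_db(5_000_000, need_hybrid_search=False, monthly_budget_usd=300))  # Self-hosted Qdrant
```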
Benchmark Template
```python
# Load test with your expected traffic
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def benchmark_vector_db(db, vectors, qps_target, dim=1536):
    # `db` is a thin wrapper around your client exposing insert() and query()

    # 1. Insert vectors and measure bulk-load time
    start = time.time()
    db.insert(vectors, batch_size=1000)
    insert_time = time.time() - start

    # 2. Query performance: time each query individually under concurrent load
    def timed_query(_):
        q_start = time.time()
        db.query(np.random.rand(dim).tolist())
        return (time.time() - q_start) * 1000  # milliseconds

    with ThreadPoolExecutor(max_workers=qps_target) as executor:
        latencies = list(executor.map(timed_query, range(1000)))

    return {
        "insert_time_s": insert_time,
        "p50_latency_ms": np.percentile(latencies, 50),
        "p95_latency_ms": np.percentile(latencies, 95),
        "p99_latency_ms": np.percentile(latencies, 99),
    }
```
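Hypothetical usage, assuming `QdrantWrapper` is an adapter you write around your client library:

```python
import numpy as np

# QdrantWrapper is a hypothetical adapter exposing insert(vectors, batch_size=...) and query(vector)
db = QdrantWrapper(collection="bench")
vectors = np.random.rand(100_000, 1536).tolist()  # synthetic data at your target scale
print(benchmark_vector_db(db, vectors, qps_target=50))
```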
LLM Provider Evaluation
Key Criteria
| Criterion | Weight | Evaluation Method |
|---|---|---|
| Answer Quality | 35% | Human eval on 100 test cases |
| Cost | 30% | Price per 1M tokens × usage |
| Latency | 20% | Time to first token (TTFT) |
| Context Window | 10% | Max tokens for RAG context |
| Reliability | 5% | Uptime, rate limits |
Top Options (2024)
| Provider | Model | Context | Cost per 1M Tokens (Input / Output) | Best For |
|---|---|---|---|---|
| OpenAI | GPT-4o | 128k | $2.50 / $10.00 | General purpose |
| Anthropic | Claude 3.5 Sonnet | 200k | $3.00 / $15.00 | Long documents |
| Google | Gemini 1.5 Pro | 1M | $1.25 / $5.00 | Massive context |
| Groq | Llama 3 70B | 8k | $0.59 / $0.79 | Speed critical |
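To translate per-token prices into a per-query figure, multiply by your typical prompt and completion sizes. A rough sketch; the 4,000 input / 500 output token figures are assumptions, not measurements:

```python
# Per-query LLM cost = input_tokens / 1M × input_price + output_tokens / 1M × output_price
PRICES = {  # $ per 1M tokens (input, output), from the table above
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Pro": (1.25, 5.00),
    "Llama 3 70B (Groq)": (0.59, 0.79),
}

input_tokens, output_tokens = 4_000, 500  # assumed typical RAG prompt and answer sizes

for model, (price_in, price_out) in PRICES.items():
    cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
    print(f"{model}: ~${cost:.4f} per query")
```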
Evaluation Framework
```python
from statistics import mean

# 1. Define test cases
test_cases = [
    {
        "query": "What is our refund policy?",
        "context": "...",  # Retrieved documents
        "expected_answer": "...",
        "rubric": ["accuracy", "completeness", "citation"]
    },
    # ... 100 test cases
]

# 2. Run evaluation. `providers` maps each name to a thin wrapper with a
#    generate(query, context) method; human_eval() and calculate_cost() are
#    your own scoring and pricing helpers.
providers = {"openai": openai_client, "anthropic": anthropic_client, "google": gemini_client}

for name, provider in providers.items():
    scores = []
    for test in test_cases:
        answer = provider.generate(test["query"], test["context"])
        scores.append(human_eval(answer, test["expected_answer"], test["rubric"]))
    avg_score = mean(scores)
    cost = calculate_cost(name, test_cases)
    print(f"{name}: Score={avg_score:.2f}, Cost=${cost:.2f}")
```
Total Cost of Ownership (TCO)
5-Year TCO Calculation
TCO = Initial Setup + (Annual Operating Costs × 5)
Initial Setup:
- Engineering time: $50k-200k
- Infrastructure setup: $10k-50k
- Data preparation: $20k-100k
Annual Operating Costs:
- Embedding API: annual query volume × ~$0.0001 per query
- Vector DB: $1k-10k/year
- LLM API: annual query volume × ~$0.01 per query
- Infrastructure: $5k-50k/year
- Maintenance: 0.5 FTE × $150k = $75k
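As a sanity check, the breakdown above can be wired into a small helper. The defaults are the illustrative figures from this section; replace them with your own quotes:

```python
def five_year_tco(
    setup_engineering, setup_infra, setup_data_prep,   # one-time costs ($)
    queries_per_month,
    embed_price_per_query=0.0001,    # illustrative figure from this section
    llm_price_per_query=0.01,        # illustrative figure from this section
    vector_db_annual=5_000,
    infra_annual=20_000,
    maintenance_annual=75_000,       # 0.5 FTE × $150k
):
    initial = setup_engineering + setup_infra + setup_data_prep
    annual_api = queries_per_month * 12 * (embed_price_per_query + llm_price_per_query)
    annual = annual_api + vector_db_annual + infra_annual + maintenance_annual
    return initial + annual * 5

print(f"${five_year_tco(100_000, 20_000, 50_000, queries_per_month=1_000_000):,.0f}")
```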
Example: 1M Queries/Month
Managed Stack (Pinecone + OpenAI):
- Embeddings: 1M × $0.0001 = $100/month
- Vector DB: $200/month (Pinecone)
- LLM: 1M × $0.01 = $10,000/month
- Total: ~$125k/year
Self-Hosted Stack:
- Embeddings: Free (self-hosted)
- Vector DB: $500/month (EC2 + storage)
- LLM: $10,000/month (OpenAI)
- Engineering: $75k/year (0.5 FTE)
- Total: ~$200k/year
Conclusion: Managed is cheaper until ~5M queries/month.
Vendor Lock-In Mitigation
Abstraction Strategies
```python
# Use an abstraction layer so providers can be swapped without touching call sites;
# API keys are read from the providers' usual environment variables.
from abc import ABC, abstractmethod
from typing import List
import numpy as np

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, texts: List[str]) -> np.ndarray:
        pass

class OpenAIEmbeddings(EmbeddingProvider):
    def embed(self, texts):
        from openai import OpenAI
        resp = OpenAI().embeddings.create(model="text-embedding-3-large", input=texts)
        return np.array([d.embedding for d in resp.data])

class CohereEmbeddings(EmbeddingProvider):
    def embed(self, texts):
        import cohere
        resp = cohere.Client().embed(texts=texts, model="embed-english-v3.0", input_type="search_document")
        return np.array(resp.embeddings)

# Easy to swap providers
embedder = OpenAIEmbeddings()  # or CohereEmbeddings()
```
Data Portability
- Export vector DB monthly (backup)
- Store original documents separately
- Use standard formats (JSONL, Parquet)
- Document embedding model version
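A minimal export sketch, assuming your store exposes some way to iterate IDs, vectors, and metadata (`store.iter_all()` here is hypothetical):

```python
import json

# Dump vectors plus metadata to JSONL so they can be re-imported into another store
with open("vectors_backup.jsonl", "w") as f:
    for doc_id, vector, metadata in store.iter_all():   # hypothetical iterator over your store
        f.write(json.dumps({
            "id": doc_id,
            "vector": vector,
            "metadata": metadata,
            "embedding_model": "text-embedding-3-large",  # record which model produced the vector
        }) + "\n")
```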
Evaluation Checklist
Before selecting vendors:
- Benchmark on your actual data (not public datasets)
- Test at expected scale (10x current volume)
- Calculate 5-year TCO for top 3 options
- Review SLAs and uptime history
- Check data residency requirements
- Evaluate support quality (response time, expertise)
- Test migration path (can you switch later?)
- Verify security certifications (SOC2, ISO 27001)
Next Steps
- Decision Framework - Decide if RAG is right for you
- Team Structure - Build the right team
- Cost Optimization - Reduce ongoing costs