Vendor & Technology Evaluation

Framework for evaluating embedding models, vector databases, and LLM providers for RAG systems.

Overview

Choosing the right vendors and technologies is critical for RAG success. This guide provides evaluation criteria for the key components of your RAG stack.

Build vs Buy Decision

When to Build

  • Unique requirements not met by existing solutions
  • Cost at scale makes self-hosting cheaper (>10M queries/month)
  • Data sensitivity requires on-premise deployment
  • Deep customization needed for competitive advantage

When to Buy

  • Speed to market is critical (<3 months)
  • Limited ML expertise on the team
  • Standard use cases well-served by existing tools
  • Managed services reduce operational burden

Embedding Model Evaluation

Key Criteria

Criterion          | Weight | Evaluation Method
-------------------|--------|----------------------------------------
Retrieval Accuracy | 40%    | Benchmark on your data (Recall@5, MRR)
Cost               | 25%    | Price per 1M tokens × expected volume
Latency            | 20%    | P95 latency for batch and real-time
Language Support   | 10%    | Coverage of required languages
Licensing          | 5%     | Commercial use restrictions
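
To compare candidates across these criteria, collapse them into a single weighted score. A minimal sketch, assuming each model's raw measurements have already been normalized to a 0-1 scale where 1 is best (the example scores are illustrative, not benchmark results):

# Weights from the criteria table above
WEIGHTS = {"accuracy": 0.40, "cost": 0.25, "latency": 0.20,
           "languages": 0.10, "licensing": 0.05}

def weighted_score(normalized: dict) -> float:
    # normalized: criterion -> score in [0, 1], where 1 is best
    return sum(weight * normalized[criterion]
               for criterion, weight in WEIGHTS.items())

# Illustrative comparison
print(weighted_score({"accuracy": 0.92, "cost": 0.50, "latency": 0.70,
                      "languages": 1.00, "licensing": 1.00}))  # e.g. a managed API
print(weighted_score({"accuracy": 0.85, "cost": 1.00, "latency": 0.80,
                      "languages": 0.60, "licensing": 1.00}))  # e.g. a self-hosted model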

Top Options (2024)

Managed APIs:

  • OpenAI text-embedding-3-large: Best accuracy, $0.13/1M tokens
  • Cohere Embed v3: Multilingual, $0.10/1M tokens
  • Voyage AI: Domain-specific models, $0.12/1M tokens

Self-Hosted:

  • BGE-large-en-v1.5: Free, good accuracy, requires GPU
  • E5-mistral-7b-instruct: Excellent for long documents
  • Multilingual-e5-large: Best for non-English
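
All three self-hosted options are published on Hugging Face and load through the sentence-transformers library. A minimal sketch for BGE (a GPU is used automatically if one is available):

from sentence_transformers import SentenceTransformer

# Downloads BAAI/bge-large-en-v1.5 from Hugging Face on first use
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Unit-normalized vectors, so dot product == cosine similarity
embeddings = model.encode(["How do I reset my password?"],
                          normalize_embeddings=True)
print(embeddings.shape)  # (1, 1024)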

Evaluation Process

# 1. Create a test set from your domain: (query, relevant document) pairs
test_queries = [
    ("How do I reset my password?", "password_reset_doc.txt"),
    # ... 50-100 query-document pairs
]

# 2. Benchmark each model. evaluate_retrieval, measure_latency and
#    calculate_cost are your own harness functions; a sketch of the
#    retrieval metrics follows below.
models = ["openai", "cohere", "bge-large"]
for model in models:
    recall_at_5, mrr = evaluate_retrieval(model, test_queries)
    latency_p95 = measure_latency(model)
    cost_per_1m = calculate_cost(model, expected_volume)

    print(f"{model}: Recall@5={recall_at_5:.2f}, MRR={mrr:.2f}, "
          f"P95={latency_p95}ms, Cost=${cost_per_1m}")

Vector Database Evaluation

Key Criteria

Criterion         | Weight | Evaluation Method
------------------|--------|-----------------------------------------
Query Performance | 30%    | QPS at target scale with your data
Cost              | 25%    | Total cost (storage + compute + egress)
Scalability       | 20%    | Max vectors, horizontal scaling
Features          | 15%    | Metadata filtering, hybrid search
Reliability       | 10%    | SLA, uptime history

Top Options

Managed Services:

  • Pinecone: Easiest to use, $70/month starter, auto-scaling
  • Weaviate Cloud: Hybrid search, $25/month starter
  • Qdrant Cloud: Fast, $95/month starter, good for high-throughput

Self-Hosted:

  • Qdrant: Fast, Docker-friendly, good for on-prem
  • Milvus: Highly scalable, complex setup
  • pgvector (Postgres): Simple, good for <1M vectors (query sketch below)
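
For the pgvector route, retrieval is plain SQL. A minimal sketch with psycopg2 (connection string, table, and column names are illustrative; `<->` is pgvector's L2 distance operator):

import psycopg2

conn = psycopg2.connect("dbname=rag")  # hypothetical database
cur = conn.cursor()

query_embedding = [0.1, 0.2, 0.3]  # your real query vector here

# Pass the vector as a '[x,y,z]' text literal and cast it;
# ORDER BY embedding <-> ... sorts by L2 distance
literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
cur.execute(
    "SELECT id, content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
    (literal,),
)
for doc_id, content in cur.fetchall():
    print(doc_id, content[:80])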

Decision Matrix

Volume < 1M vectors?
├─ YES → pgvector (simplest, cheapest)
└─ NO → Continue
    │
    Need hybrid search?
    ├─ YES → Weaviate or Qdrant
    └─ NO → Continue
        │
        Budget > $500/month?
        ├─ YES → Pinecone (best DX)
        └─ NO → Self-hosted Qdrant
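
If you want the same logic in a capacity-planning script, a direct transcription of the matrix (the thresholds mirror the tree above):

def pick_vector_db(num_vectors: int, need_hybrid_search: bool,
                   monthly_budget_usd: float) -> str:
    # Direct transcription of the decision matrix above
    if num_vectors < 1_000_000:
        return "pgvector"            # simplest, cheapest
    if need_hybrid_search:
        return "Weaviate or Qdrant"
    if monthly_budget_usd > 500:
        return "Pinecone"            # best developer experience
    return "self-hosted Qdrant"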

Benchmark Template

# Load test with your expected traffic; db is a thin wrapper around the
# client under test, exposing insert() and query()
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def benchmark_vector_db(db, vectors, query_vectors, qps_target):
    # 1. Insert throughput
    start = time.time()
    db.insert(vectors, batch_size=1000)
    insert_time = time.time() - start

    # 2. Query latency: time each query individually
    def timed_query(query_vector):
        t0 = time.time()
        db.query(query_vector)
        return (time.time() - t0) * 1000  # milliseconds

    # Concurrency level approximates the target QPS; use a dedicated
    # load tool if you need precise request shaping
    with ThreadPoolExecutor(max_workers=qps_target) as executor:
        latencies = list(executor.map(timed_query, query_vectors))

    return {
        "insert_time_s": insert_time,
        "p50_latency_ms": np.percentile(latencies, 50),
        "p95_latency_ms": np.percentile(latencies, 95),
        "p99_latency_ms": np.percentile(latencies, 99),
    }

LLM Provider Evaluation

Key Criteria

Criterion      | Weight | Evaluation Method
---------------|--------|-------------------------------
Answer Quality | 35%    | Human eval on 100 test cases
Cost           | 30%    | Price per 1M tokens × usage
Latency        | 20%    | Time to first token (TTFT)
Context Window | 10%    | Max tokens for RAG context
Reliability    | 5%     | Uptime, rate limits
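
TTFT only shows up with streaming enabled, so measure it against the streaming endpoint. A sketch assuming a client wrapper whose `stream(prompt)` method yields tokens as they arrive (the wrapper is yours to write; each vendor's streaming API differs):

import time

def time_to_first_token_ms(client, prompt):
    start = time.time()
    for _token in client.stream(prompt):  # hypothetical streaming wrapper
        return (time.time() - start) * 1000  # stop at the first token
    return None  # model produced no output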

Top Options (2024)

Provider  | Model             | Context | Cost per 1M tokens (Input/Output) | Best For
----------|-------------------|---------|-----------------------------------|-----------------
OpenAI    | GPT-4o            | 128k    | $2.50 / $10.00                    | General purpose
Anthropic | Claude 3.5 Sonnet | 200k    | $3.00 / $15.00                    | Long documents
Google    | Gemini 1.5 Pro    | 1M      | $1.25 / $5.00                     | Massive context
Groq      | Llama 3 70B       | 8k      | $0.59 / $0.79                     | Speed critical

Evaluation Framework

from statistics import mean

# 1. Define test cases
test_cases = [
    {
        "query": "What is our refund policy?",
        "context": "...",  # retrieved documents for this query
        "expected_answer": "...",
        "rubric": ["accuracy", "completeness", "citation"],
    },
    # ... 100 test cases
]

# 2. Run evaluation. providers maps a name to a client wrapper that
#    exposes generate(query, context); a cost sketch follows below.
for name, provider in providers.items():
    scores = []
    for test in test_cases:
        answer = provider.generate(test["query"], test["context"])
        # human_eval scores the answer against the expectation and rubric
        score = human_eval(answer, test["expected_answer"], test["rubric"])
        scores.append(score)

    avg_score = mean(scores)
    cost = calculate_cost(provider, test_cases)
    print(f"{name}: Score={avg_score:.2f}, Cost=${cost:.2f}")

Total Cost of Ownership (TCO)

5-Year TCO Calculation

TCO = Initial Setup + (Annual Operating Costs × 5)

Initial Setup:
- Engineering time: $50k-200k
- Infrastructure setup: $10k-50k
- Data preparation: $20k-100k

Annual Operating Costs:
- Embedding API: query volume × ~$0.0001/query
- Vector DB: $1k-10k/year
- LLM API: query volume × ~$0.01/query
- Infrastructure: $5k-50k/year
- Maintenance: 0.5 FTE × $150k = $75k/year
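
The same arithmetic as a function, so you can sweep query volume; the default rates are the planning figures above, not vendor quotes:

def five_year_tco(queries_per_month, setup_cost,
                  embed_rate=0.0001, llm_rate=0.01,  # $/query, from above
                  vector_db_annual=5_000, infra_annual=20_000,
                  maintenance_annual=75_000):
    queries_per_year = queries_per_month * 12
    annual = (queries_per_year * (embed_rate + llm_rate)
              + vector_db_annual + infra_annual + maintenance_annual)
    return setup_cost + 5 * annual

# Example: 1M queries/month, $150k setup
print(f"${five_year_tco(1_000_000, 150_000):,.0f}")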

Example: 1M Queries/Month

Managed Stack (Pinecone + OpenAI):

  • Embeddings: 1M × $0.0001 = $100/month
  • Vector DB: $200/month (Pinecone)
  • LLM: 1M × $0.01 = $10,000/month
  • Total: ~$125k/year

Self-Hosted Stack:

  • Embeddings: Free (self-hosted)
  • Vector DB: $500/month (EC2 + storage)
  • LLM: $10,000/month (OpenAI)
  • Engineering: $75k/year (0.5 FTE)
  • Total: ~$200k/year

Conclusion: at this volume the managed stack wins; self-hosting only pays off once per-query API savings outgrow the ~$75k/year engineering cost, roughly 5M+ queries/month.

Vendor Lock-In Mitigation

Abstraction Strategies

# Put an abstraction layer between your app and the vendor SDKs
from abc import ABC, abstractmethod
from typing import List

import cohere
import numpy as np
import openai

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, texts: List[str]) -> np.ndarray:
        ...

class OpenAIEmbeddings(EmbeddingProvider):
    def __init__(self):
        self.client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(self, texts):
        resp = self.client.embeddings.create(
            model="text-embedding-3-large", input=texts)
        return np.array([item.embedding for item in resp.data])

class CohereEmbeddings(EmbeddingProvider):
    def __init__(self):
        self.client = cohere.Client()  # reads CO_API_KEY from the environment

    def embed(self, texts):
        resp = self.client.embed(texts=texts, model="embed-english-v3.0",
                                 input_type="search_document")
        return np.array(resp.embeddings)

# Swapping providers is now a one-line change
embedder = OpenAIEmbeddings()  # or CohereEmbeddings()

Data Portability

  • Export vector DB monthly (backup; see the sketch below)
  • Store original documents separately
  • Use standard formats (JSONL, Parquet)
  • Document embedding model version
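
A sketch of that monthly export, assuming each record carries its ID, vector, and metadata (field names are illustrative; Parquet works the same way via pandas or pyarrow):

import json

def export_vectors(records, path, embedding_model_version):
    # One JSON object per line (JSONL): streamable, diff-able, vendor-neutral
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps({
                "id": rec["id"],
                "vector": rec["vector"],
                "metadata": rec["metadata"],
                # Record the model version so vectors can be re-embedded consistently
                "embedding_model": embedding_model_version,
            }) + "\n")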

Evaluation Checklist

Before selecting vendors:

  • Benchmark on your actual data (not public datasets)
  • Test at expected scale (10x current volume)
  • Calculate 5-year TCO for top 3 options
  • Review SLAs and uptime history
  • Check data residency requirements
  • Evaluate support quality (response time, expertise)
  • Test migration path (can you switch later?)
  • Verify security certifications (SOC2, ISO 27001)

Next Steps