Vendor & Technology Evaluation
Framework for evaluating embedding models, vector databases, and LLM providers for RAG systems.
Overview
Choosing the right vendors and technologies is critical for RAG success. This guide provides evaluation criteria for the key components of your RAG stack.
Build vs Buy Decision
When to Build
- Unique requirements not met by existing solutions
- Cost at scale makes self-hosting cheaper (>10M queries/month)
- Data sensitivity requires on-premise deployment
- Deep customization needed for competitive advantage
When to Buy
- Speed to market is critical (<3 months)
- Limited ML expertise on the team
- Standard use cases well-served by existing tools
- Managed services reduce operational burden
Embedding Model Evaluation
Key Criteria
| Criterion | Weight | Evaluation Method |
|---|---|---|
| Retrieval Accuracy | 40% | Benchmark on your data (Recall@5, MRR) |
| Cost | 25% | Price per 1M tokens × expected volume |
| Latency | 20% | P95 latency for batch and real-time |
| Language Support | 10% | Coverage of required languages |
| Licensing | 5% | Commercial use restrictions |
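To turn these criteria into a single comparable number, normalize each candidate's raw metrics to a 0-1 scale and combine them with the weights above. A minimal sketch (the weights are the ones in this table; the normalization scheme is up to you):

```python
# Weighted scoring for embedding-model candidates. Each metric must already be
# normalized to 0-1 with higher = better (e.g. invert cost and latency first).
WEIGHTS = {
    "retrieval_accuracy": 0.40,
    "cost": 0.25,
    "latency": 0.20,
    "language_support": 0.10,
    "licensing": 0.05,
}

def weighted_score(normalized: dict) -> float:
    return sum(WEIGHTS[criterion] * normalized[criterion] for criterion in WEIGHTS)

# Example: one candidate scored 0-1 on each criterion
print(weighted_score({
    "retrieval_accuracy": 0.82, "cost": 0.60, "latency": 0.75,
    "language_support": 1.0, "licensing": 1.0,
}))
```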
Top Options (2024)
Managed APIs:
- OpenAI text-embedding-3-large: Best accuracy, $0.13/1M tokens
- Cohere Embed v3: Multilingual, $0.10/1M tokens
- Voyage AI: Domain-specific models, $0.12/1M tokens
Self-Hosted:
- BGE-large-en-v1.5: Free, good accuracy, requires GPU
- E5-mistral-7b-instruct: Excellent for long documents
- Multilingual-e5-large: Best for non-English
Evaluation Process
```python
# 1. Create test set from your domain
test_queries = [
    ("How do I reset my password?", "password_reset_doc.txt"),
    # ... 50-100 query-document pairs
]

# 2. Benchmark each model. evaluate_retrieval, measure_latency, and calculate_cost
#    are your own helpers; see the Recall@5/MRR sketch below for the first one.
models = ["openai", "cohere", "bge-large"]
expected_volume = 1_000_000  # expected monthly embedding volume (tokens); adjust to your traffic

for model in models:
    recall_at_5 = evaluate_retrieval(model, test_queries)
    latency_p95 = measure_latency(model)
    cost_per_1m = calculate_cost(model, expected_volume)
    print(f"{model}: Recall@5={recall_at_5}, P95={latency_p95}ms, Cost=${cost_per_1m}")
```
Vector Database Evaluation
Key Criteria
| Criterion | Weight | Evaluation Method |
|---|---|---|
| Query Performance | 30% | QPS at target scale with your data |
| Cost | 25% | Total cost (storage + compute + egress) |
| Scalability | 20% | Max vectors, horizontal scaling |
| Features | 15% | Metadata filtering, hybrid search |
| Reliability | 10% | SLA, uptime history |
Top Options
Managed Services:
- Pinecone: Easiest to use, $70/month starter, auto-scaling
- Weaviate Cloud: Hybrid search, $25/month starter
- Qdrant Cloud: Fast, $95/month starter, good for high-throughput
Self-Hosted:
- Qdrant: Fast, Docker-friendly, good for on-prem
- Milvus: Highly scalable, complex setup
- pgvector (Postgres): Simple, good for <1M vectors
Decision Matrix
```
Volume < 1M vectors?
├─ YES → pgvector (simplest, cheapest)
└─ NO → Continue
    │
Need hybrid search?
├─ YES → Weaviate or Qdrant
└─ NO → Continue
    │
Budget > $500/month?
├─ YES → Pinecone (best DX)
└─ NO → Self-hosted Qdrant
```
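The same flow can be encoded as a small helper so the decision is explicit in code. This is just a sketch of the matrix above, not a hard rule; adjust the thresholds to your own constraints:

```python
def choose_vector_db(num_vectors: int, need_hybrid_search: bool, monthly_budget_usd: float) -> str:
    # Encodes the decision matrix above
    if num_vectors < 1_000_000:
        return "pgvector"
    if need_hybrid_search:
        return "Weaviate or Qdrant"
    if monthly_budget_usd > 500:
        return "Pinecone"
    return "Self-hosted Qdrant"

print(choose_vector_db(5_000_000, need_hybrid_search=False, monthly_budget_usd=300))  # Self-hosted Qdrant
```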
Benchmark Template
```python
# Load test with your expected traffic
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def benchmark_vector_db(db, vectors, qps_target, dim=1536):
    # `db` is a thin wrapper around your client exposing insert() and query()

    # 1. Insert vectors and measure bulk-load time
    start = time.time()
    db.insert(vectors, batch_size=1000)
    insert_time = time.time() - start

    # 2. Query performance: time each query individually under concurrent load
    def timed_query(_):
        q_start = time.time()
        db.query(np.random.rand(dim).tolist())
        return (time.time() - q_start) * 1000  # milliseconds

    with ThreadPoolExecutor(max_workers=qps_target) as executor:
        latencies = list(executor.map(timed_query, range(1000)))

    return {
        "insert_time_s": insert_time,
        "p50_latency_ms": np.percentile(latencies, 50),
        "p95_latency_ms": np.percentile(latencies, 95),
        "p99_latency_ms": np.percentile(latencies, 99),
    }
```
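Hypothetical usage, assuming `QdrantWrapper` is an adapter you write around your client library:

```python
import numpy as np

# QdrantWrapper is a hypothetical adapter exposing insert(vectors, batch_size=...) and query(vector)
db = QdrantWrapper(collection="bench")
vectors = np.random.rand(100_000, 1536).tolist()  # synthetic data at your target scale
print(benchmark_vector_db(db, vectors, qps_target=50))
```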
LLM Provider Evaluation
Key Criteria
| Criterion | Weight | Evaluation Method |
|---|---|---|
| Answer Quality | 35% | Human eval on 100 test cases |
| Cost | 30% | Price per 1M tokens × usage |
| Latency | 20% | Time to first token (TTFT) |
| Context Window | 10% | Max tokens for RAG context |
| Reliability | 5% | Uptime, rate limits |
Top Options (2024)
| Provider | Model | Context | Cost per 1M Tokens (Input / Output) | Best For |
|---|---|---|---|---|
| OpenAI | GPT-4o | 128k | $2.50 / $10.00 | General purpose |
| Anthropic | Claude 3.5 Sonnet | 200k | $3.00 / $15.00 | Long documents |
| Google | Gemini 1.5 Pro | 1M | $1.25 / $5.00 | Massive context |
| Groq | Llama 3 70B | 8k | $0.59 / $0.79 | Speed critical |
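To translate per-token prices into a per-query figure, multiply by your typical prompt and completion sizes. A rough sketch; the 4,000 input / 500 output token figures are assumptions, not measurements:

```python
# Per-query LLM cost = input_tokens / 1M × input_price + output_tokens / 1M × output_price
PRICES = {  # $ per 1M tokens (input, output), from the table above
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Pro": (1.25, 5.00),
    "Llama 3 70B (Groq)": (0.59, 0.79),
}

input_tokens, output_tokens = 4_000, 500  # assumed typical RAG prompt and answer sizes

for model, (price_in, price_out) in PRICES.items():
    cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
    print(f"{model}: ~${cost:.4f} per query")
```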
Evaluation Framework
```python
from statistics import mean

# 1. Define test cases
test_cases = [
    {
        "query": "What is our refund policy?",
        "context": "...",  # Retrieved documents
        "expected_answer": "...",
        "rubric": ["accuracy", "completeness", "citation"]
    },
    # ... 100 test cases
]

# 2. Run evaluation. `providers` maps each name to a thin wrapper with a
#    generate(query, context) method; human_eval() and calculate_cost() are
#    your own scoring and pricing helpers.
providers = {"openai": openai_client, "anthropic": anthropic_client, "google": gemini_client}

for name, provider in providers.items():
    scores = []
    for test in test_cases:
        answer = provider.generate(test["query"], test["context"])
        scores.append(human_eval(answer, test["expected_answer"], test["rubric"]))
    avg_score = mean(scores)
    cost = calculate_cost(name, test_cases)
    print(f"{name}: Score={avg_score:.2f}, Cost=${cost:.2f}")
```
Total Cost of Ownership (TCO)
5-Year TCO Calculation
TCO = Initial Setup + (Annual Operating Costs × 5)
Initial Setup:
- Engineering time: $50k-200k
- Infrastructure setup: $10k-50k
- Data preparation: $20k-100k
Annual Operating Costs:
- Embedding API: annual query volume × ~$0.0001 per query
- Vector DB: $1k-10k/year
- LLM API: annual query volume × ~$0.01 per query
- Infrastructure: $5k-50k/year
- Maintenance: 0.5 FTE × $150k = $75k
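As a sanity check, the breakdown above can be wired into a small helper. The defaults are the illustrative figures from this section; replace them with your own quotes:

```python
def five_year_tco(
    setup_engineering, setup_infra, setup_data_prep,   # one-time costs ($)
    queries_per_month,
    embed_price_per_query=0.0001,    # illustrative figure from this section
    llm_price_per_query=0.01,        # illustrative figure from this section
    vector_db_annual=5_000,
    infra_annual=20_000,
    maintenance_annual=75_000,       # 0.5 FTE × $150k
):
    initial = setup_engineering + setup_infra + setup_data_prep
    annual_api = queries_per_month * 12 * (embed_price_per_query + llm_price_per_query)
    annual = annual_api + vector_db_annual + infra_annual + maintenance_annual
    return initial + annual * 5

print(f"${five_year_tco(100_000, 20_000, 50_000, queries_per_month=1_000_000):,.0f}")
```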
Example: 1M Queries/Month
Managed Stack (Pinecone + OpenAI):
- Embeddings: 1M × $0.0001 = $100/month
- Vector DB: $200/month (Pinecone)
- LLM: 1M × $0.01 = $10,000/month
- Total: ~$125k/year
Self-Hosted Stack:
- Embeddings: Free (self-hosted)
- Vector DB: $500/month (EC2 + storage)
- LLM: $10,000/month (OpenAI)
- Engineering: $75k/year (0.5 FTE)
- Total: ~$200k/year
Conclusion: Managed is cheaper until ~5M queries/month.
Vendor Lock-In Mitigation
Abstraction Strategies
```python
# Use an abstraction layer so providers can be swapped without touching call sites;
# API keys are read from the providers' usual environment variables.
from abc import ABC, abstractmethod
from typing import List
import numpy as np

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, texts: List[str]) -> np.ndarray:
        pass

class OpenAIEmbeddings(EmbeddingProvider):
    def embed(self, texts):
        from openai import OpenAI
        resp = OpenAI().embeddings.create(model="text-embedding-3-large", input=texts)
        return np.array([d.embedding for d in resp.data])

class CohereEmbeddings(EmbeddingProvider):
    def embed(self, texts):
        import cohere
        resp = cohere.Client().embed(texts=texts, model="embed-english-v3.0", input_type="search_document")
        return np.array(resp.embeddings)

# Easy to swap providers
embedder = OpenAIEmbeddings()  # or CohereEmbeddings()
```
Data Portability
- Export vector DB monthly (backup)
- Store original documents separately
- Use standard formats (JSONL, Parquet)
- Document embedding model version
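A minimal export sketch, assuming your store exposes some way to iterate IDs, vectors, and metadata (`store.iter_all()` here is hypothetical):

```python
import json

# Dump vectors plus metadata to JSONL so they can be re-imported into another store
with open("vectors_backup.jsonl", "w") as f:
    for doc_id, vector, metadata in store.iter_all():   # hypothetical iterator over your store
        f.write(json.dumps({
            "id": doc_id,
            "vector": vector,
            "metadata": metadata,
            "embedding_model": "text-embedding-3-large",  # record which model produced the vector
        }) + "\n")
```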
Evaluation Checklist
Before selecting vendors:
- Benchmark on your actual data (not public datasets)
- Test at expected scale (10x current volume)
- Calculate 5-year TCO for top 3 options
- Review SLAs and uptime history
- Check data residency requirements
- Evaluate support quality (response time, expertise)
- Test migration path (can you switch later?)
- Verify security certifications (SOC2, ISO 27001)
Next Steps
- Decision Framework - Decide if RAG is right for you
- Team Structure - Build the right team
- Cost Optimization - Reduce ongoing costs