Evaluating RAG Systems
Systematic evaluation framework using synthetic data, retrieval metrics, and statistical validation
Overview
Rather than randomly trying different retrieval techniques and hoping for improvement, you need a rigorous testing framework that measures performance objectively. This ensures every change you make is backed by data, not intuition.
The evaluation framework consists of three essential components:
- Synthetic question generation for creating challenging test cases
- Benchmarking tools for measuring retrieval performance across different approaches
- Statistical validation to ensure improvements are real rather than random variation
Key Concepts
Synthetic Question Generation
Creating realistic test questions that challenge retrieval systems without manual annotation. This involves:
- Using LLMs to generate diverse, challenging questions from your documents
- Ensuring question variety through randomized constraints (sketched after this list)
- Asynchronous processing with rate limiting for scale
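One lightweight way to get that variety is to sample a persona and question style into the generation prompt so the questions don't all sound alike. A minimal sketch, with hypothetical constraint pools you would tune to match your real users:

import random

# Hypothetical constraint pools; adjust these to reflect your actual users.
PERSONAS = ["new customer", "power user", "support agent"]
STYLES = ["short keyword query", "full-sentence question", "multi-part question"]

def build_generation_prompt(chunk_text: str) -> str:
    persona = random.choice(PERSONAS)
    style = random.choice(STYLES)
    return (
        f"You are a {persona}. Write one {style} that can only be answered "
        f"using the document below.\n\nDocument:\n{chunk_text}"
    )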
Retrieval Metrics
Quantitative measures that objectively assess retrieval quality:
- Recall@k: Measures whether the correct document appears in the top k results
  - Formula: (# of queries with correct doc in top k) / (total queries)
  - Higher is better; indicates the system finds relevant documents
- MRR@k (Mean Reciprocal Rank): Measures how highly the correct document is ranked
  - Formula: average of (1 / rank of the first correct document), counting 0 when it does not appear in the top k
  - Ranges from 0 to 1; higher means better ranking
  - Penalizes systems that rank correct documents lower
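As a toy example of how these formulas behave together, suppose three test queries whose correct chunks land at rank 1, rank 3, and not at all:

# Rank of the correct chunk for each query (None = not retrieved in the top k)
ranks = [1, 3, None]
k = 5

recall_at_k = sum(1 for r in ranks if r is not None and r <= k) / len(ranks)   # 2/3 ≈ 0.67
mrr_at_k = sum(1 / r for r in ranks if r is not None and r <= k) / len(ranks)  # (1 + 1/3 + 0) / 3 ≈ 0.44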
Statistical Validation
Using bootstrapping and significance tests to ensure improvements aren't due to random chance:
- Bootstrapping: Resampling your results with replacement to estimate how much a metric would vary across repeated experiments
- Confidence Intervals: Quantifying uncertainty in your measurements
- T-tests: Determining if differences between approaches are statistically significant
Implementation Guide
1. Generate Synthetic Questions
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class SyntheticQuestion(BaseModel):
    question: str = Field(description="A challenging question about the document")
    expected_chunk_id: str = Field(description="ID of the chunk that answers this")
    difficulty: str = Field(description="easy, medium, or hard")

client = instructor.from_openai(OpenAI())

def generate_questions(chunks, num_questions=100):
    questions = []
    for chunk in chunks[:num_questions]:  # one question per chunk, capped at num_questions
        response = client.chat.completions.create(
            model="gpt-4",
            response_model=SyntheticQuestion,
            messages=[
                {"role": "system", "content": "Generate challenging questions"},
                {"role": "user", "content": f"Document: {chunk.content}"}
            ]
        )
        # Attach the ground-truth ID from the chunk itself rather than trusting
        # the model to produce it (assumes each chunk exposes an .id attribute).
        response.expected_chunk_id = chunk.id
        questions.append(response)
    return questions
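A typical call, assuming `chunks` comes from your own chunking pipeline and each chunk exposes `id` and `content`; persisting the questions keeps every benchmark run on the same evaluation set:

import json

questions = generate_questions(chunks, num_questions=100)

# Save the eval set so later runs are scored on identical questions.
with open("synthetic_questions.json", "w") as f:
    json.dump([q.model_dump() for q in questions], f, indent=2)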
2. Benchmark Retrieval Strategies
import lancedb
from typing import List

def benchmark_retrieval(questions: List[SyntheticQuestion], vector_db, k: int = 5):
    """Score one retrieval configuration.

    vector_db is a LanceDB table with an embedding function attached, so
    .search() accepts raw query text.
    """
    results = []
    for question in questions:
        # Retrieve top k documents
        retrieved = vector_db.search(question.question).limit(k).to_list()

        # Check if the correct chunk is in the results
        chunk_ids = [doc['id'] for doc in retrieved]
        is_correct = question.expected_chunk_id in chunk_ids

        # Calculate the rank if found
        rank = chunk_ids.index(question.expected_chunk_id) + 1 if is_correct else None

        results.append({
            'question': question.question,
            'recall': 1 if is_correct else 0,
            'reciprocal_rank': 1 / rank if rank else 0
        })

    # Calculate overall metrics
    recall_at_k = sum(r['recall'] for r in results) / len(results)
    mrr_at_k = sum(r['reciprocal_rank'] for r in results) / len(results)
    return recall_at_k, mrr_at_k, results
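Run the benchmark once per retrieval configuration so the same questions score every approach; `baseline_table` and `reranked_table` below are hypothetical LanceDB tables for two configurations:

baseline_recall, baseline_mrr, baseline_results = benchmark_retrieval(questions, baseline_table, k=5)
improved_recall, improved_mrr, improved_results = benchmark_retrieval(questions, reranked_table, k=5)

print(f"Baseline: Recall@5={baseline_recall:.3f}  MRR@5={baseline_mrr:.3f}")
print(f"Reranked: Recall@5={improved_recall:.3f}  MRR@5={improved_mrr:.3f}")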
3. Statistical Validation
import numpy as np
from scipy import stats

def bootstrap_confidence_interval(scores, num_bootstrap=1000, confidence=0.95):
    """Calculate a confidence interval using bootstrapping"""
    bootstrap_means = []
    for _ in range(num_bootstrap):
        # Resample with replacement
        sample = np.random.choice(scores, size=len(scores), replace=True)
        bootstrap_means.append(np.mean(sample))

    # Calculate percentiles for the confidence interval
    lower = (1 - confidence) / 2
    upper = 1 - lower
    ci = np.percentile(bootstrap_means, [lower * 100, upper * 100])
    return ci

def compare_approaches(baseline_scores, improved_scores):
    """Compare two approaches using a paired t-test.

    Both score lists must come from the same questions in the same order.
    """
    t_stat, p_value = stats.ttest_rel(baseline_scores, improved_scores)
    is_significant = p_value < 0.05
    improvement = np.mean(improved_scores) - np.mean(baseline_scores)
    return {
        'improvement': improvement,
        'p_value': p_value,
        'significant': is_significant
    }
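The per-question `recall` values from the benchmark (see the comparison example above) are exactly the score lists these helpers expect; because both lists cover the same questions in the same order, the paired t-test applies:

baseline_scores = [r['recall'] for r in baseline_results]
improved_scores = [r['recall'] for r in improved_results]

ci = bootstrap_confidence_interval(baseline_scores)
print(f"Baseline Recall@5 95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")

comparison = compare_approaches(baseline_scores, improved_scores)
print(f"Improvement: {comparison['improvement']:.3f} (p={comparison['p_value']:.4f}, significant={comparison['significant']})")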
Experiment Tracking
Use tools like Braintrust to systematically record and compare different approaches:
import braintrust

experiment = braintrust.init(project="rag-evaluation")

# Inside the benchmark loop: log one row per question, where `recall` is the
# per-question 0/1 hit and `mrr` is its reciprocal rank.
experiment.log(
    input={"query": question.question},
    output=retrieved,
    scores={"recall@5": recall, "mrr@5": mrr},
    metadata={"strategy": "baseline"}
)
Expected Outcomes
After implementing this evaluation framework, you'll have:
- A robust synthetic question generation pipeline producing hundreds of diverse test cases
- Clear performance metrics showing baseline retrieval capabilities
- Statistical validation tools proving which improvements are significant
- Experience with modern ML experiment tracking and visualization
Common Issues and Solutions
Issue: Rate limiting when generating synthetic questions
Solution: Implement exponential backoff and use async rate limiting patterns. Consider using open-source models deployed on Modal for higher throughput.
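A minimal sketch of the backoff half of that advice using the `tenacity` library, retrying only on OpenAI rate-limit errors (swap in your provider's exception type if it differs):

import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_exponential(multiplier=1, min=1, max=60),  # exponential backoff between retries, capped at 60s
    stop=stop_after_attempt(6),
)
def generate_with_backoff(chunk):
    # Reuses the instructor-patched client and SyntheticQuestion model from step 1.
    return client.chat.completions.create(
        model="gpt-4",
        response_model=SyntheticQuestion,
        messages=[
            {"role": "system", "content": "Generate challenging questions"},
            {"role": "user", "content": f"Document: {chunk.content}"}
        ]
    )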
Issue: Inconsistent results between benchmark runs
Solution: Ensure you're using the same random seed for reproducibility. Check that your evaluation dataset hasn't changed between runs.
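For example, pin the seeds at the top of your benchmark script so bootstrap resampling (and any sampling of evaluation questions) is repeatable:

import random
import numpy as np

random.seed(42)
np.random.seed(42)  # makes np.random.choice in the bootstrap reproducible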
Issue: Statistical tests showing no significant differences
Solution: Increase your sample size or ensure your test cases are sufficiently challenging to reveal performance differences.
Common Questions
"Should I use DSpy for prompt optimization?"
DSpy can be useful for specific cases like 35-class classification tasks where you only care about accuracy. However, for most RAG systems, manually tweaking prompts builds more valuable intuition.
The real insight comes from looking at data, understanding customer needs, and identifying system mistakes. Your product isn't just a prompt—it includes how you collect feedback, set UI expectations, extract data, and represent chunks in context.
From Production: "If I'm building a model to extract sales insights from a transcript, I don't have a dataset of 'here's all the sales insights.' The real work is extracting everything and hand-labeling some stuff. Because these tasks are very hard to hill-climb, tools like DSPy don't work as well."
When DSPy makes sense:
- You have very specific evaluations (e.g., classification accuracy)
- You're building LLM-as-judge systems and want to align with your grading
- You have 100+ labeled examples and clear success metrics
"What really matters for evaluation metrics?"
The absolute number matters less than the direction of change. It's like weighing yourself—the scale might vary, but if you've gained two pounds, you've definitely gained two pounds.
Focus on whether interventions move metrics in a positive direction. If adding a re-ranker improves Recall@5 from 0.75 to 0.82, that seven-percentage-point improvement is meaningful regardless of the absolute values.
"How do I handle one-to-one answer scenarios?"
For a question with exactly one correct chunk, Recall@k on that single query is binary: 1 if k is large enough to capture the answer, 0 otherwise. The metrics become more meaningful when:
- There are multiple relevant documents
- You're analyzing trends across many queries
- You're comparing different retrieval methods
Even with one-to-one mappings, MRR (Mean Reciprocal Rank) is still useful to see where the correct answer appears in your results.
"How do I avoid synthetic data distribution mismatch?"
Check these simple statistics (a quick sketch of the first two checks follows the tip below):
- Character counts: If customer questions average 30 characters but synthetic questions average 90, the LLM is being too verbose
- Embedding spread: Check whether synthetic question embeddings cluster too tightly, i.e. are too similar to each other
- Question patterns: Ensure synthetic questions match real user query patterns
Practical Tip: Incorporate real-world examples from users into your few-shot examples for synthetic data generation to create more diverse, realistic questions.
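A quick sketch of the first two checks, assuming `real_questions` and `synthetic_questions` are lists of strings and `embed` is whatever embedding function your retriever already uses:

import numpy as np

def length_stats(questions):
    lengths = [len(q) for q in questions]
    return np.mean(lengths), np.std(lengths)

def avg_pairwise_cosine(questions, embed):
    # Values close to 1.0 mean the questions are near-duplicates in embedding space.
    vectors = np.array([embed(q) for q in questions])
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = vectors @ vectors.T
    n = len(questions)
    return (sims.sum() - n) / (n * (n - 1))  # average over off-diagonal pairs only

print("real questions (mean, std chars):     ", length_stats(real_questions))
print("synthetic questions (mean, std chars):", length_stats(synthetic_questions))
print("synthetic avg pairwise cosine:", avg_pairwise_cosine(synthetic_questions, embed))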
Next Steps
- Set up experiment tracking with Braintrust or similar tools
- Generate your first synthetic evaluation dataset
- Establish baseline metrics before making any optimizations
- Use these metrics to guide all future RAG improvements