Retrieval Fundamentals
Core concepts of vector search and semantic retrieval for RAG systems
Overview
Vector search (also called semantic search) is the foundation of modern RAG systems. Unlike keyword search that matches exact terms, vector search understands meaning and finds semantically similar content.
How Vector Search Works
1. Document Indexing
Convert documents to vectors and store them:
from sentence_transformers import SentenceTransformer
import lancedb
# Initialize embedding model
model = SentenceTransformer('all-mpnet-base-v2')
# Documents to index
documents = [
"Python is a programming language",
"Machine learning uses algorithms to learn patterns",
"RAG combines retrieval with generation"
]
# Create embeddings
embeddings = model.encode(documents)
# Store in vector database
db = lancedb.connect("./vector-db")
data = [
{"id": i, "text": doc, "vector": emb.tolist()}
for i, (doc, emb) in enumerate(zip(documents, embeddings))
]
table = db.create_table("docs", data)
2. Query Processing
Convert query to vector:
query = "What is Python?"
query_vector = model.encode(query)
3. Similarity Search
Find nearest neighbors in vector space:
# Search for top 5 most similar documents
results = table.search(query_vector).limit(5).to_list()
for result in results:
print(f"Score: {result['_distance']:.3f}")
print(f"Text: {result['text']}\n")
Distance Metrics
Cosine Similarity
Measures angle between vectors (most common):
import numpy as np
def cosine_similarity(vec1, vec2):
"""Calculate cosine similarity (-1 to 1)"""
dot_product = np.dot(vec1, vec2)
norm1 = np.linalg.norm(vec1)
norm2 = np.linalg.norm(vec2)
return dot_product / (norm1 * norm2)
# Higher is more similar
similarity = cosine_similarity(query_vector, doc_vector)
print(f"Similarity: {similarity:.3f}") # 0.85 = very similar
Properties:
- Range: -1 (opposite) to 1 (identical)
- Ignores magnitude; only direction matters (demonstrated in the sketch after this list)
- Best for normalized embeddings
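Because only the angle matters, rescaling a vector leaves the cosine score unchanged. A quick check using the cosine_similarity function above (the example vectors are arbitrary):
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 1.0, 0.5])

# Scaling a vector changes its magnitude but not its direction,
# so the cosine similarity is identical
print(cosine_similarity(v, w))       # ~0.641
print(cosine_similarity(10 * v, w))  # same value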
Euclidean Distance
Measures straight-line distance:
def euclidean_distance(vec1, vec2):
"""Calculate Euclidean distance"""
return np.linalg.norm(vec1 - vec2)
# Lower is more similar
distance = euclidean_distance(query_vector, doc_vector)
print(f"Distance: {distance:.3f}") # 0.2 = very similar
Properties:
- Range: 0 (identical) to ∞
- Considers magnitude
- Sensitive to vector length
Dot Product
Fast similarity measure for normalized vectors:
def dot_product_similarity(vec1, vec2):
"""Calculate dot product"""
return np.dot(vec1, vec2)
# For normalized vectors, equivalent to cosine similarity
similarity = dot_product_similarity(query_vector, doc_vector)
When to use: vectors are already normalized to unit length (many embedding models normalize their outputs, or offer an option such as normalize_embeddings=True in sentence-transformers).
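For unit-length vectors the denominator of the cosine formula is 1, so the dot product and cosine similarity coincide. A minimal check (the example vectors are arbitrary):
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 1.0, 0.5])

# Normalize both vectors to unit length
v_unit = v / np.linalg.norm(v)
w_unit = w / np.linalg.norm(w)

cosine = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
dot = np.dot(v_unit, w_unit)
print(np.isclose(cosine, dot))  # True: dot product of normalized vectors == cosine similarity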
Retrieval Strategies
Top-K Retrieval
Retrieve K most similar documents:
def retrieve_top_k(query, k=5):
"""Retrieve top K most similar documents"""
query_vector = model.encode(query)
results = table.search(query_vector).limit(k).to_list()
return results
# Get top 5 results
top_docs = retrieve_top_k("machine learning", k=5)
Use case: Standard retrieval for most RAG applications.
Threshold-Based Retrieval
Only return documents above similarity threshold:
def retrieve_above_threshold(query, threshold=0.7):
"""Retrieve documents above similarity threshold"""
query_vector = model.encode(query)
# Get more results than needed
results = table.search(query_vector).limit(100).to_list()
    # Filter by threshold (assumes cosine distance, where _distance = 1 - similarity)
    filtered = [
        r for r in results
        if r['_distance'] <= (1 - threshold)
]
return filtered
# Only highly relevant documents
relevant_docs = retrieve_above_threshold("Python programming", threshold=0.75)
Use case: When precision is more important than recall.
Adaptive Retrieval
Adjust K based on query complexity:
def adaptive_retrieve(query):
"""Adjust retrieval count based on query"""
query_vector = model.encode(query)
# Simple query: fewer results
if len(query.split()) <= 5:
k = 3
# Complex query: more results
else:
k = 10
results = table.search(query_vector).limit(k).to_list()
return results
Indexing Strategies
Flat Index
Store all vectors, search exhaustively:
# Simple, exact search
# Good for: < 1M vectors
# Speed: O(n) - linear with dataset size
table = db.create_table("docs", data)
results = table.search(query_vector).limit(5).to_list()
Pros:
- Exact results
- Simple to implement
- No training needed
Cons:
- Slow for large datasets
- Doesn't scale well beyond a few million vectors
Approximate Nearest Neighbor (ANN)
Trade accuracy for speed using indexes:
import faiss
import numpy as np

# Create an approximate FAISS index (HNSW graph; IndexFlatL2 would be exact)
dimension = 768
index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = graph neighbors per node

# Add vectors (FAISS expects float32)
index.add(np.asarray(embeddings, dtype='float32'))

# Search (approximate)
k = 5
distances, indices = index.search(np.asarray([query_vector], dtype='float32'), k)
Popular ANN algorithms:
- HNSW (Hierarchical Navigable Small World): fast and accurate, but high memory usage
- IVF (Inverted File Index): memory efficient, but requires a training step (see the sketch after this list)
- LSH (Locality Sensitive Hashing): very fast, but lower accuracy
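A sketch of an IVF index in FAISS, illustrating the training step mentioned above. It assumes embeddings is a matrix with at least nlist vectors (the tiny three-document example earlier would be too small to train on):
import faiss
import numpy as np

xb = np.asarray(embeddings, dtype='float32')
dimension = xb.shape[1]

# IVF clusters vectors into nlist cells and searches only a few cells per query
nlist = 100                               # number of clusters; tune to your corpus size
quantizer = faiss.IndexFlatL2(dimension)  # coarse quantizer that assigns vectors to cells
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Unlike flat or HNSW indexes, IVF must be trained to learn the cluster centroids
index.train(xb)
index.add(xb)

# nprobe = number of cells searched per query: higher is slower but more accurate
index.nprobe = 10
distances, indices = index.search(np.asarray([query_vector], dtype='float32'), 5)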
Partitioning
Split data into partitions for faster search:
def partition_by_category(documents):
"""Partition documents by category"""
partitions = {}
for doc in documents:
category = doc['category']
if category not in partitions:
partitions[category] = []
partitions[category].append(doc)
return partitions
# Search only relevant partition
def search_partition(query, category):
"""Search within specific partition"""
partition_table = db.open_table(f"docs_{category}")
query_vector = model.encode(query)
return partition_table.search(query_vector).limit(5).to_list()
Chunking for Retrieval
Fixed-Size Chunks
Split documents into fixed-length chunks:
def chunk_text(text, chunk_size=512, overlap=50):
"""Split text into overlapping chunks"""
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk = ' '.join(words[i:i + chunk_size])
chunks.append(chunk)
return chunks
# Example
document = "Long document text here..."
chunks = chunk_text(document, chunk_size=512, overlap=50)
# Index each chunk
for i, chunk in enumerate(chunks):
embedding = model.encode(chunk)
    # Store with metadata (store_chunk is a placeholder for your own storage layer)
    store_chunk(chunk, embedding, doc_id=doc_id, chunk_id=i)
Pros:
- Predictable size
- Simple to implement
Cons:
- May split semantic units
Semantic Chunks
Split by meaning (paragraphs, sections):
def semantic_chunk(text):
"""Split by paragraphs"""
# Split on double newlines
chunks = text.split('\n\n')
# Filter empty chunks
chunks = [c.strip() for c in chunks if c.strip()]
return chunks
chunks = semantic_chunk(document)
Pros:
- Preserves context
- Better semantic coherence
Cons:
- Variable chunk sizes
Retrieval Metrics
Recall@K
Share of all relevant documents that appear in the top K results:
def recall_at_k(retrieved, relevant, k):
"""Calculate Recall@K"""
top_k = retrieved[:k]
relevant_in_top_k = len(set(top_k) & set(relevant))
return relevant_in_top_k / len(relevant)
# Example
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = ['doc2', 'doc5', 'doc7']
recall = recall_at_k(retrieved, relevant, k=5)
print(f"Recall@5: {recall:.2%}") # 66.7% (2 out of 3)
MRR@K (Mean Reciprocal Rank)
The mean, across queries, of the reciprocal rank of the first relevant document. The helper below computes the reciprocal rank for a single query:
def mrr_at_k(retrieved, relevant, k):
    """Reciprocal rank of the first relevant document in the top K (single query)"""
top_k = retrieved[:k]
for i, doc in enumerate(top_k):
if doc in relevant:
return 1 / (i + 1)
return 0
# Example
mrr = mrr_at_k(retrieved, relevant, k=5)
print(f"MRR@5: {mrr:.3f}") # 0.500 (first relevant at position 2)
Precision@K
Share of the top K retrieved documents that are relevant:
def precision_at_k(retrieved, relevant, k):
"""Calculate Precision@K"""
top_k = retrieved[:k]
relevant_in_top_k = len(set(top_k) & set(relevant))
return relevant_in_top_k / k
precision = precision_at_k(retrieved, relevant, k=5)
print(f"Precision@5: {precision:.2%}") # 40% (2 out of 5)
Advanced Techniques
Re-ranking
Two-stage retrieval for better accuracy:
def two_stage_retrieval(query, k_retrieve=100, k_final=5):
"""Fast retrieval + slow re-ranking"""
# Stage 1: Fast retrieval (get more candidates)
query_vector = model.encode(query)
candidates = table.search(query_vector).limit(k_retrieve).to_list()
# Stage 2: Re-rank with better model
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Score query-document pairs
pairs = [[query, c['text']] for c in candidates]
scores = reranker.predict(pairs)
# Sort by score
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return [doc for doc, score in ranked[:k_final]]
Filtering
Combine vector search with metadata filters:
def filtered_search(query, filters, k=5):
"""Search with metadata filters"""
query_vector = model.encode(query)
# Build filter string
filter_conditions = []
if 'category' in filters:
filter_conditions.append(f"category = '{filters['category']}'")
if 'date_after' in filters:
filter_conditions.append(f"date > '{filters['date_after']}'")
where_clause = " AND ".join(filter_conditions)
# Search with filters
results = table.search(query_vector)\
.where(where_clause)\
.limit(k)\
.to_list()
return results
# Example
results = filtered_search(
"machine learning",
filters={'category': 'technical', 'date_after': '2024-01-01'},
k=5
)
Common Issues
1. Poor Retrieval Quality
Symptoms: Irrelevant results in top K
Solutions:
- Use better embedding model
- Fine-tune embeddings on your domain
- Increase chunk overlap
- Add metadata filtering
2. Slow Search
Symptoms: High latency for queries
Solutions:
- Use ANN index (HNSW, IVF)
- Reduce embedding dimensions
- Partition data by category
- Cache common queries (a minimal caching sketch follows this list)
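As a sketch of the last point, repeated queries can be served from a cache so they skip both embedding and vector search. This assumes exact query strings repeat often enough to be worth caching; the cache size and key scheme are choices to tune:
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_retrieve(query: str, k: int = 5):
    """Cache search results, keyed by the exact query string and k."""
    query_vector = model.encode(query)
    results = table.search(query_vector).limit(k).to_list()
    # Return an immutable copy so cached results cannot be mutated by callers
    return tuple((r['id'], r['text'], r['_distance']) for r in results)

first = cached_retrieve("machine learning", k=5)   # embeds + searches
second = cached_retrieve("machine learning", k=5)  # served from the cache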
3. Missing Relevant Documents
Symptoms: Low recall
Solutions:
- Increase K
- Lower similarity threshold
- Improve chunking strategy
- Use query expansion (see the sketch after this list)
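A minimal sketch of the query-expansion idea: search with the original query plus a few paraphrases and merge the results. The paraphrases here are hand-written for illustration; in practice they might come from a synonym list or an LLM:
def expanded_retrieve(query, expansions, k=5):
    """Search with the original query and its expansions, then merge by best distance."""
    merged = {}
    for q in [query] + expansions:
        query_vector = model.encode(q)
        for r in table.search(query_vector).limit(k).to_list():
            # Keep the best (smallest) distance seen for each document
            if r['id'] not in merged or r['_distance'] < merged[r['id']]['_distance']:
                merged[r['id']] = r
    # Return the overall top K across all query variants
    return sorted(merged.values(), key=lambda r: r['_distance'])[:k]

results = expanded_retrieve(
    "Python",
    expansions=["Python programming language", "writing code in Python"],
    k=5,
)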
Best Practices
1. Normalize Embeddings
import numpy as np
def normalize_embedding(embedding):
"""Normalize to unit length"""
return embedding / np.linalg.norm(embedding)
# Normalize once, then use the dot product for faster similarity search
normalized_emb = normalize_embedding(embedding)
2. Monitor Performance
import time
def monitor_search(query, k=5):
"""Track search performance"""
start = time.time()
results = retrieve_top_k(query, k)
latency = time.time() - start
    # Log metrics (log_metric is a placeholder for your monitoring system)
log_metric('search_latency', latency)
log_metric('results_count', len(results))
return results
3. A/B Test Changes
def ab_test_retrieval(query, variant='A'):
"""Test different retrieval strategies"""
if variant == 'A':
# Baseline
return retrieve_top_k(query, k=5)
else:
# New strategy
return two_stage_retrieval(query, k_retrieve=50, k_final=5)
Next Steps
- MMR (Maximal Marginal Relevance) - Diversify search results
- Hybrid Search - Combine semantic and keyword search
- Query Expansion - Retrieve more relevant results
- Parent Document Retrieval - Context without noise