Query Expansion
Improve recall by generating multiple variations of user queries
Overview
Query expansion improves retrieval recall by transforming a single user query into multiple related queries. This helps overcome the vocabulary mismatch problem: users often describe a concept in different words than the documents that contain it.
The Problem: Vocabulary Mismatch
Users and documents often use different terminology:
# User query
"How do I speed up my Python code?"
# Document titles that might be missed
- "Python performance optimization"
- "Accelerating Python programs"
- "Making Python run faster"
- "Python profiling and tuning"
A single embedding may not capture all these variations.
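To make the mismatch concrete, you can embed the query and the candidate titles and compare cosine similarities. A minimal sketch with sentence-transformers, using the same all-mpnet-base-v2 model as the retriever later on this page (exact scores will vary by model):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-mpnet-base-v2')

query = "How do I speed up my Python code?"
titles = [
    "Python performance optimization",
    "Accelerating Python programs",
    "Making Python run faster",
    "Python profiling and tuning",
]

# Scores differ across phrasings; titles far from the query's wording
# can fall below a retrieval cutoff even though they are relevant.
query_emb = model.encode(query)
title_embs = model.encode(titles)
scores = util.cos_sim(query_emb, title_embs)[0]
for title, score in zip(titles, scores):
    print(f"{float(score):.3f}  {title}")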
Query Expansion Strategies
1. Multi-Query Generation
Generate multiple perspectives of the same question:
from openai import OpenAI
from typing import List

client = OpenAI()

def generate_multiple_queries(query: str, num_queries: int = 3) -> List[str]:
    """Generate multiple variations of a query"""
    prompt = f"""You are an AI assistant that generates multiple search queries.
Given the original query, generate {num_queries} different variations that capture the same intent but use different wording.

Original query: {query}

Generate {num_queries} alternative queries (one per line):"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )

    queries = response.choices[0].message.content.strip().split('\n')
    queries = [q.strip('- ').strip() for q in queries if q.strip()]

    # Include the original query alongside the variations
    return [query] + queries[:num_queries]

# Example
original = "What is machine learning?"
expanded_queries = generate_multiple_queries(original)

print("Expanded Queries:")
for q in expanded_queries:
    print(f"  - {q}")
Output:
Expanded Queries:
- What is machine learning?
- How would you define machine learning?
- Can you explain the concept of machine learning?
- What does machine learning mean?
2. Query Decomposition
Break complex queries into simpler sub-queries:
def decompose_query(query: str) -> List[str]:
    """Decompose a complex query into sub-queries"""
    prompt = f"""Break down this complex query into 2-4 simpler, more specific sub-queries that together address the original question.

Original query: {query}

Sub-queries (one per line):"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )

    sub_queries = response.choices[0].message.content.strip().split('\n')
    sub_queries = [q.strip('- ').strip() for q in sub_queries if q.strip()]
    return sub_queries

# Example
complex_query = "How do I build a production-ready RAG system with good accuracy?"
sub_queries = decompose_query(complex_query)

print("Sub-queries:")
for q in sub_queries:
    print(f"  - {q}")
Output:
Sub-queries:
- What are the components of a RAG system?
- How do I improve RAG accuracy?
- What are production best practices for RAG?
- How do I deploy and scale a RAG system?
3. Step-Back Prompting
Generate broader, more general queries:
def generate_stepback_query(query: str) -> str:
    """Generate a broader, more general version of the query"""
    prompt = f"""Given a specific query, generate a broader, more general question that addresses the underlying concepts.

Specific query: {query}

General question:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content.strip()

# Example
specific = "How do I fix a CUDA out of memory error in PyTorch?"
general = generate_stepback_query(specific)

print(f"Specific: {specific}")
print(f"General: {general}")
Output:
Specific: How do I fix a CUDA out of memory error in PyTorch?
General: How does GPU memory management work in deep learning frameworks?
4. HyDE (Hypothetical Document Embeddings)
Generate hypothetical answers, then search for them:
def hyde_search(query: str, model, vector_db):
    """Search using hypothetical document embeddings"""
    # Step 1: Generate a hypothetical answer
    prompt = f"""Write a detailed answer to this question as if you were an expert:

Question: {query}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=200
    )
    hypothetical_doc = response.choices[0].message.content.strip()

    # Step 2: Search using the hypothetical answer
    doc_embedding = model.encode(hypothetical_doc)
    results = vector_db.search(doc_embedding).limit(5).to_list()
    return results

# This works because documents contain answers, not questions,
# so searching with answer-like text finds better matches.
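As a quick usage sketch, assuming the same embedding model and LanceDB table that the implementation below uses (the path and table name here are assumptions):

from sentence_transformers import SentenceTransformer
import lancedb

model = SentenceTransformer('all-mpnet-base-v2')
vector_db = lancedb.connect("./vector-db").open_table("documents")

results = hyde_search("What is transfer learning?", model, vector_db)
for r in results:
    print(r['id'])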
Implementation: Complete Query Expansion System
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import lancedb
from typing import List, Dict, Optional
import numpy as np

class QueryExpansionRetriever:
    def __init__(self, db_path: str = "./vector-db"):
        self.model = SentenceTransformer('all-mpnet-base-v2')
        self.db = lancedb.connect(db_path)
        self.client = OpenAI()

    def search_with_expansion(
        self,
        query: str,
        k: int = 5,
        expansion_method: Optional[str] = "multi-query"
    ) -> List[Dict]:
        """Search with query expansion"""
        # Step 1: Expand the query
        if expansion_method == "multi-query":
            queries = self.generate_multiple_queries(query, num_queries=3)
        elif expansion_method == "decompose":
            queries = self.decompose_query(query)
        elif expansion_method == "stepback":
            queries = [query, self.generate_stepback_query(query)]
        elif expansion_method == "hyde":
            return self.hyde_search(query, k)
        else:
            # No expansion (e.g. expansion_method=None): search with the query as-is
            queries = [query]

        # Step 2: Search with each query, deduplicating by document id
        all_results = []
        seen_ids = set()
        table = self.db.open_table("documents")  # open once, reuse across queries
        for q in queries:
            q_embedding = self.model.encode(q)
            results = table.search(q_embedding).limit(k).to_list()
            for result in results:
                if result['id'] not in seen_ids:
                    all_results.append(result)
                    seen_ids.add(result['id'])

        # Step 3: Re-rank by similarity to the original query
        if len(all_results) > k:
            all_results = self.rerank_by_original_query(query, all_results, k)
        return all_results[:k]

    def generate_multiple_queries(self, query: str, num_queries: int = 3) -> List[str]:
        """Generate query variations"""
        prompt = f"""Generate {num_queries} different variations of this query:

Query: {query}

Variations (one per line):"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        queries = response.choices[0].message.content.strip().split('\n')
        queries = [q.strip('- ').strip() for q in queries if q.strip()]
        return [query] + queries[:num_queries]

    def decompose_query(self, query: str) -> List[str]:
        """Break down a complex query"""
        prompt = f"""Break this query into 2-4 simpler sub-queries:

Query: {query}

Sub-queries:"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )
        sub_queries = response.choices[0].message.content.strip().split('\n')
        return [q.strip('- ').strip() for q in sub_queries if q.strip()]

    def generate_stepback_query(self, query: str) -> str:
        """Generate a broader query"""
        prompt = f"""Generate a broader, more general version:

Specific: {query}

General:"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )
        return response.choices[0].message.content.strip()

    def hyde_search(self, query: str, k: int) -> List[Dict]:
        """HyDE search"""
        # Generate a hypothetical answer
        prompt = f"""Write a detailed answer to: {query}"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200
        )
        hypothetical_doc = response.choices[0].message.content.strip()

        # Search with the hypothetical document
        doc_embedding = self.model.encode(hypothetical_doc)
        table = self.db.open_table("documents")
        return table.search(doc_embedding).limit(k).to_list()

    def rerank_by_original_query(
        self,
        query: str,
        results: List[Dict],
        k: int
    ) -> List[Dict]:
        """Re-rank results by similarity to the original query"""
        query_emb = self.model.encode(query)

        # Score each result by cosine similarity to the original query
        for result in results:
            doc_emb = np.array(result['vector'])
            similarity = np.dot(query_emb, doc_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
            )
            result['rerank_score'] = similarity

        # Sort by score, best first
        results.sort(key=lambda x: x['rerank_score'], reverse=True)
        return results[:k]

# Usage
retriever = QueryExpansionRetriever()

# Multi-query expansion
results = retriever.search_with_expansion(
    "How to optimize Python code?",
    k=5,
    expansion_method="multi-query"
)

# Query decomposition
results = retriever.search_with_expansion(
    "How do I build and deploy a production RAG system?",
    k=5,
    expansion_method="decompose"
)

# HyDE
results = retriever.search_with_expansion(
    "What is transfer learning?",
    k=5,
    expansion_method="hyde"
)
When to Use Each Method
Multi-Query Generation
- Best for: General questions with multiple valid phrasings
- Example: "What is machine learning?" → variations with synonyms
Query Decomposition
- Best for: Complex, multi-part questions
- Example: "How do I train and deploy a model?" → separate sub-queries
Step-Back Prompting
- Best for: Specific technical questions that need broader context
- Example: "CUDA out of memory error" → "GPU memory management"
HyDE
- Best for: Conceptual questions where documents contain answers
- Example: "What is RAG?" → generate answer-like text to find similar docs
Evaluation
def evaluate_expansion(
    test_queries: List[str],
    ground_truth: Dict[str, List[str]],
    expansion_method: str
):
    """Evaluate query expansion impact"""
    retriever = QueryExpansionRetriever()

    # Baseline (no expansion)
    baseline_recall = []
    for query in test_queries:
        results = retriever.search_with_expansion(query, k=5, expansion_method=None)
        result_ids = [r['id'] for r in results]
        relevant = ground_truth[query]
        recall = len(set(result_ids) & set(relevant)) / len(relevant)
        baseline_recall.append(recall)

    # With expansion
    expansion_recall = []
    for query in test_queries:
        results = retriever.search_with_expansion(
            query, k=5, expansion_method=expansion_method
        )
        result_ids = [r['id'] for r in results]
        relevant = ground_truth[query]
        recall = len(set(result_ids) & set(relevant)) / len(relevant)
        expansion_recall.append(recall)

    print(f"Baseline Recall@5: {np.mean(baseline_recall):.3f}")
    print(f"Expansion Recall@5: {np.mean(expansion_recall):.3f}")
    print(f"Improvement: {(np.mean(expansion_recall) - np.mean(baseline_recall)):.3f}")
Best Practices
- Combine with re-ranking: Expand queries to increase recall, then re-rank to maintain precision
- Limit expansion: 3-5 query variations is usually optimal
- Cache expansions: LLM calls are expensive; cache common queries (see the sketch after this list)
- A/B test: Different methods work better for different domains
- Monitor costs: Query expansion increases LLM API calls
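A minimal caching sketch using functools.lru_cache around the generate_multiple_queries function defined above; a production system might use a shared cache such as Redis instead:

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_expansion(query: str, num_queries: int = 3) -> tuple:
    # Return a tuple: lru_cache hands back the same object on every hit,
    # so an immutable value prevents callers from mutating the cache.
    return tuple(generate_multiple_queries(query, num_queries))

# The first call pays for the LLM request; repeats are served from memory
queries = list(cached_expansion("What is machine learning?"))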
Common Issues
1. Too Many Results
Problem: Expansion returns too many irrelevant documents
Solution:
- Reduce the number of query variations
- Use stricter re-ranking
- Apply a similarity threshold (see the sketch below)
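For the threshold, one simple option is to drop results whose re-rank score (the cosine similarity to the original query computed in rerank_by_original_query) falls below a cutoff; the 0.3 value here is an arbitrary starting point to tune per corpus:

MIN_SCORE = 0.3  # arbitrary starting point; tune on your own data

filtered = [r for r in results if r.get('rerank_score', 0.0) >= MIN_SCORE]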
2. Slow Performance
Problem: Multiple searches and LLM calls add latency
Solution:
- Run searches in parallel (see the sketch below)
- Cache query expansions
- Use a smaller/faster LLM for expansion
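A sketch of parallel searches with ThreadPoolExecutor, assuming the retriever instance and expanded queries list from the implementation above; whether threads help depends on how I/O-bound your vector store is:

from concurrent.futures import ThreadPoolExecutor

def search_one(q: str, k: int = 5):
    # Each worker embeds one expanded query and searches the table
    emb = retriever.model.encode(q)
    table = retriever.db.open_table("documents")
    return table.search(emb).limit(k).to_list()

with ThreadPoolExecutor(max_workers=4) as pool:
    per_query_results = list(pool.map(search_one, queries))

# Flatten before deduplication and re-ranking
all_results = [r for batch in per_query_results for r in batch]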
3. Loss of Precision
Problem: More results but lower relevance
Solution:
- Always re-rank by the original query
- Use a cross-encoder for final ranking (see the sketch below)
- Adjust the number of expansions based on query complexity
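For the cross-encoder, a sketch with sentence-transformers' CrossEncoder: it scores each (query, document) pair jointly, which is usually more precise than embedding cosine similarity. The 'text' field is an assumption about your table schema:

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score every candidate against the original query as a pair
pairs = [(query, r['text']) for r in all_results]  # assumes a 'text' column
scores = cross_encoder.predict(pairs)

for r, s in zip(all_results, scores):
    r['rerank_score'] = float(s)
all_results.sort(key=lambda r: r['rerank_score'], reverse=True)
top_k = all_results[:k]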
Next Steps
- Retrieval Fundamentals - Core vector search concepts
- MMR - Diversify search results
- Hybrid Search - Combine semantic and keyword search
- Parent Document Retrieval - Better context