# Query Understanding and Classification

Discover user query patterns using topic modeling and build classification systems.
## Overview
Understanding how users interact with your RAG system reveals where improvements matter most. By applying topic modeling and classification to user queries, you can identify patterns that show which areas need attention. This data-driven approach prioritizes optimizations based on actual user needs rather than technical assumptions.
The methodology combines:
- Unsupervised learning to discover natural query clusters
- Supervised classification to track patterns over time
- Satisfaction analysis to identify high-impact improvement areas
## Why Query Understanding Matters
- Prioritize Improvements: Focus on high-volume, low-satisfaction query types
- Identify Patterns: Discover common user intents and pain points
- Monitor Changes: Track how query patterns evolve over time
- Data-Driven Decisions: Base optimizations on actual user behavior
## Key Concepts

### Topic Modeling with BERTopic
Unsupervised discovery of themes and patterns in user queries using:
- BERT Embeddings: Capture semantic meaning of queries
- UMAP: Dimensionality reduction preserving local and global structure
- HDBSCAN: Density-based clustering finding topics of varying sizes
### Persona-Based Analysis
Understanding different user types and communication styles:
- Technical users: Precise, jargon-heavy queries
- Casual users: Conversational, natural language queries
- Frustrated users: Repetitive or emotional queries
### Intent Classification
Categorizing queries by their underlying purpose:
- Informational: "What is...?", "How does...?"
- Navigational: "Where can I find...?"
- Transactional: "I want to...", "Help me with..."
## Implementation Guide

### 1. Generate Diverse Query Data
```python
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import List

class SyntheticQuery(BaseModel):
    query: str
    persona: str  # technical, casual, frustrated
    intent: str  # informational, navigational, transactional
    satisfaction_score: float  # 1-5 scale
    source_document: str

client = instructor.from_openai(OpenAI())

def generate_queries(documents: List[str]) -> List[SyntheticQuery]:
    """Generate diverse queries simulating different user behaviors.

    Produces one query per persona/intent pair for each document
    (nine queries per document).
    """
    queries = []
    personas = ["technical_user", "casual_user", "frustrated_user"]
    intents = ["informational", "navigational", "transactional"]

    for doc in documents:
        for persona in personas:
            for intent in intents:
                query = client.chat.completions.create(
                    model="gpt-4",
                    response_model=SyntheticQuery,
                    messages=[
                        {"role": "system", "content": f"Generate a {persona} query with {intent} intent"},
                        {"role": "user", "content": f"Document: {doc}"},
                    ],
                )
                queries.append(query)
    return queries
```
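A quick smoke test is worth running before scaling up. A minimal sketch — the example documents are hypothetical, and the DataFrame export just sets up the analysis steps below:

```python
import pandas as pd

# Hypothetical example documents; substitute your own corpus
docs = [
    "Our API supports OAuth2 authentication with refresh tokens.",
    "Billing runs on the first of each month; invoices are emailed as PDFs.",
]

synthetic = generate_queries(docs)
print(f"Generated {len(synthetic)} queries")  # 2 docs x 3 personas x 3 intents = 18

# Collect into a DataFrame for the satisfaction analysis in step 3
# (use .dict() instead of .model_dump() on Pydantic v1)
queries_df = pd.DataFrame([q.model_dump() for q in synthetic])
```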
### 2. Apply Topic Modeling
```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

def discover_query_topics(queries: List[str]):
    """Discover natural clusters in user queries"""
    # Embedding model: captures semantic meaning of each query
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

    # Dimensionality reduction: preserves local and global structure
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric='cosine',
    )

    # Clustering: density-based, finds topics of varying sizes
    hdbscan_model = HDBSCAN(
        min_cluster_size=10,
        metric='euclidean',
        cluster_selection_method='eom',
    )

    # Create and fit topic model
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        verbose=True,
    )
    topics, probs = topic_model.fit_transform(queries)
    return topic_model, topics
```
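Fitting on the query text from step 1 might look like this (a sketch, reusing the `queries_df` from the generation step):

```python
# Fit on the raw query strings; with a small synthetic set you may
# need to lower min_cluster_size above, or most queries will land
# in the -1 outlier topic
topic_model, topics = discover_query_topics(queries_df['query'].tolist())

# Inspect the discovered topics; topic -1 collects HDBSCAN outliers
print(topic_model.get_topic_info().head())
```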
### 3. Analyze Satisfaction by Topic
```python
import pandas as pd
import matplotlib.pyplot as plt

def analyze_satisfaction_by_topic(queries_df: pd.DataFrame, topics) -> pd.DataFrame:
    """Identify problematic query patterns"""
    # Add topics to dataframe
    queries_df['topic'] = topics

    # Calculate metrics per topic
    topic_analysis = queries_df.groupby('topic').agg(
        avg_satisfaction=('satisfaction_score', 'mean'),
        query_count=('query', 'count'),
    ).round(2)

    # Identify high-impact improvement areas
    # (high volume + low satisfaction)
    topic_analysis['priority'] = (
        topic_analysis['query_count'] / topic_analysis['avg_satisfaction']
    )

    # Highest-priority topics first
    high_priority = topic_analysis.sort_values('priority', ascending=False).head(10)
    return high_priority
```
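Wiring steps 2 and 3 together (a sketch, reusing `queries_df` and `topics` from above):

```python
high_priority = analyze_satisfaction_by_topic(queries_df, topics)
print(high_priority)
# Topics at the top combine high volume with low satisfaction --
# the best candidates for retrieval or prompt improvements.
```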
### 4. Build Query Classifier
```python
from pydantic import BaseModel, Field
from typing import Literal
import yaml

# Define classification taxonomy
class QueryCategory(BaseModel):
    primary_intent: Literal["informational", "navigational", "transactional"]
    topic_area: str = Field(description="Main topic of the query")
    complexity: Literal["simple", "moderate", "complex"]
    requires_context: bool = Field(description="Whether query needs conversation history")

def classify_query(query: str, taxonomy: dict) -> QueryCategory:
    """Classify incoming queries for monitoring"""
    client = instructor.from_openai(OpenAI())
    classification = client.chat.completions.create(
        model="gpt-4",
        response_model=QueryCategory,
        messages=[
            {"role": "system", "content": f"Classify queries according to: {yaml.dump(taxonomy)}"},
            {"role": "user", "content": f"Query: {query}"},
        ],
    )
    return classification
```
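The taxonomy is passed as a plain dict and rendered to YAML inside the system prompt. A hypothetical example — the categories below are illustrative, and in practice you would derive them from the topics discovered in step 2:

```python
# Hypothetical taxonomy; adapt to your own domain
taxonomy = {
    "topic_areas": ["authentication", "billing", "api_usage"],
    "intents": {
        "informational": "User wants to understand something",
        "navigational": "User wants to locate a page or document",
        "transactional": "User wants to perform an action",
    },
}

result = classify_query("How do I rotate my API keys?", taxonomy)
print(result.primary_intent, result.topic_area, result.complexity)
```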
## Visualization and Analysis

### Topic Distribution
```python
def visualize_topics(topic_model, queries):
    """Visualize topic relationships and distributions"""
    # Topic representation (count includes the -1 outlier topic)
    topic_info = topic_model.get_topic_info()
    print(f"Discovered {len(topic_info)} topics")

    # BERTopic's visualize_* methods return Plotly figures; capture
    # and show them explicitly, since outside a notebook the bare
    # calls are silently discarded
    topic_model.visualize_topics().show()            # inter-topic distances
    topic_model.visualize_hierarchy().show()         # topic hierarchy
    topic_model.visualize_documents(queries).show()  # document clustering
```
### Satisfaction Heatmap

```python
import seaborn as sns

def create_satisfaction_heatmap(queries_df: pd.DataFrame):
    """Visualize satisfaction across topics and personas"""
    pivot_table = queries_df.pivot_table(
        values='satisfaction_score',
        index='topic',
        columns='persona',
        aggfunc='mean',
    )

    plt.figure(figsize=(12, 8))
    sns.heatmap(pivot_table, annot=True, cmap='RdYlGn', center=3.0)
    plt.title('Average Satisfaction by Topic and Persona')
    plt.show()
```
## Production Monitoring System

```python
from typing import Callable, Optional

class QueryMonitor:
    """Monitor and classify queries in production"""

    def __init__(self, topic_model, classifier: Callable, taxonomy: dict):
        self.topic_model = topic_model
        self.classifier = classifier
        self.taxonomy = taxonomy
        self.query_log = []

    def process_query(self, query: str, satisfaction: Optional[float] = None):
        """Process incoming query and log for analysis"""
        # Classify query
        category = self.classifier(query, self.taxonomy)

        # Assign topic; BERTopic's transform returns (topics, probabilities)
        topic = self.topic_model.transform([query])[0][0]

        # Log for analysis (use .dict() instead of .model_dump() on Pydantic v1)
        self.query_log.append({
            'query': query,
            'topic': topic,
            'category': category.model_dump(),
            'satisfaction': satisfaction,
            'timestamp': pd.Timestamp.now(),
        })

    def get_insights(self):
        """Generate insights from collected data"""
        df = pd.DataFrame(self.query_log)

        # Identify topics trending over the past week
        recent = df[df['timestamp'] > pd.Timestamp.now() - pd.Timedelta(days=7)]
        trending = recent.groupby('topic').size().sort_values(ascending=False)

        # Identify problem areas, ignoring queries without feedback
        rated = df.dropna(subset=['satisfaction'])
        problems = rated.groupby('topic')['satisfaction'].mean().sort_values()

        return {'trending': trending, 'problems': problems}
```
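Putting the pieces together might look like the sketch below. The satisfaction scores are placeholders; in production they would come from real user feedback such as a thumbs-up/down widget:

```python
monitor = QueryMonitor(
    topic_model=topic_model,    # fitted in step 2
    classifier=classify_query,  # defined in step 4
    taxonomy=taxonomy,
)

monitor.process_query("Where do I download my invoice?", satisfaction=2.0)
monitor.process_query("How does token refresh work?")  # no feedback yet

insights = monitor.get_insights()
print(insights['trending'])
print(insights['problems'])
```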
## Expected Outcomes
After implementing query understanding, you'll have:
- Clear visibility into most common query patterns
- Identification of high-volume, low-satisfaction areas
- Data-driven priorities for RAG improvements
- Production monitoring system for ongoing insights
- Understanding of different user personas and their needs
## Common Patterns to Watch For

### High-Impact Issues
- Ambiguous Queries: Users asking vague questions that need clarification
- Out-of-Scope Queries: Questions your system can't answer
- Terminology Mismatches: Users using different terms than your documents
- Multi-Step Queries: Complex questions requiring multiple retrievals
### User Behavior Signals
- Repetitive queries: User dissatisfaction with initial responses
- Query reformulation: Users trying different phrasings for the same need (see the detection sketch after this list)
- Increasing specificity: Users adding more details after poor results
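Reformulation can be flagged with the same embedding model used for topic modeling: consecutive queries in one session that are semantically close but not identical usually mean the first answer missed. A minimal sketch — the per-session query list and the 0.8 threshold are assumptions to tune against your own data:

```python
from typing import List
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer('all-MiniLM-L6-v2')

def flag_reformulations(session_queries: List[str], threshold: float = 0.8):
    """Flag consecutive query pairs that look like rephrasings of the same need."""
    embeddings = _model.encode(session_queries, convert_to_tensor=True)
    flagged = []
    for i in range(len(session_queries) - 1):
        similarity = util.cos_sim(embeddings[i], embeddings[i + 1]).item()
        # Similar but not identical: likely a reformulation after a poor answer
        if threshold <= similarity < 0.99:
            flagged.append(
                (session_queries[i], session_queries[i + 1], round(similarity, 3))
            )
    return flagged
```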
## Next Steps
- Generate synthetic query dataset representing your users
- Apply BERTopic to discover natural query clusters
- Analyze correlation between topics and satisfaction
- Build classification system for production monitoring
- Use insights to prioritize retrieval improvements