Query Understanding and Classification

Discover user query patterns using topic modeling and build classification systems

Overview

Understanding how users interact with your RAG system reveals where improvements matter most. By applying topic modeling and classification to user queries, you can identify patterns that show which areas need attention. This data-driven approach prioritizes optimizations based on actual user needs rather than technical assumptions.

The methodology combines:

  1. Unsupervised learning to discover natural query clusters
  2. Supervised classification to track patterns over time
  3. Satisfaction analysis to identify high-impact improvement areas

Why Query Understanding Matters

  • Prioritize Improvements: Focus on high-volume, low-satisfaction query types
  • Identify Patterns: Discover common user intents and pain points
  • Monitor Changes: Track how query patterns evolve over time
  • Data-Driven Decisions: Base optimizations on actual user behavior

Key Concepts

Topic Modeling with BERTopic

Unsupervised discovery of themes and patterns in user queries using:

  • BERT Embeddings: Capture semantic meaning of queries
  • UMAP: Dimensionality reduction preserving local and global structure
  • HDBSCAN: Density-based clustering finding topics of varying sizes

Persona-Based Analysis

Understanding different user types and communication styles:

  • Technical users: Precise, jargon-heavy queries
  • Casual users: Conversational, natural language queries
  • Frustrated users: Repetitive or emotional queries
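
As a concrete illustration, the same underlying need can surface very differently across personas; the example queries below are hypothetical, not drawn from real logs:

# Hypothetical phrasings of one need ("find billing information") per persona
PERSONA_EXAMPLES = {
    "technical": "Which endpoint returns itemized billing records for Q3?",
    "casual": "How do I see what I've been charged?",
    "frustrated": "Why can I STILL not find my billing docs?? Third try now.",
}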

Intent Classification

Categorizing queries by their underlying purpose:

  • Informational: "What is...?", "How does...?"
  • Navigational: "Where can I find...?"
  • Transactional: "I want to...", "Help me with..."
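
Before reaching for an LLM classifier (step 4 below), a keyword heuristic can serve as a cheap baseline; the phrase lists are illustrative assumptions, not a validated taxonomy:

# Minimal keyword-based intent heuristic -- a baseline sketch, not production-grade
def guess_intent(query: str) -> str:
    q = query.lower()
    if any(p in q for p in ("where can i find", "where is", "link to")):
        return "navigational"
    if any(p in q for p in ("i want to", "help me")):
        return "transactional"
    return "informational"  # default bucket: "what is...", "how does..."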

Implementation Guide

1. Generate Diverse Query Data

import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import List

class SyntheticQuery(BaseModel):
    query: str
    persona: str  # technical, casual, frustrated
    intent: str  # informational, navigational, transactional
    satisfaction_score: float  # 1-5 scale
    source_document: str

client = instructor.from_openai(OpenAI())

def generate_queries(documents: List[str]) -> List[SyntheticQuery]:
    """Generate one query per persona/intent combination for each document"""
    queries = []
    
    personas = ["technical_user", "casual_user", "frustrated_user"]
    intents = ["informational", "navigational", "transactional"]
    
    for doc in documents:
        for persona in personas:
            for intent in intents:
                query = client.chat.completions.create(
                    model="gpt-4",
                    response_model=SyntheticQuery,
                    messages=[
                        {"role": "system", "content": f"Generate a {persona} query with {intent} intent"},
                        {"role": "user", "content": f"Document: {doc}"}
                    ]
                )
                queries.append(query)
    
    return queries
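
A quick usage sketch, assuming OPENAI_API_KEY is set in the environment; the two documents are placeholder strings:

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The API rate limit is 100 requests per minute per key.",
]

synthetic = generate_queries(documents)
print(f"Generated {len(synthetic)} queries")  # 2 docs x 3 personas x 3 intents = 18
print(synthetic[0].query, synthetic[0].persona, synthetic[0].intent)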

2. Apply Topic Modeling

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

def discover_query_topics(queries: List[str]):
    """Discover natural clusters in user queries"""
    
    # Configure embedding model
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Configure dimensionality reduction
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric='cosine'
    )
    
    # Configure clustering
    hdbscan_model = HDBSCAN(
        min_cluster_size=10,
        metric='euclidean',
        cluster_selection_method='eom'
    )
    
    # Create and fit topic model
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        verbose=True
    )
    
    topics, probs = topic_model.fit_transform(queries)
    
    return topic_model, topics
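
For example, fitting the model on the synthetic queries from step 1. Note that with min_cluster_size=10, HDBSCAN needs at least a few hundred queries to find stable clusters; on small samples most queries land in the outlier topic (-1):

query_texts = [q.query for q in synthetic]  # synthetic queries from step 1

topic_model, topics = discover_query_topics(query_texts)

# Inspect the result; the Topic == -1 row collects HDBSCAN outliers
print(topic_model.get_topic_info().head())
print(topic_model.get_topic(0))  # top terms and weights for topic 0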

3. Analyze Satisfaction by Topic

import pandas as pd
import matplotlib.pyplot as plt

def analyze_satisfaction_by_topic(queries_df, topics):
    """Identify problematic query patterns"""
    
    # Add topics to dataframe
    queries_df['topic'] = topics
    
    # Calculate metrics per topic (named aggregation keeps the columns flat)
    topic_analysis = queries_df.groupby('topic').agg(
        mean_satisfaction=('satisfaction_score', 'mean'),
        query_count=('query', 'count')
    ).round(2)
    
    # Identify high-impact improvement areas
    # (high volume + low satisfaction)
    topic_analysis['priority'] = (
        topic_analysis['query_count'] / topic_analysis['mean_satisfaction']
    )
    
    # Sort by priority
    high_priority = topic_analysis.sort_values('priority', ascending=False).head(10)
    
    return high_priority
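
Connecting this to the earlier steps; this assumes the dataframe is built from the SyntheticQuery objects, so the column names line up:

queries_df = pd.DataFrame([q.model_dump() for q in synthetic])

high_priority = analyze_satisfaction_by_topic(queries_df, topics)
print(high_priority)  # topics ranked by volume-to-satisfaction ratio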

4. Build Query Classifier

from pydantic import BaseModel, Field
from typing import Literal
import yaml

# Define classification taxonomy
class QueryCategory(BaseModel):
    primary_intent: Literal["informational", "navigational", "transactional"]
    topic_area: str = Field(description="Main topic of the query")
    complexity: Literal["simple", "moderate", "complex"]
    requires_context: bool = Field(description="Whether query needs conversation history")

def classify_query(query: str, taxonomy: dict) -> QueryCategory:
    """Classify incoming queries for monitoring"""
    
    client = instructor.from_openai(OpenAI())
    
    classification = client.chat.completions.create(
        model="gpt-4",
        response_model=QueryCategory,
        messages=[
            {"role": "system", "content": f"Classify queries according to: {yaml.dump(taxonomy)}"},
            {"role": "user", "content": f"Query: {query}"}
        ]
    )
    
    return classification
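
A usage sketch with a toy taxonomy; in practice you would derive the topic areas from the BERTopic clusters discovered in step 2 rather than hand-writing them:

# Hypothetical taxonomy for illustration only
taxonomy = {
    "topic_areas": ["billing", "api_usage", "account_settings"],
    "complexity_guide": "complex = needs multiple retrieval steps",
}

result = classify_query("Where can I find my last three invoices?", taxonomy)
print(result.primary_intent, result.topic_area, result.complexity)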

Visualization and Analysis

Topic Distribution

def visualize_topics(topic_model, queries):
    """Visualize topic relationships and distributions"""
    
    # Topic summary; the Topic == -1 row collects HDBSCAN outliers
    topic_info = topic_model.get_topic_info()
    n_topics = len(topic_info[topic_info.Topic != -1])
    print(f"Discovered {n_topics} topics")
    
    # Each visualize_* method returns a Plotly figure; show it explicitly
    topic_model.visualize_topics().show()
    
    # Visualize topic hierarchy
    topic_model.visualize_hierarchy().show()
    
    # Document clustering
    topic_model.visualize_documents(queries).show()

Satisfaction Heatmap

import seaborn as sns

def create_satisfaction_heatmap(queries_df):
    """Visualize satisfaction across topics and personas"""
    
    pivot_table = queries_df.pivot_table(
        values='satisfaction_score',
        index='topic',
        columns='persona',
        aggfunc='mean'
    )
    
    plt.figure(figsize=(12, 8))
    sns.heatmap(pivot_table, annot=True, cmap='RdYlGn', center=3.0)
    plt.title('Average Satisfaction by Topic and Persona')
    plt.show()

Production Monitoring System

class QueryMonitor:
    """Monitor and classify queries in production"""
    
    def __init__(self, topic_model, classifier, taxonomy):
        self.topic_model = topic_model
        self.classifier = classifier
        self.taxonomy = taxonomy
        self.query_log = []
    
    def process_query(self, query: str, satisfaction: float | None = None):
        """Process incoming query and log for analysis"""
        
        # Classify query
        category = self.classifier(query, self.taxonomy)
        
        # Assign topic (assumes the topic model has already been fitted)
        topic = self.topic_model.transform([query])[0][0]
        
        # Log for analysis
        self.query_log.append({
            'query': query,
            'topic': topic,
            'category': category.model_dump(),
            'satisfaction': satisfaction,
            'timestamp': pd.Timestamp.now()
        })
    
    def get_insights(self):
        """Generate insights from collected data"""
        df = pd.DataFrame(self.query_log)
        
        # Identify topics trending over the last week
        recent = df[df['timestamp'] > pd.Timestamp.now() - pd.Timedelta(days=7)]
        trending = recent.groupby('topic').size().sort_values(ascending=False)
        
        # Identify problem areas (lowest average satisfaction first),
        # skipping queries logged without a satisfaction score
        problems = (
            df.dropna(subset=['satisfaction'])
              .groupby('topic')['satisfaction']
              .mean()
              .sort_values()
        )
        
        return {'trending': trending, 'problems': problems}
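
Wiring the pieces together, using the fitted topic model from step 2, classify_query from step 4, and the toy taxonomy above; the satisfaction scores are made-up examples:

monitor = QueryMonitor(topic_model, classify_query, taxonomy)

monitor.process_query("How do I raise my API rate limit?", satisfaction=2.0)
monitor.process_query("Where can I find my invoices?", satisfaction=4.5)

insights = monitor.get_insights()
print(insights['trending'])  # query counts per topic over the last 7 days
print(insights['problems'])  # topics with the lowest average satisfaction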

Expected Outcomes

After implementing query understanding, you'll have:

  • Clear visibility into most common query patterns
  • Identification of high-volume, low-satisfaction areas
  • Data-driven priorities for RAG improvements
  • Production monitoring system for ongoing insights
  • Understanding of different user personas and their needs

Common Patterns to Watch For

High-Impact Issues

  1. Ambiguous Queries: Users asking vague questions that need clarification
  2. Out-of-Scope Queries: Questions your system can't answer
  3. Terminology Mismatches: Users using different terms than your documents
  4. Multi-Step Queries: Complex questions requiring multiple retrievals

User Behavior Signals

  • Repetitive queries: User dissatisfaction with initial responses
  • Query reformulation: Users trying different phrasings for the same need (see the sketch below)
  • Increasing specificity: Users adding more details after poor results
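
One way to surface the reformulation signal automatically is to compare consecutive queries within a session using the same embedding model as in step 2; the 0.8 similarity threshold is an assumption to tune on your own data:

from typing import List

from sentence_transformers import SentenceTransformer, util

reformulation_model = SentenceTransformer('all-MiniLM-L6-v2')

def find_reformulations(session_queries: List[str], threshold: float = 0.8):
    """Flag consecutive query pairs that are near-duplicates (likely reformulations)."""
    embeddings = reformulation_model.encode(session_queries, convert_to_tensor=True)
    flagged = []
    for i in range(len(session_queries) - 1):
        sim = util.cos_sim(embeddings[i], embeddings[i + 1]).item()
        if sim >= threshold:  # similar but rephrased: a dissatisfaction signal
            flagged.append((session_queries[i], session_queries[i + 1], round(sim, 3)))
    return flagged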

Next Steps

  • Generate synthetic query dataset representing your users
  • Apply BERTopic to discover natural query clusters
  • Analyze correlation between topics and satisfaction
  • Build classification system for production monitoring
  • Use insights to prioritize retrieval improvements

Additional Resources