Agentic RAG: Tool Selection and Orchestration

Build multi-tool RAG systems with systematic evaluation and improvement

Overview

Modern AI applications rarely rely on retrieval alone—they need to search databases, call APIs, execute code, and combine multiple information sources. This guide applies the same systematic improvement methodology to tool selection that you've learned for retrieval optimization.

The approach follows the evaluation flywheel:

  1. Establish metrics to measure tool selection quality
  2. Generate test cases that expose weaknesses
  3. Implement improvements (system prompts, few-shot examples)
  4. Measure gains and iterate

Why Tool Orchestration Matters

  • Expand Capabilities: Handle queries requiring multiple data sources
  • Complex Workflows: Coordinate multi-step operations
  • Specialization: Use best tool for each subtask
  • Real-World Applications: Most production systems need more than just retrieval

Key Concepts

Tool Selection Metrics

Adapt precision and recall for measuring correct tool choice:

  • Precision: Of all tools selected, how many were correct?

    • Formula: (correctly selected tools) / (total selected tools)
    • Penalizes over-selection
  • Recall: Of all needed tools, how many were selected?

    • Formula: (correctly selected tools) / (total needed tools)
    • Penalizes under-selection
  • F1 Score: Harmonic mean balancing precision and recall

    • Formula: 2 * (precision * recall) / (precision + recall)
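
As a worked example: if a query needs {sql_query, calculator} but the model selects {sql_query, web_search}, precision = 1/2, recall = 1/2, and F1 = 0.5. The same computation as a minimal Python sketch:

# Worked example: score tool selection for a single query
expected = {"sql_query", "calculator"}
selected = {"sql_query", "web_search"}

correct = expected & selected                       # {"sql_query"}
precision = len(correct) / len(selected)            # 0.5
recall = len(correct) / len(expected)               # 0.5
f1 = 2 * precision * recall / (precision + recall)  # 0.5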

Execution Strategies

Parallel Execution: Run multiple tools simultaneously

  • Faster for independent operations
  • More complex error handling
  • Higher token usage

Sequential Execution: Chain tools one after another

  • Easier to manage dependencies
  • Lower token usage
  • Better for context-dependent tasks

Implementation Guide

1. Define Tools with Pydantic

from pydantic import BaseModel, Field
from typing import Literal, List
from enum import Enum

class ToolName(str, Enum):
    VECTOR_SEARCH = "vector_search"
    SQL_QUERY = "sql_query"
    WEB_SEARCH = "web_search"
    CALCULATOR = "calculator"
    API_CALL = "api_call"

class ToolSelection(BaseModel):
    tools: List[ToolName] = Field(description="Tools needed for this query")
    reasoning: str = Field(description="Why these tools were chosen")
    execution_order: Literal["parallel", "sequential"] = Field(
        description="How to execute the tools"
    )

# Tool definitions
TOOL_DEFINITIONS = {
    "vector_search": {
        "description": "Search internal documentation using semantic similarity",
        "use_when": "User asks about company policies, product docs, or internal knowledge"
    },
    "sql_query": {
        "description": "Query structured database for user data, orders, analytics",
        "use_when": "User asks about their account, order status, or needs data aggregation"
    },
    "web_search": {
        "description": "Search the internet for current information",
        "use_when": "Query requires recent information or external knowledge"
    },
    "calculator": {
        "description": "Perform mathematical calculations",
        "use_when": "User needs mathematical operations or conversions"
    }
}

2. Implement Tool Selector

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

def select_tools(query: str, tool_definitions: dict) -> ToolSelection:
    """Select appropriate tools for a given query"""
    
    # Format tool descriptions
    tools_desc = "\n".join([
        f"{name}: {info['description']} - Use when: {info['use_when']}"
        for name, info in tool_definitions.items()
    ])
    
    selection = client.chat.completions.create(
        model="gpt-4",
        response_model=ToolSelection,
        messages=[
            {"role": "system", "content": f"""
                You are a tool selection expert. Choose the appropriate tools for each query.
                
                Available tools:
                {tools_desc}
                
                Select only the tools actually needed. Avoid over-selection.
            """},
            {"role": "user", "content": f"Query: {query}"}
        ]
    )
    
    return selection
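
A quick usage check (the printed values are illustrative, not guaranteed):

selection = select_tools("What's the status of order #12345?", TOOL_DEFINITIONS)
print(selection.tools)            # e.g. [ToolName.SQL_QUERY]
print(selection.execution_order)  # e.g. "sequential"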

3. Execute Tools

import asyncio
from typing import Dict, Any, List

async def execute_tool(tool_name: str, query: str) -> Dict[str, Any]:
    """Execute a single tool"""
    
    # Dispatch to the matching implementation; vector_search, sql_query,
    # web_search, and calculator are assumed to be async functions
    # implemented elsewhere
    if tool_name == "vector_search":
        return await vector_search(query)
    elif tool_name == "sql_query":
        return await sql_query(query)
    elif tool_name == "web_search":
        return await web_search(query)
    elif tool_name == "calculator":
        return await calculator(query)
    
    raise ValueError(f"Unknown tool: {tool_name}")

async def execute_parallel(tools: List[str], query: str):
    """Execute tools in parallel"""
    tasks = [execute_tool(tool, query) for tool in tools]
    results = await asyncio.gather(*tasks)
    return dict(zip(tools, results))

async def execute_sequential(tools: List[str], query: str):
    """Execute tools sequentially"""
    results = {}
    context = query
    
    for tool in tools:
        result = await execute_tool(tool, context)
        results[tool] = result
        # Update context with previous results
        context = f"{query}\n\nPrevious results: {result}"
    
    return results
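
The selector and executors compose into a single entry point. A minimal sketch, reusing select_tools from step 2 and the executors above:

async def answer_query(query: str) -> Dict[str, Any]:
    """Select tools for a query, then run them with the chosen strategy"""
    selection = select_tools(query, TOOL_DEFINITIONS)
    tool_names = [t.value for t in selection.tools]

    if not tool_names:
        return {}  # conversational query - no tools needed

    if selection.execution_order == "parallel":
        return await execute_parallel(tool_names, query)
    return await execute_sequential(tool_names, query)

# results = asyncio.run(answer_query("What's the status of order #12345?"))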

4. Evaluation Framework

from dataclasses import dataclass
from typing import Set

@dataclass
class ToolSelectionExample:
    query: str
    expected_tools: Set[str]
    selected_tools: Set[str]

def calculate_metrics(examples: List[ToolSelectionExample]):
    """Calculate precision, recall, F1 for tool selection"""
    
    total_precision = 0
    total_recall = 0
    
    for example in examples:
        expected = example.expected_tools
        selected = example.selected_tools
        
        if len(selected) == 0:
            # Selecting nothing is only correct when no tools were needed
            precision = 1 if len(expected) == 0 else 0
        else:
            correct = expected.intersection(selected)
            precision = len(correct) / len(selected)
        
        if len(expected) == 0:
            recall = 1  # No tools were needed, so nothing could be missed
        else:
            correct = expected.intersection(selected)
            recall = len(correct) / len(expected)
        
        total_precision += precision
        total_recall += recall
    
    avg_precision = total_precision / len(examples)
    avg_recall = total_recall / len(examples)
    f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0
    
    return {
        'precision': avg_precision,
        'recall': avg_recall,
        'f1': f1
    }
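
To establish a baseline, run the selector over labeled queries and score the results. A sketch, assuming a small hand-labeled set:

labeled = [
    ("What's the status of order #12345?", {"sql_query"}),
    ("Thanks for your help!", set()),
]

examples = []
for query, expected in labeled:
    selection = select_tools(query, TOOL_DEFINITIONS)
    selected = {t.value for t in selection.tools}
    examples.append(ToolSelectionExample(query, expected, selected))

print(calculate_metrics(examples))  # {'precision': ..., 'recall': ..., 'f1': ...}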

5. Generate Test Cases

class TestQuery(BaseModel):
    query: str
    expected_tools: List[ToolName]
    difficulty: Literal["easy", "medium", "hard"]
    failure_mode: str

def generate_test_cases(tool_definitions: dict, num_cases: int = 100):
    """Generate synthetic test cases targeting failure modes"""
    
    test_cases = client.chat.completions.create(
        model="gpt-4",
        response_model=List[TestQuery],
        messages=[
            {"role": "system", "content": f"""
                Generate {num_cases} test queries covering these failure modes:
                1. Ambiguous queries (could use multiple tools)
                2. Multi-step queries (need sequential execution)
                3. Queries needing no tools (conversational)
                4. Complex queries requiring 3+ tools
                5. Context-dependent queries
                
                Tools available: {list(tool_definitions.keys())}
            """},
            {"role": "user", "content": "Generate diverse test cases"}
        ]
    )
    
    return test_cases

Improvement Strategies

1. System Prompts

IMPROVED_SYSTEM_PROMPT = """
You are a tool selection expert. Your job is to identify the minimal set of tools needed.

CRITICAL RULES:
1. Only select tools actually required - DO NOT over-select
2. Consider if the query can be answered without tools (conversational)
3. For multi-step queries, select tools in the order they'll be used
4. If a query is ambiguous, select the most likely tool set

Tool Selection Guidelines:
- vector_search: ONLY for internal company knowledge
- sql_query: ONLY for structured data queries (orders, users, analytics)
- web_search: ONLY for current events or external information
- calculator: ONLY for mathematical operations

Common Mistakes to Avoid:
- Don't use vector_search for general knowledge questions
- Don't use sql_query for questions that don't need database access
- Don't select multiple tools when one suffices
"""

2. Few-Shot Examples

FEW_SHOT_EXAMPLES = [
    {
        "query": "What's the status of order #12345?",
        "tools": ["sql_query"],
        "reasoning": "Needs database access for order info"
    },
    {
        "query": "What is our return policy?",
        "tools": ["vector_search"],
        "reasoning": "Internal policy document retrieval"
    },
    {
        "query": "Find blue shirts under $50 from our catalog",
        "tools": ["vector_search"],
        "reasoning": "Product search with filters"
    },
    {
        "query": "What's trending on social media about our brand?",
        "tools": ["web_search"],
        "reasoning": "Current external information"
    },
    {
        "query": "Thanks for your help!",
        "tools": [],
        "reasoning": "Conversational - no tools needed"
    }
]

def select_tools_with_examples(query: str):
    """Tool selection with few-shot examples"""
    
    # Format examples
    examples_text = "\n\n".join([
        f"Query: {ex['query']}\nTools: {ex['tools']}\nReasoning: {ex['reasoning']}"
        for ex in FEW_SHOT_EXAMPLES
    ])
    
    selection = client.chat.completions.create(
        model="gpt-4",
        response_model=ToolSelection,
        messages=[
            {"role": "system", "content": IMPROVED_SYSTEM_PROMPT},
            {"role": "user", "content": f"Examples:\n{examples_text}\n\nNow classify: {query}"}
        ]
    )
    
    return selection

Performance Analysis

Per-Tool Metrics

def analyze_per_tool_performance(examples):
    """Identify which tools are selected incorrectly"""
    
    tool_stats = {}
    
    for example in examples:
        for tool in example.expected_tools:
            if tool not in tool_stats:
                tool_stats[tool] = {'correct': 0, 'missed': 0, 'false_positive': 0}
            
            if tool in example.selected_tools:
                tool_stats[tool]['correct'] += 1
            else:
                tool_stats[tool]['missed'] += 1
        
        for tool in example.selected_tools:
            if tool not in example.expected_tools:
                if tool not in tool_stats:
                    tool_stats[tool] = {'correct': 0, 'missed': 0, 'false_positive': 0}
                tool_stats[tool]['false_positive'] += 1
    
    # Calculate per-tool precision/recall
    for tool, stats in tool_stats.items():
        precision = stats['correct'] / (stats['correct'] + stats['false_positive']) if (stats['correct'] + stats['false_positive']) > 0 else 0
        recall = stats['correct'] / (stats['correct'] + stats['missed']) if (stats['correct'] + stats['missed']) > 0 else 0
        
        print(f"{tool}: Precision={precision:.2%}, Recall={recall:.2%}")

Expected Improvements

With systematic optimization:

  • Baseline: ~40-50% F1 score
  • With system prompts: ~60-70% F1 score (+50% improvement)
  • With few-shot examples: ~75-85% F1 score (+75-100% improvement)

Common Issues

Over-Selection

Problem: System selects too many tools

Solution: Emphasize minimalism in the system prompt and add negative examples, as in the sketch below
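
One way to encode negative examples, mirroring the structure of FEW_SHOT_EXAMPLES above (the queries and tool choices here are illustrative):

# Show the model common over-selections and the correct minimal choice
NEGATIVE_EXAMPLES = [
    {
        "query": "What is 15% of 200?",
        "wrong_tools": ["calculator", "web_search"],
        "correct_tools": ["calculator"],
        "reasoning": "Pure arithmetic - web_search adds nothing"
    },
    {
        "query": "Hello, how are you?",
        "wrong_tools": ["vector_search"],
        "correct_tools": [],
        "reasoning": "Conversational - no tools needed"
    }
]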

Under-Selection

Problem: System misses necessary tools

Solution: Add few-shot examples covering multi-tool scenarios

Context Confusion

Problem: Wrong tools for ambiguous queries

Solution: Add clarification step or provide more context in system prompt

Advanced Pattern: Plan-then-Execute

From Production: "Instead of having the language model immediately execute functions one at a time, prompt it to show the entire plan to the user and potentially ask for confirmation."

The Workflow:

  1. User: "Book me a flight to NYC next Tuesday."
  2. Agent (Planning): "I'll check flights for next Tuesday. Do you have a preferred airline or time?" (Returns plan, doesn't call API yet).
  3. User: "United, morning."
  4. Agent (Execution): Calls search_flights(airline='United', time='morning').

Benefits:

  • Safety: User confirms actions before they happen.
  • Data: User-approved plans become excellent training data for fine-tuning.
  • Accuracy: Separating planning from execution reduces hallucination.
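
A minimal sketch of the pattern, assuming a hypothetical Plan schema and a console confirmation step (a real system would confirm through its own UI):

from typing import Optional

class PlannedCall(BaseModel):
    tool: ToolName
    arguments: Dict[str, Any]
    rationale: str

class Plan(BaseModel):
    steps: List[PlannedCall]
    clarifying_question: Optional[str] = Field(
        default=None, description="Question to ask before executing, if any"
    )

def plan_then_execute(query: str):
    """Produce a plan first; execute only after the user confirms"""
    plan = client.chat.completions.create(
        model="gpt-4",
        response_model=Plan,
        messages=[
            {"role": "system", "content": "Plan the tool calls for this query. Do not execute anything. Ask a clarifying question if details are missing."},
            {"role": "user", "content": query},
        ],
    )
    if plan.clarifying_question:
        return plan.clarifying_question  # surface the question instead of acting
    for step in plan.steps:
        print(f"{step.tool.value}({step.arguments}) - {step.rationale}")
    if input("Execute this plan? [y/n] ").lower() != "y":
        return "Plan cancelled"
    # Hand the approved tools to the sequential executor from step 3
    return asyncio.run(execute_sequential([s.tool.value for s in plan.steps], query))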

Common Questions

"How do I handle complex conversations?"

Use Finite State Machines (FSMs).

  • Don't just rely on one giant prompt.
  • Define states: Introduction -> Data Collection -> Confirmation -> Execution.
  • Use different system prompts and available tools for each state.
  • Example: In Data Collection, the only available tool might be validate_input(). In Execution, it's submit_order().
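
A sketch of state-scoped configuration (the states, prompts, and tool names are illustrative):

class ConversationState(str, Enum):
    INTRODUCTION = "introduction"
    DATA_COLLECTION = "data_collection"
    CONFIRMATION = "confirmation"
    EXECUTION = "execution"

# Each state gets its own system prompt and a restricted tool set
STATE_CONFIG = {
    ConversationState.INTRODUCTION: {
        "system_prompt": "Greet the user and identify what they need.",
        "tools": [],
    },
    ConversationState.DATA_COLLECTION: {
        "system_prompt": "Collect and validate the required fields.",
        "tools": ["validate_input"],
    },
    ConversationState.CONFIRMATION: {
        "system_prompt": "Summarize the request and ask the user to confirm.",
        "tools": [],
    },
    ConversationState.EXECUTION: {
        "system_prompt": "Carry out the confirmed request.",
        "tools": ["submit_order"],
    },
}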

"How do I generate synthetic data for agents?"

Generate synthetic plans, not just questions.

  • Create synthetic user queries.
  • Have a strong model (GPT-4) generate the "ideal plan" (sequence of tool calls).
  • Verify that executing this plan yields the correct result.
  • Use these (Query -> Plan) pairs to fine-tune smaller models.
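
A sketch of generating (Query -> Plan) pairs, reusing the PlannedCall model from the plan-then-execute sketch above (the names here are illustrative):

class QueryPlanPair(BaseModel):
    query: str
    ideal_plan: List[PlannedCall]

def generate_training_pairs(n: int = 50) -> List[QueryPlanPair]:
    """Have a strong model propose queries paired with ideal tool-call plans"""
    return client.chat.completions.create(
        model="gpt-4",
        response_model=List[QueryPlanPair],
        messages=[
            {"role": "system", "content": f"Available tools: {list(TOOL_DEFINITIONS.keys())}"},
            {"role": "user", "content": f"Generate {n} diverse user queries, each with the ideal sequence of tool calls."},
        ],
    )

# Verify each plan executes correctly before adding it to the fine-tuning set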

Next Steps

  • Define your tool set and create clear descriptions
  • Generate test cases covering failure modes
  • Establish baseline metrics
  • Iterate on system prompts and few-shot examples
  • Monitor per-tool performance to identify specific weaknesses
