Agentic RAG: Tool Selection and Orchestration
Build multi-tool RAG systems with systematic evaluation and improvement
Overview
Modern AI applications rarely rely on retrieval alone—they need to search databases, call APIs, execute code, and combine multiple information sources. This guide applies the same systematic improvement methodology to tool selection that you've learned for retrieval optimization.
The approach follows the evaluation flywheel:
- Establish metrics to measure tool selection quality
- Generate test cases that expose weaknesses
- Implement improvements (system prompts, few-shot examples)
- Measure gains and iterate
Why Tool Orchestration Matters
- Expand Capabilities: Handle queries requiring multiple data sources
- Complex Workflows: Coordinate multi-step operations
- Specialization: Use best tool for each subtask
- Real-World Applications: Most production systems need more than just retrieval
Key Concepts
Tool Selection Metrics
Adapt precision and recall to measure correct tool choice (a short worked example follows the list):
- Precision: Of all tools selected, how many were correct?
  - Formula: (correctly selected tools) / (total selected tools)
  - Penalizes over-selection
- Recall: Of all needed tools, how many were selected?
  - Formula: (correctly selected tools) / (total needed tools)
  - Penalizes under-selection
- F1 Score: Harmonic mean balancing precision and recall
  - Formula: 2 * (precision * recall) / (precision + recall)
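A minimal worked example in Python (the expected and selected sets here are hypothetical):

# Worked example for a single query (hypothetical expected/selected sets)
expected = {"vector_search", "sql_query"}
selected = {"vector_search", "web_search"}

correct = expected & selected                        # {"vector_search"}
precision = len(correct) / len(selected)             # 1/2 = 0.5 (web_search was over-selected)
recall = len(correct) / len(expected)                # 1/2 = 0.5 (sql_query was missed)
f1 = 2 * precision * recall / (precision + recall)   # 0.5
print(precision, recall, f1)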
Execution Strategies
Parallel Execution: Run multiple tools simultaneously
- Faster for independent operations
- More complex error handling
- Higher token usage
Sequential Execution: Chain tools one after another
- Easier to manage dependencies
- Lower token usage
- Better for context-dependent tasks
Implementation Guide
1. Define Tools with Pydantic
from pydantic import BaseModel, Field
from typing import Literal, List
from enum import Enum


class ToolName(str, Enum):
    VECTOR_SEARCH = "vector_search"
    SQL_QUERY = "sql_query"
    WEB_SEARCH = "web_search"
    CALCULATOR = "calculator"
    API_CALL = "api_call"


class ToolSelection(BaseModel):
    tools: List[ToolName] = Field(description="Tools needed for this query")
    reasoning: str = Field(description="Why these tools were chosen")
    execution_order: Literal["parallel", "sequential"] = Field(
        description="How to execute the tools"
    )


# Tool definitions
TOOL_DEFINITIONS = {
    "vector_search": {
        "description": "Search internal documentation using semantic similarity",
        "use_when": "User asks about company policies, product docs, or internal knowledge"
    },
    "sql_query": {
        "description": "Query structured database for user data, orders, analytics",
        "use_when": "User asks about their account, order status, or needs data aggregation"
    },
    "web_search": {
        "description": "Search the internet for current information",
        "use_when": "Query requires recent information or external knowledge"
    },
    "calculator": {
        "description": "Perform mathematical calculations",
        "use_when": "User needs mathematical operations or conversions"
    }
}
2. Implement Tool Selector
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())


def select_tools(query: str, tool_definitions: dict) -> ToolSelection:
    """Select appropriate tools for a given query"""
    # Format tool descriptions
    tools_desc = "\n".join([
        f"{name}: {info['description']} - Use when: {info['use_when']}"
        for name, info in tool_definitions.items()
    ])

    selection = client.chat.completions.create(
        model="gpt-4",
        response_model=ToolSelection,
        messages=[
            {"role": "system", "content": f"""
You are a tool selection expert. Choose the appropriate tools for each query.

Available tools:
{tools_desc}

Select only the tools actually needed. Avoid over-selection.
"""},
            {"role": "user", "content": f"Query: {query}"}
        ]
    )
    return selection
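A quick usage sketch (the query string is hypothetical):

# Example call; relies on select_tools and TOOL_DEFINITIONS defined above
selection = select_tools("What's the status of order #12345?", TOOL_DEFINITIONS)
print(selection.tools)            # e.g. [ToolName.SQL_QUERY]
print(selection.execution_order)  # "parallel" or "sequential"
print(selection.reasoning)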
3. Execute Tools
import asyncio
from typing import Any, Dict, List

# vector_search, sql_query, web_search, and calculator are assumed to be
# async functions implemented elsewhere in your application.


async def execute_tool(tool_name: str, query: str) -> Dict[str, Any]:
    """Execute a single tool"""
    if tool_name == "vector_search":
        return await vector_search(query)
    elif tool_name == "sql_query":
        return await sql_query(query)
    elif tool_name == "web_search":
        return await web_search(query)
    elif tool_name == "calculator":
        return await calculator(query)
    raise ValueError(f"Unknown tool: {tool_name}")


async def execute_parallel(tools: List[str], query: str):
    """Execute tools in parallel"""
    tasks = [execute_tool(tool, query) for tool in tools]
    results = await asyncio.gather(*tasks)
    return dict(zip(tools, results))


async def execute_sequential(tools: List[str], query: str):
    """Execute tools sequentially"""
    results = {}
    context = query
    for tool in tools:
        result = await execute_tool(tool, context)
        results[tool] = result
        # Update context with previous results
        context = f"{query}\n\nPrevious results: {result}"
    return results
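Tying selection and execution together, a minimal orchestration sketch (answer_query is a hypothetical name, and a final answer-synthesis step is omitted):

# Minimal orchestration sketch: select tools, then run the chosen execution strategy.
async def answer_query(query: str) -> Dict[str, Any]:
    selection = select_tools(query, TOOL_DEFINITIONS)
    tool_names = [tool.value for tool in selection.tools]

    if not tool_names:
        return {}  # conversational query - no tools needed

    if selection.execution_order == "parallel":
        return await execute_parallel(tool_names, query)
    return await execute_sequential(tool_names, query)

# Usage: results = asyncio.run(answer_query("What's the status of order #12345?"))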
4. Evaluation Framework
from dataclasses import dataclass
from typing import Set


@dataclass
class ToolSelectionExample:
    query: str
    expected_tools: Set[str]
    selected_tools: Set[str]


def calculate_metrics(examples: List[ToolSelectionExample]):
    """Calculate precision, recall, F1 for tool selection"""
    total_precision = 0
    total_recall = 0

    for example in examples:
        expected = example.expected_tools
        selected = example.selected_tools
        correct = expected.intersection(selected)

        if len(selected) == 0:
            # Nothing selected: perfect precision if nothing was needed, else 0
            precision = 1 if len(expected) == 0 else 0
        else:
            precision = len(correct) / len(selected)

        if len(expected) == 0:
            # No tools needed: recall is trivially perfect (over-selection is penalized by precision)
            recall = 1
        else:
            recall = len(correct) / len(expected)

        total_precision += precision
        total_recall += recall

    avg_precision = total_precision / len(examples)
    avg_recall = total_recall / len(examples)
    f1 = (
        2 * (avg_precision * avg_recall) / (avg_precision + avg_recall)
        if (avg_precision + avg_recall) > 0
        else 0
    )

    return {
        'precision': avg_precision,
        'recall': avg_recall,
        'f1': f1
    }
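A small usage sketch with hypothetical labeled examples:

# Hypothetical labeled examples
examples = [
    ToolSelectionExample(
        query="What's the status of order #12345?",
        expected_tools={"sql_query"},
        selected_tools={"sql_query", "vector_search"},  # over-selection
    ),
    ToolSelectionExample(
        query="Thanks for your help!",
        expected_tools=set(),
        selected_tools=set(),
    ),
]
print(calculate_metrics(examples))  # {'precision': 0.75, 'recall': 1.0, 'f1': 0.857...}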
5. Generate Test Cases
class TestQuery(BaseModel):
    query: str
    expected_tools: List[ToolName]
    difficulty: Literal["easy", "medium", "hard"]
    failure_mode: str


def generate_test_cases(tool_definitions: dict, num_cases: int = 100):
    """Generate synthetic test cases targeting failure modes"""
    test_cases = client.chat.completions.create(
        model="gpt-4",
        response_model=List[TestQuery],
        messages=[
            {"role": "system", "content": f"""
Generate {num_cases} test queries covering these failure modes:
1. Ambiguous queries (could use multiple tools)
2. Multi-step queries (need sequential execution)
3. Queries needing no tools (conversational)
4. Complex queries requiring 3+ tools
5. Context-dependent queries

Tools available: {list(tool_definitions.keys())}
"""},
            {"role": "user", "content": "Generate diverse test cases"}
        ]
    )
    return test_cases
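To establish a baseline, the generated cases can be run through the selector and scored (evaluate_baseline is a hypothetical helper built from the pieces above):

def evaluate_baseline(test_cases: List[TestQuery]) -> dict:
    """Run the current selector over generated test cases and score it."""
    examples = []
    for case in test_cases:
        selection = select_tools(case.query, TOOL_DEFINITIONS)
        examples.append(ToolSelectionExample(
            query=case.query,
            expected_tools={tool.value for tool in case.expected_tools},
            selected_tools={tool.value for tool in selection.tools},
        ))
    return calculate_metrics(examples)

# baseline = evaluate_baseline(generate_test_cases(TOOL_DEFINITIONS, num_cases=50))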
Improvement Strategies
1. System Prompts
IMPROVED_SYSTEM_PROMPT = """
You are a tool selection expert. Your job is to identify the minimal set of tools needed.
CRITICAL RULES:
1. Only select tools actually required - DO NOT over-select
2. Consider if the query can be answered without tools (conversational)
3. For multi-step queries, select tools in the order they'll be used
4. If a query is ambiguous, select the most likely tool set
Tool Selection Guidelines:
- vector_search: ONLY for internal company knowledge
- sql_query: ONLY for structured data queries (orders, users, analytics)
- web_search: ONLY for current events or external information
- calculator: ONLY for mathematical operations
Common Mistakes to Avoid:
- Don't use vector_search for general knowledge questions
- Don't use sql_query for questions that don't need database access
- Don't select multiple tools when one suffices
"""
2. Few-Shot Examples
FEW_SHOT_EXAMPLES = [
    {
        "query": "What's the status of order #12345?",
        "tools": ["sql_query"],
        "reasoning": "Needs database access for order info"
    },
    {
        "query": "What is our return policy?",
        "tools": ["vector_search"],
        "reasoning": "Internal policy document retrieval"
    },
    {
        "query": "Find blue shirts under $50 from our catalog",
        "tools": ["vector_search"],
        "reasoning": "Product search with filters"
    },
    {
        "query": "What's trending on social media about our brand?",
        "tools": ["web_search"],
        "reasoning": "Current external information"
    },
    {
        "query": "Thanks for your help!",
        "tools": [],
        "reasoning": "Conversational - no tools needed"
    }
]


def select_tools_with_examples(query: str):
    """Tool selection with few-shot examples"""
    # Format examples
    examples_text = "\n\n".join([
        f"Query: {ex['query']}\nTools: {ex['tools']}\nReasoning: {ex['reasoning']}"
        for ex in FEW_SHOT_EXAMPLES
    ])

    selection = client.chat.completions.create(
        model="gpt-4",
        response_model=ToolSelection,
        messages=[
            {"role": "system", "content": IMPROVED_SYSTEM_PROMPT},
            {"role": "user", "content": f"Examples:\n{examples_text}\n\nNow classify: {query}"}
        ]
    )
    return selection
Performance Analysis
Per-Tool Metrics
def analyze_per_tool_performance(examples):
    """Identify which tools are selected incorrectly"""
    tool_stats = {}

    for example in examples:
        for tool in example.expected_tools:
            if tool not in tool_stats:
                tool_stats[tool] = {'correct': 0, 'missed': 0, 'false_positive': 0}
            if tool in example.selected_tools:
                tool_stats[tool]['correct'] += 1
            else:
                tool_stats[tool]['missed'] += 1

        for tool in example.selected_tools:
            if tool not in example.expected_tools:
                if tool not in tool_stats:
                    tool_stats[tool] = {'correct': 0, 'missed': 0, 'false_positive': 0}
                tool_stats[tool]['false_positive'] += 1

    # Calculate per-tool precision/recall
    for tool, stats in tool_stats.items():
        selected_count = stats['correct'] + stats['false_positive']
        expected_count = stats['correct'] + stats['missed']
        precision = stats['correct'] / selected_count if selected_count > 0 else 0
        recall = stats['correct'] / expected_count if expected_count > 0 else 0
        print(f"{tool}: Precision={precision:.2%}, Recall={recall:.2%}")
Expected Improvements
With systematic optimization:
- Baseline: ~40-50% F1 score
- With system prompts: ~60-70% F1 score (+50% improvement)
- With few-shot examples: ~75-85% F1 score (+75-100% improvement)
Common Issues
Over-Selection
Problem: System selects too many tools
Solution: Emphasize minimalism in system prompt, add negative examples
Under-Selection
Problem: System misses necessary tools
Solution: Add few-shot examples covering multi-tool scenarios
Context Confusion
Problem: Wrong tools for ambiguous queries
Solution: Add clarification step or provide more context in system prompt
Advanced Pattern: Plan-then-Execute
From Production: "Instead of having the language model immediately execute functions one at a time, prompt it to show the entire plan to the user and potentially ask for confirmation."
The Workflow:
- User: "Book me a flight to NYC next Tuesday."
- Agent (Planning): "I'll check flights for next Tuesday. Do you have a preferred airline or time?" (Returns plan, doesn't call API yet).
- User: "United, morning."
- Agent (Execution): Calls search_flights(airline='United', time='morning').
Benefits:
- Safety: User confirms actions before they happen.
- Data: User-approved plans become excellent training data for fine-tuning.
- Accuracy: Separating planning from execution reduces hallucination (see the sketch below).
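A minimal sketch of the pattern, reusing the client, ToolName, and execute_tool pieces from earlier; the PlannedCall and Plan models and the propose_plan helper are assumptions, not an established API:

class PlannedCall(BaseModel):
    tool: ToolName
    arguments: Dict[str, Any] = Field(description="Arguments the tool will be called with")


class Plan(BaseModel):
    steps: List[PlannedCall]
    summary: str = Field(description="Plain-language description of the plan to show the user")
    needs_confirmation: bool = Field(description="Whether to ask the user before executing")


def propose_plan(query: str) -> Plan:
    """Planning phase: return a plan for review - no tools are executed here."""
    return client.chat.completions.create(
        model="gpt-4",
        response_model=Plan,
        messages=[
            {"role": "system", "content": "Plan the tool calls needed for this query. Do not execute anything; ask for confirmation when details are missing."},
            {"role": "user", "content": query},
        ],
    )

# The execution phase runs only after the user approves plan.summary,
# e.g. by dispatching each step through execute_tool().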
Common Questions
"How do I handle complex conversations?"
Use Finite State Machines (FSMs).
- Don't just rely on one giant prompt.
- Define states: Introduction -> Data Collection -> Confirmation -> Execution.
- Use different system prompts and available tools for each state.
- Example: In Data Collection, the only available tool might be validate_input(). In Execution, it's submit_order() (sketched below).
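A minimal sketch of per-state configuration (the ConversationState names and the STATE_CONFIG mapping are hypothetical):

from enum import Enum

class ConversationState(str, Enum):
    INTRODUCTION = "introduction"
    DATA_COLLECTION = "data_collection"
    CONFIRMATION = "confirmation"
    EXECUTION = "execution"

# Each state gets its own system prompt and its own small tool set.
STATE_CONFIG = {
    ConversationState.DATA_COLLECTION: {
        "system_prompt": "Collect and validate the fields required for the order before moving on.",
        "tools": ["validate_input"],
    },
    ConversationState.EXECUTION: {
        "system_prompt": "All fields are confirmed. Submit the order.",
        "tools": ["submit_order"],
    },
}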
"How do I generate synthetic data for agents?"
Generate synthetic plans, not just questions.
- Create synthetic user queries.
- Have a strong model (GPT-4) generate the "ideal plan" (sequence of tool calls).
- Verify that executing this plan yields the correct result.
- Use these (Query -> Plan) pairs to fine-tune smaller models (a sketch follows).
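One way to sketch that pipeline with the same instructor client (QueryPlanPair and generate_training_pairs are hypothetical names):

class QueryPlanPair(BaseModel):
    query: str = Field(description="Synthetic user query")
    plan: List[ToolName] = Field(description="Ideal ordered sequence of tool calls")


def generate_training_pairs(num_pairs: int = 20) -> List[QueryPlanPair]:
    return client.chat.completions.create(
        model="gpt-4",
        response_model=List[QueryPlanPair],
        messages=[
            {"role": "system", "content": f"Generate realistic user queries and the ideal ordered tool-call plan for each. Available tools: {[t.value for t in ToolName]}"},
            {"role": "user", "content": f"Generate {num_pairs} query/plan pairs"},
        ],
    )

# Verify each plan by executing it (e.g. via execute_sequential) before
# adding the (query, plan) pair to a fine-tuning dataset.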
Next Steps
- Define your tool set and create clear descriptions
- Generate test cases covering failure modes
- Establish baseline metrics
- Iterate on system prompts and few-shot examples
- Monitor per-tool performance to identify specific weaknesses