Context Window Management
Techniques for handling token limits, applying sliding windows, and optimizing context utilization in RAG.
Overview
The "Context Window" is the maximum amount of text (measured in tokens) an LLM can process in a single request. While modern models boast 128k+ windows, stuffing them full is bad for cost, latency, and accuracy (the "Lost in the Middle" phenomenon). Effective management is key to high-performance RAG.
The "Lost in the Middle" Phenomenon
Research shows that LLMs are best at using information placed at the beginning and end of the prompt; performance degrades significantly for information buried in the middle.
Implication: Simply increasing top_k from 5 to 50 to "catch everything" often hurts performance.
Strategies for Context Management
1. Smart Chunking
Garbage in, garbage out. If your chunks are cut mid-sentence, the LLM loses context.
- Recursive Character Splitter: Split by paragraphs, then sentences, then words.
- Semantic Chunking: Break text based on semantic shifts rather than fixed character counts.
- Sentence Window Retrieval: Embed single sentences for search, but retrieve the surrounding 5 sentences for context.
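A minimal sketch of sentence-window retrieval in plain Python. The naive regex sentence splitter, the `build_sentence_windows` helper, and its field names are illustrative assumptions, not a specific library's API.

```python
import re

def build_sentence_windows(text, window_size=2):
    """Split text into sentences and attach a window of neighbors to each one.
    Embed the single sentence for search; hand the larger window to the LLM."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({
            "sentence": sentence,                      # embed this for retrieval
            "window": " ".join(sentences[start:end]),  # return this as context
        })
    return records
```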
2. Context Compression
Reduce the noise before it hits the LLM.
- LLMLingua / LongLLMLingua: Techniques to compress prompts by removing non-essential tokens (stopwords, redundant phrases) while preserving meaning.
- Summarization Chain: Retrieve 10 documents -> Summarize each independently -> Feed summaries to final answer generation.
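A sketch of the summarization-chain pattern, assuming an `llm_complete(prompt) -> str` callable that stands in for whatever LLM client you use:

```python
def compress_then_answer(question, documents, llm_complete):
    """Summarize each retrieved document independently (map), then answer
    the question from the summaries instead of the raw text (reduce)."""
    summaries = []
    for doc in documents:
        summaries.append(llm_complete(
            f"Summarize the parts of this document relevant to the question.\n"
            f"Question: {question}\n\nDocument:\n{doc}"
        ))
    context = "\n\n".join(summaries)
    return llm_complete(
        f"Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```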
3. Re-ranking & Re-ordering
Optimize the layout of the prompt.
- Re-ranking: Use a Cross-Encoder (e.g., ms-marco-MiniLM) to score the retrieved documents against the query. Keep only the high-scoring ones.
- Re-ordering: Place the most relevant documents at the beginning and end of the context block, pushing less relevant ones to the middle.
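A sketch using the `CrossEncoder` class from sentence-transformers; the specific model checkpoint and the simple alternating re-order are illustrative choices rather than the only way to do this.

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

def rerank_and_reorder(query, docs, top_k=5):
    """Score documents with a cross-encoder, keep the top_k, then place the
    strongest hits at the start and end of the context block (weakest in the
    middle) to counter the "Lost in the Middle" effect."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in docs])
    ranked = [doc for _, doc in
              sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)][:top_k]

    # Walk from least to most relevant, alternating front/back inserts so the
    # weakest documents end up in the middle of the final ordering.
    reordered = []
    for doc in reversed(ranked):
        if len(reordered) % 2 == 0:
            reordered.insert(0, doc)
        else:
            reordered.append(doc)
    return reordered
```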
4. Sliding Window (For Long Documents)
If you must process a massive document that exceeds the window (e.g., a book):
- Approach:
  - Chunk the document into windows (e.g., 4k tokens) with overlap (e.g., 500 tokens).
  - Process each window independently (e.g., "Extract key entities").
  - Aggregate the results.
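A sketch of the sliding-window loop, using tiktoken to measure windows in tokens; `llm_complete` and the comma-separated entity format are assumptions for illustration.

```python
import tiktoken

def sliding_windows(text, window_tokens=4000, overlap_tokens=500, model="gpt-4"):
    """Yield overlapping windows of a long document, measured in tokens."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    step = window_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        yield encoding.decode(tokens[start:start + window_tokens])
        if start + window_tokens >= len(tokens):
            break

def extract_entities(document, llm_complete):
    """Run an extraction prompt over each window and merge the results."""
    entities = set()
    for window in sliding_windows(document):
        response = llm_complete(f"Extract key entities (comma-separated) from:\n{window}")
        entities.update(e.strip() for e in response.split(",") if e.strip())
    return sorted(entities)
```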
Token Counting Implementation
Always count tokens before sending requests to avoid API errors.
```python
import tiktoken

def count_tokens(text, model="gpt-4"):
    """Count tokens using the tokenizer of the target model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def truncate_context(context_chunks, max_tokens=4000):
    """Greedily keep chunks (in retrieval order) until the token budget is hit."""
    current_tokens = 0
    selected_chunks = []
    for chunk in context_chunks:
        chunk_tokens = count_tokens(chunk)
        if current_tokens + chunk_tokens > max_tokens:
            break
        selected_chunks.append(chunk)
        current_tokens += chunk_tokens
    return selected_chunks
```
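A quick usage sketch (the passages are placeholders). Note that `truncate_context` keeps chunks greedily in retrieval order and stops at the first chunk that would exceed the budget, so re-rank before truncating.

```python
chunks = ["First retrieved passage...", "Second retrieved passage...", "Third..."]
kept = truncate_context(chunks, max_tokens=4000)
print(f"Keeping {len(kept)} of {len(chunks)} chunks "
      f"({sum(count_tokens(c) for c in kept)} tokens)")
```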
Choosing the Right Window Size
| Model | Context Window (tokens) | Best For |
|---|---|---|
| Llama 3 8B | 8k | Fast, simple RAG tasks. |
| GPT-4o | 128k | Complex reasoning over many docs. |
| Claude 3.5 Sonnet | 200k | Massive document analysis, coding. |
| Gemini 1.5 Pro | 1M+ | "Needle in a haystack" across entire codebases/books. |
Rule of Thumb: Just because you have 128k tokens doesn't mean you should use them. Aim for the minimum sufficient context.
The Reality of Long Context Windows
From Production: "Even though Amazon could theoretically score every product in their inventory for each user, they choose not to, because each 100ms of latency costs them 1% in revenue. We still need to make choices about what to include in context."
The Battery Analogy: iPhone batteries get more powerful every year, but battery life stays the same because we build more power-hungry apps. Similarly, as context windows grow, we'll find ways to use that additional capacity (e.g., for reasoning, few-shot examples, or history) rather than just dumping entire databases into the prompt.
Why RAG survives 1M+ context windows:
- Latency: Processing 1M tokens takes seconds/minutes. Search takes milliseconds.
- Cost: Input tokens cost money. RAG filters the noise.
- Focus: Models still hallucinate less when given specific, relevant context.
Common Questions
"Does Gemini 1.5 Pro's 1M context window make RAG obsolete?"
No. It changes what you retrieve, not whether you retrieve.
- Old RAG: Retrieve 5 chunks of 500 tokens.
- New RAG: Retrieve 5 whole documents of 50 pages each.
- Benefit: You no longer need perfect chunking, but you still need retrieval to find the right documents.
"How do I handle documents that change over time?"
Include dates in the context.
- If you have HR policies for 2023 and 2024, put both in the context with explicit dates.
- Let the LLM reason: "According to the 2024 policy..."
- Don't rely solely on the vector DB to filter by date unless it's a hard constraint.
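A minimal sketch of prefixing chunks with their effective dates before building the prompt; the metadata field names and the policy text are hypothetical.

```python
def format_chunk_with_date(chunk):
    """Prefix each chunk with its effective date so the model can reason
    about which version of a policy applies."""
    return f"[Effective: {chunk['effective_date']}]\n{chunk['text']}"

chunks = [
    {"effective_date": "2023-01-01", "text": "PTO accrues at 1.25 days per month."},
    {"effective_date": "2024-01-01", "text": "PTO accrues at 1.5 days per month."},
]
context = "\n\n".join(format_chunk_with_date(c) for c in chunks)
```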
Next Steps
- RAG Cost Optimization - Save money by managing tokens.
- Deploying RAG to Production - System architecture.