Context Window Management
Techniques for handling token limits, applying sliding windows, and optimizing context utilization in RAG.
Overview
The "Context Window" is the maximum amount of text (measured in tokens) an LLM can process in a single request. While modern models boast 128k+ windows, stuffing them full is bad for cost, latency, and accuracy (the "Lost in the Middle" phenomenon). Effective management is key to high-performance RAG.
The "Lost in the Middle" Phenomenon
Research shows that LLMs are best at using information placed at the beginning and end of the prompt; performance degrades significantly for information buried in the middle.
Implication: Simply increasing top_k from 5 to 50 to "catch everything" often hurts performance.
Strategies for Context Management
1. Smart Chunking
Garbage in, garbage out. If your chunks are cut mid-sentence, the LLM loses context.
- Recursive Character Splitter: Split by paragraphs, then sentences, then words.
- Semantic Chunking: Break text based on semantic shifts rather than fixed character counts.
- Sentence Window Retrieval: Embed single sentences for search, but retrieve the surrounding 5 sentences for context.
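A minimal sketch of sentence-window retrieval in plain Python. The naive regex sentence splitter, the `build_sentence_windows` helper, and its field names are illustrative assumptions, not a specific library's API.

```python
import re

def build_sentence_windows(text, window_size=2):
    """Split text into sentences and attach a window of neighbors to each one.
    Embed the single sentence for search; hand the larger window to the LLM."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({
            "sentence": sentence,                      # embed this for retrieval
            "window": " ".join(sentences[start:end]),  # return this as context
        })
    return records
```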
2. Context Compression
Reduce the noise before it hits the LLM.
- LLMLingua / LongLLMLingua: Techniques to compress prompts by removing non-essential tokens (stopwords, redundant phrases) while preserving meaning.
- Summarization Chain: Retrieve 10 documents -> Summarize each independently -> Feed summaries to final answer generation.
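A sketch of the summarization-chain pattern, assuming an `llm_complete(prompt) -> str` callable that stands in for whatever LLM client you use:

```python
def compress_then_answer(question, documents, llm_complete):
    """Summarize each retrieved document independently (map), then answer
    the question from the summaries instead of the raw text (reduce)."""
    summaries = []
    for doc in documents:
        summaries.append(llm_complete(
            f"Summarize the parts of this document relevant to the question.\n"
            f"Question: {question}\n\nDocument:\n{doc}"
        ))
    context = "\n\n".join(summaries)
    return llm_complete(
        f"Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```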
3. Re-ranking & Re-ordering
Optimize the layout of the prompt.
- Re-ranking: Use a Cross-Encoder (e.g., ms-marco-MiniLM) to score the retrieved documents against the query. Keep only the high-scoring ones.
- Re-ordering: Place the most relevant documents at the beginning and end of the context block, pushing less relevant ones to the middle.
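A sketch using the `CrossEncoder` class from sentence-transformers; the specific model checkpoint and the simple alternating re-order are illustrative choices rather than the only way to do this.

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

def rerank_and_reorder(query, docs, top_k=5):
    """Score documents with a cross-encoder, keep the top_k, then place the
    strongest hits at the start and end of the context block (weakest in the
    middle) to counter the "Lost in the Middle" effect."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in docs])
    ranked = [doc for _, doc in
              sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)][:top_k]

    # Walk from least to most relevant, alternating front/back inserts so the
    # weakest documents end up in the middle of the final ordering.
    reordered = []
    for doc in reversed(ranked):
        if len(reordered) % 2 == 0:
            reordered.insert(0, doc)
        else:
            reordered.append(doc)
    return reordered
```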
4. Sliding Window (For Long Documents)
If you must process a massive document that exceeds the window (e.g., a book):
- Approach:
  - Chunk the document into windows (e.g., 4k tokens) with overlap (e.g., 500 tokens).
  - Process each window independently (e.g., "Extract key entities").
  - Aggregate the results.
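A sketch of the sliding-window loop, using tiktoken to measure windows in tokens; `llm_complete` and the comma-separated entity format are assumptions for illustration.

```python
import tiktoken

def sliding_windows(text, window_tokens=4000, overlap_tokens=500, model="gpt-4"):
    """Yield overlapping windows of a long document, measured in tokens."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    step = window_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        yield encoding.decode(tokens[start:start + window_tokens])
        if start + window_tokens >= len(tokens):
            break

def extract_entities(document, llm_complete):
    """Run an extraction prompt over each window and merge the results."""
    entities = set()
    for window in sliding_windows(document):
        response = llm_complete(f"Extract key entities (comma-separated) from:\n{window}")
        entities.update(e.strip() for e in response.split(",") if e.strip())
    return sorted(entities)
```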
Token Counting Implementation
Always count tokens before sending requests to avoid API errors.
```python
import tiktoken

def count_tokens(text, model="gpt-4"):
    """Count tokens using the tokenizer of the target model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def truncate_context(context_chunks, max_tokens=4000):
    """Greedily keep chunks (in retrieval order) until the token budget is hit."""
    current_tokens = 0
    selected_chunks = []
    for chunk in context_chunks:
        chunk_tokens = count_tokens(chunk)
        if current_tokens + chunk_tokens > max_tokens:
            break
        selected_chunks.append(chunk)
        current_tokens += chunk_tokens
    return selected_chunks
```
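A quick usage sketch (the passages are placeholders). Note that `truncate_context` keeps chunks greedily in retrieval order and stops at the first chunk that would exceed the budget, so re-rank before truncating.

```python
chunks = ["First retrieved passage...", "Second retrieved passage...", "Third..."]
kept = truncate_context(chunks, max_tokens=4000)
print(f"Keeping {len(kept)} of {len(chunks)} chunks "
      f"({sum(count_tokens(c) for c in kept)} tokens)")
```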
Choosing the Right Window Size
| Model | Context Window (tokens) | Best For |
|---|---|---|
| Llama 3 8B | 8k | Fast, simple RAG tasks. |
| GPT-4o | 128k | Complex reasoning over many docs. |
| Claude 3.5 Sonnet | 200k | Massive document analysis, coding. |
| Gemini 1.5 Pro | 1M+ | "Needle in a haystack" across entire codebases/books. |
Rule of Thumb: Just because you have 128k tokens doesn't mean you should use them. Aim for the minimum sufficient context.
The Reality of Long Context Windows
From Production: "Even though Amazon could theoretically score every product in their inventory for each user, they choose not to, because each 100ms of latency costs them 1% in revenue. We still need to make choices about what to include in context."
The Battery Analogy: iPhone batteries get more powerful every year, but battery life stays the same because we build more power-hungry apps. Similarly, as context windows grow, we'll find ways to use that additional capacity (e.g., for reasoning, few-shot examples, or history) rather than just dumping entire databases into the prompt.
Why RAG survives 1M+ context windows:
- Latency: Processing 1M tokens takes seconds/minutes. Search takes milliseconds.
- Cost: Input tokens cost money. RAG filters the noise.
- Focus: Models still hallucinate less when given specific, relevant context.
Common Questions
"Does Gemini 1.5 Pro's 1M context window make RAG obsolete?"
No. It changes what you retrieve, not whether you retrieve.
- Old RAG: Retrieve 5 chunks of 500 tokens.
- New RAG: Retrieve 5 whole documents of 50 pages each.
- Benefit: You no longer need perfect chunking, but you still need retrieval to find the right documents.
"How do I handle documents that change over time?"
Include dates in the context.
- If you have HR policies for 2023 and 2024, put both in the context with explicit dates.
- Let the LLM reason: "According to the 2024 policy..."
- Don't rely solely on the vector DB to filter by date unless it's a hard constraint.
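A minimal sketch of prefixing chunks with their effective dates before building the prompt; the metadata field names and the policy text are hypothetical.

```python
def format_chunk_with_date(chunk):
    """Prefix each chunk with its effective date so the model can reason
    about which version of a policy applies."""
    return f"[Effective: {chunk['effective_date']}]\n{chunk['text']}"

chunks = [
    {"effective_date": "2023-01-01", "text": "PTO accrues at 1.25 days per month."},
    {"effective_date": "2024-01-01", "text": "PTO accrues at 1.5 days per month."},
]
context = "\n\n".join(format_chunk_with_date(c) for c in chunks)
```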
Next Steps
- RAG Cost Optimization - Save money by managing tokens.
- Deploying RAG to Production - System architecture.