Chunking Strategies
Split documents into optimal chunks for embedding and retrieval in RAG systems.
Overview
Chunking is the process of splitting documents into smaller pieces before embedding them. Chunk size is one of the most important hyperparameters in a RAG system: it determines both what the retriever can find and what the LLM sees as context.
Why Chunking Matters
- Too small: Loses context, poor retrieval
- Too large: Exceeds context window, dilutes relevance
- Just right: Balances context and specificity
Fixed-Size Chunking
Simplest approach: split by character or token count.
```python
import tiktoken


def chunk_by_tokens(text, chunk_size=512, overlap=50):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    # Slide a window of chunk_size tokens, stepping by chunk_size - overlap
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(enc.decode(chunk_tokens))
    return chunks
```
- Pros: Simple, predictable
- Cons: Breaks mid-sentence, ignores document structure
Recursive Character Splitting
Split by natural boundaries (paragraphs, sentences).
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document)
```
- Pros: Respects natural boundaries
- Cons: Still arbitrary, ignores semantics
Semantic Chunking
Split based on meaning, not length.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunk(text, threshold=0.5):
    sentences = text.split('. ')
    # Normalize embeddings so the dot product below equals cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity with the previous sentence
        similarity = np.dot(embeddings[i], embeddings[i - 1])
        if similarity < threshold:
            # Topic shift detected: start a new chunk
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append('. '.join(current_chunk))
    return chunks
```
- Pros: Semantically coherent chunks
- Cons: Slower, variable chunk sizes
Document-Aware Chunking
Respect document structure (headings, sections).
```python
import re


def chunk_by_sections(markdown_text):
    # Split on Markdown headings (levels 1-6), keeping the headings in the result
    sections = re.split(r'(?m)^(#{1,6} .+)$', markdown_text)
    chunks = []
    current_heading = ""
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if section.startswith('#'):
            current_heading = section
        else:
            # Include the heading in the chunk for context
            chunks.append(f"{current_heading}\n{section}")
    return chunks
```
Chunk Size Guidelines
| Use Case | Chunk Size (tokens) | Overlap (tokens) | Reasoning |
|---|---|---|---|
| Q&A | 256-512 | 50-100 | Short, specific answers |
| Long-form | 1000-1500 | 200-300 | Need more context |
| Code | 500-1000 | 100-200 | Function/class level |
| Legal | 1500-2000 | 300-400 | Preserve clauses |
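As a minimal sketch, these guidelines can be kept in a small preset table that feeds the token-based chunker defined earlier. The preset names and the exact values are illustrative picks from within the ranges above, not a library API:

```python
# Illustrative per-use-case presets drawn from the guideline table
CHUNKING_PRESETS = {
    "qa":        {"chunk_size": 384,  "overlap": 75},
    "long_form": {"chunk_size": 1200, "overlap": 250},
    "code":      {"chunk_size": 750,  "overlap": 150},
    "legal":     {"chunk_size": 1800, "overlap": 350},
}

def chunk_for_use_case(text, use_case="qa"):
    params = CHUNKING_PRESETS[use_case]
    # chunk_by_tokens is the fixed-size chunker defined above
    return chunk_by_tokens(text, chunk_size=params["chunk_size"], overlap=params["overlap"])
```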
Overlap Strategy
```python
def chunk_with_overlap(text, chunk_size=500, overlap=100):
    words = text.split()
    chunks = []
    # Step by chunk_size - overlap so consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(' '.join(words[i:i + chunk_size]))
    return chunks
```
Why overlap?
- Prevents information loss at boundaries
- Improves recall for edge cases
- Typical overlap: 10-20% of chunk size
Advanced: Parent Document Retrieval
Store small chunks for retrieval, large chunks for context.
```python
class ParentDocumentRetriever:
    def __init__(self):
        self.small_chunks = []      # For embedding / search
        self.large_chunks = []      # For LLM context
        self.chunk_to_parent = {}   # child chunk id -> parent chunk id

    def add_document(self, text):
        # Create large chunks (parents)
        large = chunk_by_tokens(text, chunk_size=2000)
        # Offset parent ids so the mapping stays correct across multiple documents
        parent_offset = len(self.large_chunks)
        # Create small chunks (children)
        for i, parent in enumerate(large):
            small = chunk_by_tokens(parent, chunk_size=500)
            for child in small:
                child_id = len(self.small_chunks)
                self.small_chunks.append(child)
                self.chunk_to_parent[child_id] = parent_offset + i
        self.large_chunks.extend(large)

    def retrieve(self, query, k=5):
        # vector_search is a placeholder for your vector store's query;
        # it is assumed to return results whose `id` indexes small_chunks
        small_results = vector_search(query, self.small_chunks, k=k)
        # Return the corresponding large (parent) chunks
        parent_ids = [self.chunk_to_parent[r.id] for r in small_results]
        return [self.large_chunks[pid] for pid in parent_ids]
```
Evaluation
Test different strategies on your data:
```python
def evaluate_chunking(documents, queries, ground_truth):
    strategies = {
        'fixed': lambda d: chunk_by_tokens(d, 512),
        'recursive': lambda d: RecursiveCharacterTextSplitter().split_text(d),
        'semantic': lambda d: semantic_chunk(d),
    }
    results = {}
    for name, strategy in strategies.items():
        chunks = [strategy(doc) for doc in documents]
        # measure_recall is a placeholder for your retrieval-quality metric
        recall = measure_recall(chunks, queries, ground_truth)
        results[name] = recall
    return results
```
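The `measure_recall` call above is left abstract. One possible stand-in, sketched below, checks only whether each ground-truth answer span survives chunking intact (i.e., was not split across a chunk boundary); it assumes `ground_truth` maps each query to the exact answer string, and it does not measure embedding or retrieval quality:

```python
def measure_recall(chunked_docs, queries, ground_truth):
    # Flatten per-document chunks into one list
    all_chunks = [chunk for doc_chunks in chunked_docs for chunk in doc_chunks]
    hits = 0
    for query in queries:
        answer_span = ground_truth[query]
        # A "hit" means the answer span appears whole inside at least one chunk
        if any(answer_span in chunk for chunk in all_chunks):
            hits += 1
    return hits / len(queries) if queries else 0.0
```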
Best Practices
- Start with 512 tokens - Good default for most use cases
- Add 10-20% overlap - Prevents boundary issues
- Preserve metadata - Track source document, page number
- Test on your data - Optimal size varies by domain
- Monitor in production - Track retrieval quality metrics
Strategy for Long Documents (1000+ pages)
From Production: "If you have extremely long documents, start with a page-level approach to determine if answers typically exist on a single page or span multiple pages."
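A minimal sketch of that page-level starting point, assuming PDF input and the pypdf library (the file name is just an example):

```python
from pypdf import PdfReader

def chunk_by_pages(pdf_path):
    # One chunk per page, with the page number kept as metadata
    reader = PdfReader(pdf_path)
    chunks = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if text.strip():
            chunks.append({"page": page_number, "text": text})
    return chunks

pages = chunk_by_pages("tax_code.pdf")
```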
The RAPTOR Approach
For very long documents (legal, technical manuals), simple chunking fails because concepts span pages.
- Cluster Chunks: Embed every page/chunk and run a clustering model
- Summarize Clusters: Identify concepts that span multiple pages and summarize them
- Retrieve Summaries: Use the summaries for retrieval
- Expand Context: If a summary is retrieved, include all related pages in the context
Cost Note: This preprocessing might cost ~$10 of LLM calls per document, but for "evergreen" documents (like tax law) that don't change for years, it's a high-ROI investment.
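A minimal sketch of the cluster-and-summarize preprocessing step (not the full RAPTOR algorithm), assuming sentence-transformers for embeddings, scikit-learn KMeans for clustering, and a hypothetical `summarize(text)` helper wrapping your LLM of choice:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def build_cluster_summaries(page_chunks, n_clusters=10, summarize=None):
    # Embed every page/chunk
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(page_chunks, normalize_embeddings=True)

    # Cluster pages that discuss the same concept
    kmeans = KMeans(n_clusters=min(n_clusters, len(page_chunks)), n_init=10)
    labels = kmeans.fit_predict(embeddings)

    # Summarize each cluster; retrieval runs over these summaries, and a hit
    # expands to all member pages for the LLM context
    summaries = []
    for cluster_id in range(labels.max() + 1):
        member_ids = [i for i, label in enumerate(labels) if label == cluster_id]
        cluster_text = "\n\n".join(page_chunks[i] for i in member_ids)
        summaries.append({
            "summary": summarize(cluster_text),  # hypothetical LLM call
            "member_pages": member_ids,
        })
    return summaries
```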
Common Questions
"How do I handle metadata in chunks?"
Include it in the chunk text.
- Why: Allows answering questions like "who wrote this" or "when was this updated"
- How: Prepend metadata strings: `Title: Annual Report 2023 | Author: Jane Doe | Date: 2023-12-01 \n\n [Content...]`
- Benefit: Enables function calling (e.g., "Emily wrote this, here is her email")
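A minimal sketch of that prepending step; the field names and sample values are just examples:

```python
def prepend_metadata(chunk_text, metadata):
    # metadata is a dict such as {"title": ..., "author": ..., "date": ...}
    header = " | ".join(f"{key.title()}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{chunk_text}"

chunk = prepend_metadata(
    "Revenue grew 12% year over year...",
    {"title": "Annual Report 2023", "author": "Jane Doe", "date": "2023-12-01"},
)
```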
"Should I use semantic chunking?"
Yes, for single complex documents.
- Use Case: Proposals, RFPs, or contracts where requirements for different disciplines are scattered
- Technique: Generate synthetic questions per paragraph ("What requirements are mentioned here?") rather than just splitting by tokens
- Goal: Separate paragraphs based on semantic meaning rather than arbitrary length
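A minimal sketch of generating one synthetic question per paragraph, assuming the OpenAI Python client (the model name is an example; swap in your LLM of choice):

```python
from openai import OpenAI

client = OpenAI()

def question_per_paragraph(document):
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    pairs = []
    for paragraph in paragraphs:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{
                "role": "user",
                "content": (
                    "Write one short question that the following paragraph answers "
                    "(for example, about the requirements it mentions):\n\n" + paragraph
                ),
            }],
        )
        question = response.choices[0].message.content
        # Embed the synthetic question alongside the raw paragraph for retrieval
        pairs.append({"question": question, "paragraph": paragraph})
    return pairs
```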
"Does a larger context window mean I don't need chunking?"
No.
- Latency: Processing 1M tokens takes time and money
- Focus: "Needle in a haystack" performance degrades with context length
- Analogy: "Amazon could score every product for every user, but 100ms latency costs 1% revenue. We still need efficient retrieval."
Next Steps
- Parent Document Retrieval - Advanced chunking strategy
- Retrieval Fundamentals - Use chunks effectively