Chunking Strategies

Split documents into optimal chunks for embedding and retrieval in RAG systems.

Overview

Chunking is the process of splitting documents into smaller pieces before embedding. Chunk size is one of the most important hyperparameters in a RAG system: it determines both what the retriever can match and how much context the LLM sees.

Why Chunking Matters

  • Too small: Loses context, poor retrieval
  • Too large: Exceeds context window, dilutes relevance
  • Just right: Balances context and specificity

Fixed-Size Chunking

Simplest approach: split by character or token count.

def chunk_by_tokens(text, chunk_size=512, overlap=50):
    import tiktoken
    
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    
    chunks = []
    # Step by chunk_size - overlap so consecutive chunks share `overlap` tokens
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunk_text = enc.decode(chunk_tokens)
        chunks.append(chunk_text)
    
    return chunks

Pros: Simple, predictable
Cons: Breaks mid-sentence, ignores document structure

Recursive Character Splitting

Split by natural boundaries (paragraphs, sentences).

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document)

Pros: Respects natural boundaries
Cons: Still arbitrary, ignores semantics

Semantic Chunking

Split based on meaning, not length.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunk(text, threshold=0.5):
    sentences = text.split('. ')
    # Normalize so the dot product below equals cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)
    
    chunks = []
    current_chunk = [sentences[0]]
    
    for i in range(1, len(sentences)):
        # Calculate similarity with previous sentence
        similarity = np.dot(embeddings[i], embeddings[i-1])
        
        if similarity < threshold:
            # Start new chunk
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    
    chunks.append('. '.join(current_chunk))
    return chunks

Pros: Semantically coherent chunks
Cons: Slower, variable chunk sizes

Document-Aware Chunking

Respect document structure (headings, sections).

def chunk_by_sections(markdown_text):
    import re
    
    # Split on Markdown headings, keeping the heading text (capture group)
    sections = re.split(r'(?m)^(#{1,6} .+)$', markdown_text)
    
    chunks = []
    current_heading = ""
    
    for section in sections:
        if not section.strip():
            continue  # skip empty fragments produced by the split
        if section.startswith('#'):
            current_heading = section
        else:
            # Include the heading in the chunk for context
            chunk = f"{current_heading}\n{section.strip()}"
            chunks.append(chunk)
    
    return chunks

Chunk Size Guidelines

Use Case  | Chunk Size       | Overlap (tokens) | Reasoning
Q&A       | 256-512 tokens   | 50-100           | Short, specific answers
Long-form | 1000-1500 tokens | 200-300          | Need more context
Code      | 500-1000 tokens  | 100-200          | Function/class level
Legal     | 1500-2000 tokens | 300-400          | Preserve clauses
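
These presets can be wired straight into the token-based splitter above; a minimal sketch (the values pick the upper end of each range and are starting points, not requirements):

CHUNK_PRESETS = {
    # (chunk_size, overlap) in tokens, taken from the table above
    "qa":        (512, 100),
    "long_form": (1500, 300),
    "code":      (1000, 200),
    "legal":     (2000, 400),
}

def chunk_for_use_case(text, use_case="qa"):
    chunk_size, overlap = CHUNK_PRESETS[use_case]
    return chunk_by_tokens(text, chunk_size=chunk_size, overlap=overlap)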

Overlap Strategy

def chunk_with_overlap(text, chunk_size=500, overlap=100):
    # Word-level variant: consecutive chunks share `overlap` words
    words = text.split()
    chunks = []
    
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    
    return chunks

Why overlap?

  • Prevents information loss at boundaries
  • Improves recall for edge cases
  • Typical overlap: 10-20% of chunk size

Advanced: Parent Document Retrieval

Store small chunks for retrieval, large chunks for context.

class ParentDocumentRetriever:
    def __init__(self):
        self.small_chunks = []  # For embedding
        self.large_chunks = []  # For LLM context
        self.chunk_to_parent = {}  # Mapping
    
    def add_document(self, text):
        # Create large chunks (parents)
        large = chunk_by_tokens(text, chunk_size=2000)
        parent_offset = len(self.large_chunks)  # global index of this doc's first parent
        
        # Create small chunks (children) that point back to their parent
        for i, parent in enumerate(large):
            small = chunk_by_tokens(parent, chunk_size=500)
            
            for child in small:
                child_id = len(self.small_chunks)
                self.small_chunks.append(child)
                self.chunk_to_parent[child_id] = parent_offset + i
        
        self.large_chunks.extend(large)
    
    def retrieve(self, query, k=5):
        # Retrieve small chunks (vector_search is a placeholder for your
        # vector store's similarity search; results expose a chunk id)
        small_results = vector_search(query, self.small_chunks, k=k)
        
        # Map back to parents, deduplicating when siblings hit the same parent
        parent_ids = dict.fromkeys(self.chunk_to_parent[r.id] for r in small_results)
        return [self.large_chunks[pid] for pid in parent_ids]

Evaluation

Test different strategies on your data:

def evaluate_chunking(documents, queries, ground_truth):
    strategies = {
        'fixed': lambda d: chunk_by_tokens(d, 512),
        'recursive': lambda d: RecursiveCharacterTextSplitter().split_text(d),
        'semantic': lambda d: semantic_chunk(d)
    }
    
    results = {}
    for name, strategy in strategies.items():
        chunks = [strategy(doc) for doc in documents]  # one chunk list per document
        # measure_recall is a placeholder for your retrieval-recall metric
        recall = measure_recall(chunks, queries, ground_truth)
        results[name] = recall
    
    return results

Best Practices

  1. Start with 512 tokens - Good default for most use cases
  2. Add 10-20% overlap - Prevents boundary issues
  3. Preserve metadata - Track source document, page number (see the sketch below)
  4. Test on your data - Optimal size varies by domain
  5. Monitor in production - Track retrieval quality metrics
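
One way to follow practice 3 is to store each chunk as a record that carries its source and page alongside the text; a minimal sketch (the field names are illustrative):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    text: str
    source: str          # originating document (path, URL, or title)
    page: Optional[int]  # page number, if the source is paginated
    position: int        # index of the chunk within the document

def chunk_document(text, source, page=None):
    return [
        Chunk(text=piece, source=source, page=page, position=i)
        for i, piece in enumerate(chunk_by_tokens(text))
    ]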

Strategy for Long Documents (1000+ pages)

From Production: "If you have extremely long documents, start with a page-level approach to determine if answers typically exist on a single page or span multiple pages."
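
A minimal sketch of that page-level pass, assuming pypdf for extraction (any document parser works): each page becomes one chunk tagged with its page number, so you can check whether single-page retrieval answers your queries before investing in anything finer-grained.

from pypdf import PdfReader

def chunk_by_pages(pdf_path):
    """One chunk per page, tagged with its 1-based page number."""
    reader = PdfReader(pdf_path)
    return [
        {"page": i + 1, "text": page.extract_text() or ""}
        for i, page in enumerate(reader.pages)
    ]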

The RAPTOR Approach

For very long documents (legal, technical manuals), simple chunking fails because concepts span pages.

  1. Cluster Chunks: Embed every page/chunk and run a clustering model
  2. Summarize Clusters: Identify concepts that span multiple pages and summarize them
  3. Retrieve Summaries: Use the summaries for retrieval
  4. Expand Context: If a summary is retrieved, include all related pages in the context (sketched below)

Cost Note: This preprocessing might cost ~$10 of LLM calls per document, but for "evergreen" documents (like tax law) that don't change for years, it's a high-ROI investment.
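
A rough sketch of steps 1-2 (and the bookkeeping needed for step 4), assuming sentence-transformers for embeddings, scikit-learn KMeans for clustering, and a summarize() callable backed by an LLM of your choice (a placeholder, not a real API):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')

def build_cluster_summaries(pages, summarize, n_clusters=10):
    # Step 1: embed every page/chunk and cluster the embeddings
    embeddings = model.encode(pages)
    labels = KMeans(n_clusters=n_clusters).fit_predict(embeddings)
    
    summaries = []
    for cluster_id in range(n_clusters):
        members = [p for p, label in zip(pages, labels) if label == cluster_id]
        # Step 2: summarize the concept that spans these pages (LLM call, placeholder)
        summary = summarize("\n\n".join(members))
        # Keep the member pages so a retrieved summary can be expanded (step 4)
        summaries.append({"summary": summary, "pages": members})
    return summaries

Retrieval (step 3) then embeds and searches the summaries; when a summary matches, its member pages are added to the LLM context.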

Common Questions

"How do I handle metadata in chunks?"

Include it in the chunk text.

  • Why: Allows answering questions like "who wrote this" or "when was this updated"
  • How: Prepend metadata strings (sketched below): Title: Annual Report 2023 | Author: Jane Doe | Date: 2023-12-01 \n\n [Content...]
  • Benefit: Enables function calling (e.g., "Emily wrote this, here is her email")
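
A minimal sketch of that prepending step (the separator and field names are just conventions):

def prepend_metadata(chunk_text, title, author, date):
    # Embed the metadata together with the content so both are retrievable
    header = f"Title: {title} | Author: {author} | Date: {date}"
    return f"{header}\n\n{chunk_text}"

chunk = prepend_metadata(
    "[Content...]",
    title="Annual Report 2023", author="Jane Doe", date="2023-12-01",
)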

"Should I use semantic chunking?"

Yes, for single complex documents.

  • Use Case: Proposals, RFPs, or contracts where requirements for different disciplines are scattered
  • Technique: Generate synthetic questions per paragraph ("What requirements are mentioned here?") rather than just splitting by tokens (see the sketch below)
  • Goal: Separate paragraphs based on semantic meaning rather than arbitrary length
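
A sketch of that indexing step, assuming a generate_questions() helper backed by an LLM (a placeholder here): each paragraph is indexed under the synthetic questions it answers, and a match on a question retrieves its paragraph.

def index_with_synthetic_questions(document, generate_questions):
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    
    entries = []
    for i, paragraph in enumerate(paragraphs):
        # e.g. "What requirements are mentioned here?" -> a handful of questions
        questions = generate_questions(paragraph)  # LLM call, placeholder
        for question in questions:
            # Embed the question, but store the paragraph id as the payload
            entries.append({"embed_text": question, "paragraph_id": i})
    return paragraphs, entries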

"Does a larger context window mean I don't need chunking?"

No.

  • Latency: Processing 1M tokens takes time and money
  • Focus: "Needle in a haystack" performance degrades with context length
  • Analogy: "Amazon could score every product for every user, but 100ms latency costs 1% revenue. We still need efficient retrieval."

Next Steps