Chunking Strategies

Split documents into optimal chunks for embedding and retrieval in RAG systems.

Overview

Chunking is the process of splitting documents into smaller pieces before embedding. Chunk size is one of the most important hyperparameters in a RAG system: it determines both what the retriever can match and how much context the LLM sees.

Why Chunking Matters

  • Too small: Loses context, poor retrieval
  • Too large: Exceeds context window, dilutes relevance
  • Just right: Balances context and specificity

Fixed-Size Chunking

Simplest approach: split by character or token count.

def chunk_by_tokens(text, chunk_size=512, overlap=50):
    import tiktoken
    
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    
    chunks = []
    # Step by chunk_size - overlap so consecutive chunks share `overlap` tokens
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunk_text = enc.decode(chunk_tokens)
        chunks.append(chunk_text)
    
    return chunks

Pros: Simple, predictable
Cons: Breaks mid-sentence, ignores document structure

Recursive Character Splitting

Split by natural boundaries (paragraphs, sentences).

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document)

Pros: Respects natural boundaries
Cons: Still arbitrary, ignores semantics

Semantic Chunking

Split based on meaning, not length.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunk(text, threshold=0.5):
    sentences = text.split('. ')
    # Normalize so the dot product below equals cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)
    
    chunks = []
    current_chunk = [sentences[0]]
    
    for i in range(1, len(sentences)):
        # Calculate similarity with previous sentence
        similarity = np.dot(embeddings[i], embeddings[i-1])
        
        if similarity < threshold:
            # Start new chunk
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    
    chunks.append('. '.join(current_chunk))
    return chunks

Pros: Semantically coherent chunks
Cons: Slower, variable chunk sizes

Document-Aware Chunking

Respect document structure (headings, sections).

def chunk_by_sections(markdown_text):
    import re
    
    # Split on Markdown headings, keeping the heading text (capture group)
    sections = re.split(r'(?m)^(#{1,6} .+)$', markdown_text)
    
    chunks = []
    current_heading = ""
    
    for section in sections:
        if not section.strip():
            continue  # skip empty fragments produced by the split
        if section.startswith('#'):
            current_heading = section
        else:
            # Include the heading in the chunk for context
            chunk = f"{current_heading}\n{section.strip()}"
            chunks.append(chunk)
    
    return chunks

Chunk Size Guidelines

Use Case  | Chunk Size       | Overlap (tokens) | Reasoning
Q&A       | 256-512 tokens   | 50-100           | Short, specific answers
Long-form | 1000-1500 tokens | 200-300          | Need more context
Code      | 500-1000 tokens  | 100-200          | Function/class level
Legal     | 1500-2000 tokens | 300-400          | Preserve clauses
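
These presets can be wired straight into the token-based splitter above; a minimal sketch (the values pick the upper end of each range and are starting points, not requirements):

CHUNK_PRESETS = {
    # (chunk_size, overlap) in tokens, taken from the table above
    "qa":        (512, 100),
    "long_form": (1500, 300),
    "code":      (1000, 200),
    "legal":     (2000, 400),
}

def chunk_for_use_case(text, use_case="qa"):
    chunk_size, overlap = CHUNK_PRESETS[use_case]
    return chunk_by_tokens(text, chunk_size=chunk_size, overlap=overlap)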

Overlap Strategy

def chunk_with_overlap(text, chunk_size=500, overlap=100):
    # Word-level variant: consecutive chunks share `overlap` words
    words = text.split()
    chunks = []
    
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    
    return chunks

Why overlap?

  • Prevents information loss at boundaries
  • Improves recall for edge cases
  • Typical overlap: 10-20% of chunk size

Advanced: Parent Document Retrieval

Store small chunks for retrieval, large chunks for context.

class ParentDocumentRetriever:
    def __init__(self):
        self.small_chunks = []  # For embedding
        self.large_chunks = []  # For LLM context
        self.chunk_to_parent = {}  # Mapping
    
    def add_document(self, text):
        # Create large chunks (parents)
        large = chunk_by_tokens(text, chunk_size=2000)
        parent_offset = len(self.large_chunks)  # global index of this doc's first parent
        
        # Create small chunks (children) that point back to their parent
        for i, parent in enumerate(large):
            small = chunk_by_tokens(parent, chunk_size=500)
            
            for child in small:
                child_id = len(self.small_chunks)
                self.small_chunks.append(child)
                self.chunk_to_parent[child_id] = parent_offset + i
        
        self.large_chunks.extend(large)
    
    def retrieve(self, query, k=5):
        # Retrieve small chunks (vector_search is a placeholder for your
        # vector store's similarity search; results expose a chunk id)
        small_results = vector_search(query, self.small_chunks, k=k)
        
        # Map back to parents, deduplicating when siblings hit the same parent
        parent_ids = dict.fromkeys(self.chunk_to_parent[r.id] for r in small_results)
        return [self.large_chunks[pid] for pid in parent_ids]

Evaluation

Test different strategies on your data:

def evaluate_chunking(documents, queries, ground_truth):
    strategies = {
        'fixed': lambda d: chunk_by_tokens(d, 512),
        'recursive': lambda d: RecursiveCharacterTextSplitter().split_text(d),
        'semantic': lambda d: semantic_chunk(d)
    }
    
    results = {}
    for name, strategy in strategies.items():
        chunks = [strategy(doc) for doc in documents]  # one chunk list per document
        # measure_recall is a placeholder for your retrieval-recall metric
        recall = measure_recall(chunks, queries, ground_truth)
        results[name] = recall
    
    return results

Best Practices

  1. Start with 512 tokens - Good default for most use cases
  2. Add 10-20% overlap - Prevents boundary issues
  3. Preserve metadata - Track source document, page number (see the sketch below)
  4. Test on your data - Optimal size varies by domain
  5. Monitor in production - Track retrieval quality metrics
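
One way to follow practice 3 is to store each chunk as a record that carries its source and page alongside the text; a minimal sketch (the field names are illustrative):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    text: str
    source: str          # originating document (path, URL, or title)
    page: Optional[int]  # page number, if the source is paginated
    position: int        # index of the chunk within the document

def chunk_document(text, source, page=None):
    return [
        Chunk(text=piece, source=source, page=page, position=i)
        for i, piece in enumerate(chunk_by_tokens(text))
    ]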

Strategy for Long Documents (1000+ pages)

From Production: "If you have extremely long documents, start with a page-level approach to determine if answers typically exist on a single page or span multiple pages."
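
A minimal sketch of that page-level pass, assuming pypdf for extraction (any document parser works): each page becomes one chunk tagged with its page number, so you can check whether single-page retrieval answers your queries before investing in anything finer-grained.

from pypdf import PdfReader

def chunk_by_pages(pdf_path):
    """One chunk per page, tagged with its 1-based page number."""
    reader = PdfReader(pdf_path)
    return [
        {"page": i + 1, "text": page.extract_text() or ""}
        for i, page in enumerate(reader.pages)
    ]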

The RAPTOR Approach

For very long documents (legal, technical manuals), simple chunking fails because concepts span pages.

  1. Cluster Chunks: Embed every page/chunk and run a clustering model
  2. Summarize Clusters: Identify concepts that span multiple pages and summarize them
  3. Retrieve Summaries: Use the summaries for retrieval
  4. Expand Context: If a summary is retrieved, include all related pages in the context (sketched below)

Cost Note: This preprocessing might cost ~$10 of LLM calls per document, but for "evergreen" documents (like tax law) that don't change for years, it's a high-ROI investment.
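
A rough sketch of steps 1-2 (and the bookkeeping needed for step 4), assuming sentence-transformers for embeddings, scikit-learn KMeans for clustering, and a summarize() callable backed by an LLM of your choice (a placeholder, not a real API):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')

def build_cluster_summaries(pages, summarize, n_clusters=10):
    # Step 1: embed every page/chunk and cluster the embeddings
    embeddings = model.encode(pages)
    labels = KMeans(n_clusters=n_clusters).fit_predict(embeddings)
    
    summaries = []
    for cluster_id in range(n_clusters):
        members = [p for p, label in zip(pages, labels) if label == cluster_id]
        # Step 2: summarize the concept that spans these pages (LLM call, placeholder)
        summary = summarize("\n\n".join(members))
        # Keep the member pages so a retrieved summary can be expanded (step 4)
        summaries.append({"summary": summary, "pages": members})
    return summaries

Retrieval (step 3) then embeds and searches the summaries; when a summary matches, its member pages are added to the LLM context.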

Common Questions

"How do I handle metadata in chunks?"

Include it in the chunk text.

  • Why: Allows answering questions like "who wrote this" or "when was this updated"
  • How: Prepend metadata strings (sketched below): Title: Annual Report 2023 | Author: Jane Doe | Date: 2023-12-01 \n\n [Content...]
  • Benefit: Enables function calling (e.g., "Emily wrote this, here is her email")
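
A minimal sketch of that prepending step (the separator and field names are just conventions):

def prepend_metadata(chunk_text, title, author, date):
    # Embed the metadata together with the content so both are retrievable
    header = f"Title: {title} | Author: {author} | Date: {date}"
    return f"{header}\n\n{chunk_text}"

chunk = prepend_metadata(
    "[Content...]",
    title="Annual Report 2023", author="Jane Doe", date="2023-12-01",
)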

"Should I use semantic chunking?"

Yes, for single complex documents.

  • Use Case: Proposals, RFPs, or contracts where requirements for different disciplines are scattered
  • Technique: Generate synthetic questions per paragraph ("What requirements are mentioned here?") rather than just splitting by tokens (see the sketch below)
  • Goal: Separate paragraphs based on semantic meaning rather than arbitrary length
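
A sketch of that indexing step, assuming a generate_questions() helper backed by an LLM (a placeholder here): each paragraph is indexed under the synthetic questions it answers, and a match on a question retrieves its paragraph.

def index_with_synthetic_questions(document, generate_questions):
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    
    entries = []
    for i, paragraph in enumerate(paragraphs):
        # e.g. "What requirements are mentioned here?" -> a handful of questions
        questions = generate_questions(paragraph)  # LLM call, placeholder
        for question in questions:
            # Embed the question, but store the paragraph id as the payload
            entries.append({"embed_text": question, "paragraph_id": i})
    return paragraphs, entries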

"Does a larger context window mean I don't need chunking?"

No.

  • Latency: Processing 1M tokens takes time and money
  • Focus: "Needle in a haystack" performance degrades with context length
  • Analogy: "Amazon could score every product for every user, but 100ms latency costs 1% revenue. We still need efficient retrieval."

Next Steps