Document Parsing

Extract text and structure from PDFs, HTML, Word documents, and other formats for RAG systems.

Overview

Document parsing is the critical first step in RAG pipelines. Poor parsing leads to garbage in your vector database, no matter how good your embeddings are.

Common Document Types

PDF Documents

Challenges:

  • Scanned PDFs (images, need OCR; see the quick check after this list)
  • Multi-column layouts
  • Tables and figures
  • Headers/footers/page numbers
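
A quick way to triage the first challenge is to check whether a PDF has an extractable text layer at all. A minimal sketch using PyMuPDF (covered under Best Tools below); the per-page character threshold is an illustrative assumption:

import fitz  # PyMuPDF

def needs_ocr(pdf_path, min_chars_per_page=25):
    # Heuristic: if the text layer is nearly empty, the PDF is probably scanned
    doc = fitz.open(pdf_path)
    total_chars = sum(len(page.get_text()) for page in doc)
    return total_chars < min_chars_per_page * len(doc)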

Best Tools:

# 1. PyMuPDF (fastest, good for text-based PDFs)
import fitz  # PyMuPDF

def parse_pdf_pymupdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

# 2. pdfplumber (best for tables)
import pdfplumber

def parse_pdf_with_tables(pdf_path):
    all_text = []
    all_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            all_text.append(page.extract_text() or "")
            all_tables.extend(page.extract_tables())
    # Return text from every page plus all extracted tables
    return "\n".join(all_text), all_tables

# 3. Unstructured (best for complex layouts)
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")

HTML & Web Pages

from bs4 import BeautifulSoup
import requests

def parse_html(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Remove script and style elements
    for script in soup(["script", "style", "nav", "footer"]):
        script.decompose()
    
    # Get text
    text = soup.get_text()
    
    # Clean up whitespace
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    
    return text

Word Documents

from docx import Document

def parse_docx(file_path):
    doc = Document(file_path)
    
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    
    # Extract tables
    for table in doc.tables:
        for row in table.rows:
            row_data = [cell.text for cell in row.cells]
            full_text.append(" | ".join(row_data))
    
    return '\n'.join(full_text)

OCR for Scanned Documents

import pytesseract
from pdf2image import convert_from_path
from PIL import Image

def ocr_pdf(pdf_path):
    # Convert PDF to images
    images = convert_from_path(pdf_path)
    
    text = ""
    for i, image in enumerate(images):
        # Perform OCR
        page_text = pytesseract.image_to_string(image)
        text += f"\n--- Page {i+1} ---\n{page_text}"
    
    return text

# For better accuracy, use cloud OCR
from google.cloud import vision

def google_ocr(image_path):
    client = vision.ImageAnnotatorClient()
    
    with open(image_path, 'rb') as image_file:
        content = image_file.read()
    
    image = vision.Image(content=content)
    response = client.text_detection(image=image)
    
    return response.full_text_annotation.text

Handling Tables

Tables are critical for many domains (finance, research, legal):

import camelot  # Best for PDF tables

# Extract tables from PDF
tables = camelot.read_pdf('document.pdf', pages='all')

for table in tables:
    df = table.df  # Pandas DataFrame
    
    # Convert to markdown for better embedding
    markdown_table = df.to_markdown(index=False)
    
    # Or convert each row to natural language (column names are illustrative)
    row_sentences = [
        f"The {row['Category']} has a value of {row['Value']}"
        for _, row in df.iterrows()
    ]

Metadata Extraction

Extract metadata for better filtering:

def extract_metadata(file_path):
    import os
    from datetime import datetime
    
    metadata = {
        'filename': os.path.basename(file_path),
        'file_size': os.path.getsize(file_path),
        'created_at': datetime.fromtimestamp(os.path.getctime(file_path)),
        'modified_at': datetime.fromtimestamp(os.path.getmtime(file_path))
    }
    
    # For PDFs, extract PDF metadata
    if file_path.endswith('.pdf'):
        import fitz
        doc = fitz.open(file_path)
        metadata.update({
            'title': doc.metadata.get('title', ''),
            'author': doc.metadata.get('author', ''),
            'subject': doc.metadata.get('subject', ''),
            'num_pages': len(doc)
        })
    
    return metadata

Production Pipeline

from pathlib import Path
from typing import List, Dict

class DocumentParser:
    def __init__(self):
        # Each handler wraps one of the format-specific functions defined above
        # (parse_pdf, parse_html, parse_docx); parse_txt can simply read the file
        self.parsers = {
            '.pdf': self.parse_pdf,
            '.html': self.parse_html,
            '.docx': self.parse_docx,
            '.txt': self.parse_txt
        }
    
    def parse(self, file_path: str) -> Dict:
        """Parse document and return text + metadata"""
        path = Path(file_path)
        extension = path.suffix.lower()
        
        if extension not in self.parsers:
            raise ValueError(f"Unsupported file type: {extension}")
        
        text = self.parsers[extension](file_path)
        metadata = self.extract_metadata(file_path)
        
        return {
            'text': text,
            'metadata': metadata,
            'source': str(path)
        }
    
    def parse_batch(self, directory: str) -> List[Dict]:
        """Parse all documents in a directory"""
        results = []
        for file_path in Path(directory).rglob('*'):
            if file_path.is_file():
                try:
                    result = self.parse(str(file_path))
                    results.append(result)
                except Exception as e:
                    print(f"Error parsing {file_path}: {e}")
        return results
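
A brief usage sketch; the directory name is illustrative, and the per-format methods and extract_metadata are assumed to wrap the standalone functions shown earlier:

parser = DocumentParser()
documents = parser.parse_batch("./corpus")
print(f"Parsed {len(documents)} documents")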

Best Practices

1. Preserve Structure

# Bad: Lose all structure
text = doc.get_text()

# Good: Preserve headings and sections
sections = []
for element in elements:
    if element.category == "Title":
        sections.append({"title": element.text, "content": []})
    elif element.category == "NarrativeText":
        # Guard against body text appearing before the first title
        if not sections:
            sections.append({"title": "", "content": []})
        sections[-1]["content"].append(element.text)

2. Handle Errors Gracefully

import logging

def safe_parse(file_path):
    try:
        return parse_pdf(file_path)
    except Exception as e:
        logging.error(f"Failed to parse {file_path}: {e}")
        # Fallback to OCR for scanned or corrupted files
        try:
            return ocr_pdf(file_path)
        except Exception:
            return None

3. Clean Text

import re

def clean_text(text):
    # Remove headers/footers first, while line breaks still exist (domain-specific)
    text = re.sub(r'Company Confidential.*?\n', '', text)
    
    # Remove page numbers
    text = re.sub(r'Page \d+', '', text)
    
    # Collapse remaining whitespace last
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

Tool Comparison

Tool          | Best For         | Speed | Accuracy | Tables
PyMuPDF       | Text-based PDFs  | ⚡⚡⚡   | ⭐⭐      | ⭐
pdfplumber    | PDFs with tables | ⚡⚡    | ⭐⭐⭐    | ⭐⭐⭐
Unstructured  | Complex layouts  |       | ⭐⭐⭐⭐   | ⭐⭐⭐
Tesseract OCR | Scanned docs     |       | ⭐⭐      |
Google Vision | Scanned docs     | ⚡⚡    | ⭐⭐⭐⭐   | ⭐⭐

Parsing Strategy: Build vs. Buy

From Production: "I've leaned on parsing vendors because they're the most incentivized to have good and accurate labels. This lets me focus on retrieval, which is what will create the most value for my specific use case."

When to Use Vendors (Reducto, Llama Parse, etc.)

  • Complex Layouts: Multi-column PDFs, scientific papers, financial reports
  • Tables & Charts: When you need to preserve row/column structure or extract data from charts
  • Speed: Commercial tools are often optimized for throughput
  • Maintenance: PDF standards change; vendors handle the updates

When to Build Your Own

  • Simple Documents: Text-heavy contracts or books
  • Cost Sensitivity: High volume of simple documents where API costs would be prohibitive
  • Data Privacy: Strictly air-gapped requirements where data cannot leave your VPC

Common Questions

"How do I evaluate parsing quality?"

Evaluate parsing separately from retrieval. If you parse an "8" as a "0" and generate synthetic data from that, you won't capture the error in your RAG evaluation.

Evaluation Checklist:

  1. OCR Accuracy: Is text parsed correctly? (e.g., "0" vs "8", "l" vs "1")
  2. Bounding Boxes: Are tables fully recognized as single units?
  3. Reading Order: Does the text flow logically across columns?
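
One lightweight way to spot-check OCR accuracy is to compare parser output against a handful of hand-transcribed pages. A minimal sketch using Python's standard difflib; the file paths and the 0.95 threshold are illustrative:

from difflib import SequenceMatcher

def char_accuracy(parsed: str, ground_truth: str) -> float:
    # Approximate character-level similarity between parsed text and a hand-checked transcript
    return SequenceMatcher(None, parsed, ground_truth).ratio()

# Spot-check a few representative pages (hypothetical paths)
samples = [("parsed/page_12.txt", "truth/page_12.txt")]
for parsed_path, truth_path in samples:
    score = char_accuracy(open(parsed_path).read(), open(truth_path).read())
    if score < 0.95:  # illustrative threshold
        print(f"Low parse fidelity ({score:.2f}) for {parsed_path}")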

"How do I handle complex layouts like PowerPoint?"

For documents with complex layouts, parsing and chunking are linked.

  • Approach 1: Use multimodal models (e.g., Gemini 2, GPT-4o) to describe each slide
  • Approach 2: Use specialized tools like Reducto (often higher accuracy: ~0.9 vs ~0.84 for general models)
  • Approach 3: Convert to Markdown first, then chunk based on headers
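
A minimal sketch of Approach 3, splitting Markdown on headers so each chunk stays inside one section; the function name and regex are illustrative, not from a specific library:

import re

def chunk_markdown_by_headers(markdown: str):
    # Start a new chunk whenever a Markdown header (# to ######) begins a line
    chunks = []
    current = []
    for line in markdown.splitlines():
        if re.match(r'^#{1,6}\s', line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks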

"Should I use a general multimodal model for parsing?"

General models (Gemini 1.5 Pro, GPT-4o) are closing the gap, but specialized tools still often win on precision.

  • Gemini 1.5 Pro: Great for "good enough" parsing at scale
  • Specialized Tools: Necessary for high-precision extraction (e.g., financial data)
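
As a sketch of what "good enough" parsing with a general model can look like, here is a hedged example using the OpenAI Python SDK to transcribe a page image to Markdown; the model name and prompt are illustrative assumptions:

import base64
from openai import OpenAI

client = OpenAI()

def describe_page(image_path: str) -> str:
    # Ask a general multimodal model to transcribe one page image to Markdown
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page to Markdown, preserving headings and tables."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content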

Next Steps