Document Parsing
Extract text and structure from PDFs, HTML, Word documents, and other formats for RAG systems.
Overview
Document parsing is the critical first step in RAG pipelines. Poor parsing leads to garbage in your vector database, no matter how good your embeddings are.
Common Document Types
PDF Documents
Challenges:
- Scanned PDFs (images, need OCR)
- Multi-column layouts
- Tables and figures
- Headers/footers/page numbers
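Before picking a parser, it helps to check whether a PDF has an extractable text layer at all, so scanned files can be routed to OCR. A minimal sketch using PyMuPDF; the character threshold is an illustrative assumption you should tune for your corpus:

```python
import fitz  # PyMuPDF

def is_scanned_pdf(pdf_path, min_chars_per_page=25):
    """Heuristic: if pages yield almost no text, the PDF is likely scanned."""
    doc = fitz.open(pdf_path)
    total_chars = sum(len(page.get_text()) for page in doc)
    # min_chars_per_page is an assumed threshold, not a library default
    return total_chars < min_chars_per_page * len(doc)
```

Scanned files can then be sent to the OCR pipeline described below, while text-based PDFs go straight to a text extractor.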
Best Tools:
```python
# 1. PyMuPDF (fastest, good for text-based PDFs)
import fitz  # PyMuPDF

def parse_pdf_pymupdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text
```
```python
# 2. pdfplumber (best for tables)
import pdfplumber

def parse_pdf_with_tables(pdf_path):
    all_text, all_tables = [], []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Collect text and tables from every page, not just the last one
            all_text.append(page.extract_text() or "")
            all_tables.extend(page.extract_tables())
    return "\n".join(all_text), all_tables
```
```python
# 3. Unstructured (best for complex layouts)
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")
```
HTML & Web Pages
```python
from bs4 import BeautifulSoup
import requests

def parse_html(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Remove script, style, and navigation boilerplate
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    # Get text
    text = soup.get_text()

    # Clean up whitespace: strip each line, then split on double spaces
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
```
Word Documents
```python
from docx import Document

def parse_docx(file_path):
    doc = Document(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)

    # Extract tables
    for table in doc.tables:
        for row in table.rows:
            row_data = [cell.text for cell in row.cells]
            full_text.append(" | ".join(row_data))

    return '\n'.join(full_text)
```
OCR for Scanned Documents
```python
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(pdf_path):
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)
    text = ""
    for i, image in enumerate(images):
        # Perform OCR on each page image
        page_text = pytesseract.image_to_string(image)
        text += f"\n--- Page {i+1} ---\n{page_text}"
    return text
```
```python
# For better accuracy, use cloud OCR
from google.cloud import vision

def google_ocr(image_path):
    client = vision.ImageAnnotatorClient()
    with open(image_path, 'rb') as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    response = client.text_detection(image=image)
    return response.full_text_annotation.text
```
Handling Tables
Tables are critical for many domains (finance, research, legal):
```python
import camelot  # Best for PDF tables

# Extract tables from the PDF
tables = camelot.read_pdf('document.pdf', pages='all')

for table in tables:
    df = table.df  # pandas DataFrame

    # Convert to Markdown for better embedding
    markdown_table = df.to_markdown(index=False)

    # Or convert to natural language (column names here are example-specific)
    for _, row in df.iterrows():
        text = f"The {row['Category']} has a value of {row['Value']}"
```
Metadata Extraction
Extract metadata for better filtering:
```python
import os
from datetime import datetime

def extract_metadata(file_path):
    metadata = {
        'filename': os.path.basename(file_path),
        'file_size': os.path.getsize(file_path),
        'created_at': datetime.fromtimestamp(os.path.getctime(file_path)),
        'modified_at': datetime.fromtimestamp(os.path.getmtime(file_path))
    }

    # For PDFs, also pull the embedded PDF metadata
    if file_path.endswith('.pdf'):
        import fitz
        doc = fitz.open(file_path)
        metadata.update({
            'title': doc.metadata.get('title', ''),
            'author': doc.metadata.get('author', ''),
            'subject': doc.metadata.get('subject', ''),
            'num_pages': len(doc)
        })

    return metadata
```
Production Pipeline
```python
from pathlib import Path
from typing import List, Dict

class DocumentParser:
    def __init__(self):
        # parse_pdf, parse_html, parse_docx, parse_txt, and extract_metadata
        # are the helpers defined above, adapted as methods of this class
        self.parsers = {
            '.pdf': self.parse_pdf,
            '.html': self.parse_html,
            '.docx': self.parse_docx,
            '.txt': self.parse_txt
        }

    def parse(self, file_path: str) -> Dict:
        """Parse a document and return its text plus metadata."""
        path = Path(file_path)
        extension = path.suffix.lower()

        if extension not in self.parsers:
            raise ValueError(f"Unsupported file type: {extension}")

        text = self.parsers[extension](file_path)
        metadata = self.extract_metadata(file_path)

        return {
            'text': text,
            'metadata': metadata,
            'source': str(path)
        }

    def parse_batch(self, directory: str) -> List[Dict]:
        """Parse all documents in a directory tree."""
        results = []
        for file_path in Path(directory).rglob('*'):
            if file_path.is_file():
                try:
                    result = self.parse(str(file_path))
                    results.append(result)
                except Exception as e:
                    print(f"Error parsing {file_path}: {e}")
        return results
```
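Typical usage of the pipeline, with illustrative paths:

```python
parser = DocumentParser()

# Parse a single file (path is illustrative)
doc = parser.parse("reports/q3_summary.pdf")
print(doc['metadata']['filename'], len(doc['text']))

# Parse everything under a directory
docs = parser.parse_batch("./documents")
print(f"Parsed {len(docs)} documents")
```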
Best Practices
1. Preserve Structure
```python
# Bad: lose all structure
text = doc.get_text()

# Good: preserve headings and sections
sections = []
for element in elements:
    if element.category == "Title":
        sections.append({"title": element.text, "content": []})
    elif element.category == "NarrativeText" and sections:
        # Guard against narrative text that appears before the first title
        sections[-1]["content"].append(element.text)
```
2. Handle Errors Gracefully
```python
import logging

def safe_parse(file_path):
    try:
        return parse_pdf(file_path)
    except Exception as e:
        logging.error(f"Failed to parse {file_path}: {e}")
        # Fall back to OCR
        try:
            return ocr_pdf(file_path)
        except Exception:
            return None
```
3. Clean Text
```python
import re

def clean_text(text):
    # Remove page numbers
    text = re.sub(r'Page \d+', '', text)
    # Remove headers/footers (domain-specific)
    text = re.sub(r'Company Confidential.*?\n', '', text)
    # Collapse extra whitespace last, since it removes the newlines used above
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
```
Tool Comparison
| Tool | Best For | Speed | Accuracy | Tables |
|---|---|---|---|---|
| PyMuPDF | Text-based PDFs | ⚡⚡⚡ | ⭐⭐⭐ | ⭐ |
| pdfplumber | PDFs with tables | ⚡⚡ | ⭐⭐⭐ | ⭐⭐⭐ |
| Unstructured | Complex layouts | ⚡ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Tesseract OCR | Scanned docs | ⚡ | ⭐⭐ | ⭐ |
| Google Vision | Scanned docs | ⚡⚡ | ⭐⭐⭐⭐ | ⭐⭐ |
Parsing Strategy: Build vs. Buy
From Production: "I've leaned on parsing vendors because they're the most incentivized to have good and accurate labels. This lets me focus on retrieval, which is what will create the most value for my specific use case."
When to Use Vendors (Reducto, Llama Parse, etc.)
- Complex Layouts: Multi-column PDFs, scientific papers, financial reports
- Tables & Charts: When you need to preserve row/column structure or extract data from charts
- Speed: Commercial tools are often optimized for throughput
- Maintenance: PDF standards change; vendors handle the updates
When to Build Your Own
- Simple Documents: Text-heavy contracts or books
- Cost Sensitivity: High volume of simple documents where API costs would be prohibitive
- Data Privacy: Strictly air-gapped requirements where data cannot leave your VPC
Common Questions
"How do I evaluate parsing quality?"
Evaluate parsing separately from retrieval. If you parse an "8" as a "0" and generate synthetic data from that, you won't capture the error in your RAG evaluation.
Evaluation Checklist:
- OCR Accuracy: Is text parsed correctly? (e.g., "0" vs "8", "l" vs "1")
- Bounding Boxes: Are tables fully recognized as single units?
- Reading Order: Does the text flow logically across columns?
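As a starting point, you can score a parser against a small set of hand-transcribed pages. A minimal sketch using difflib from the standard library; the labeled sample and file names are hypothetical:

```python
import difflib

def parsing_accuracy(parsed_text: str, ground_truth: str) -> float:
    """Character-level similarity between parser output and a hand-checked transcript."""
    return difflib.SequenceMatcher(None, parsed_text, ground_truth).ratio()

# Hypothetical labeled sample: file -> manually verified text for one page
labeled_pages = {
    "invoice_001.pdf": "Invoice total: $8,450.00 ...",
}

for path, truth in labeled_pages.items():
    score = parsing_accuracy(parse_pdf_pymupdf(path), truth)
    print(f"{path}: {score:.2f}")
```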
"How do I handle complex layouts like PowerPoint?"
For documents with complex layouts, parsing and chunking are linked.
- Approach 1: Use multimodal models (e.g., Gemini 2, GPT-4o) to describe each slide
- Approach 2: Use specialized tools like Reducto (often higher accuracy: ~0.9 vs ~0.84 for general models)
- Approach 3: Convert to Markdown first, then chunk based on headers
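For Approach 3, once a document is converted to Markdown, chunking can simply follow the header structure. A minimal sketch, assuming ATX-style `#` headers:

```python
import re

def chunk_markdown_by_headers(markdown: str):
    """Split a Markdown document into (header, content) chunks on ATX headers."""
    chunks = []
    current_header, current_lines = "Preamble", []
    for line in markdown.splitlines():
        if re.match(r'^#{1,6}\s', line):
            if current_lines:
                chunks.append({"header": current_header, "content": "\n".join(current_lines)})
            current_header, current_lines = line.lstrip('#').strip(), []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append({"header": current_header, "content": "\n".join(current_lines)})
    return chunks
```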
"Should I use a general multimodal model for parsing?"
General models (Gemini 1.5 Pro, GPT-4o) are closing the gap, but specialized tools still often win on precision.
- Gemini 1.5 Pro: Great for "good enough" parsing at scale
- Specialized Tools: Necessary for high-precision extraction (e.g., financial data)
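A minimal sketch of the "good enough at scale" path, sending a rendered page image to a general multimodal model through the OpenAI chat completions API; the model name and prompt are illustrative, and other providers follow the same pattern through their own SDKs:

```python
import base64
from openai import OpenAI

client = OpenAI()

def transcribe_page_image(image_path: str) -> str:
    """Ask a multimodal model to transcribe a page image into Markdown."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of multimodal model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page to Markdown, preserving headings and tables."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```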
Next Steps
- Chunking Strategies - Split parsed text into chunks
- Metadata Filtering - Use extracted metadata for better retrieval