RAG Cost Optimization
Strategies to reduce token usage and infrastructure costs in RAG systems by up to 90%.
Overview
RAG systems can be expensive. Costs accrue from three main sources:
- LLM Inference: Input tokens (context) + Output tokens (generation).
- Embedding: Converting text to vectors.
- Vector Storage: Hosting and searching vectors.
This guide outlines strategies to optimize each layer.
1. Optimizing LLM Costs (The Biggest Spender)
The context window is the most expensive part. Sending 10 retrieved documents (e.g., 5k tokens) for every query adds up fast.
Strategy A: Prompt Compression
Don't send raw chunks. Compress them.
- Summarization: Use a cheaper, faster model (e.g., GPT-3.5-Turbo, Haiku) to summarize retrieved chunks before sending them to the expensive model (e.g., GPT-4).
- Selective Context: Use a re-ranker to pick only the top 3 highly relevant chunks instead of the top 10.
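A minimal sketch of the compression step using the OpenAI Python client; the `compress_chunk` / `build_context` helpers, the prompt, and the model names are illustrative, not a prescribed implementation:

```python
from openai import OpenAI

client = OpenAI()

def compress_chunk(chunk: str, query: str) -> str:
    """Summarize one retrieved chunk with a cheap model,
    keeping only what is relevant to the query."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # cheap summarizer; the expensive model never sees raw chunks
        messages=[
            {"role": "system",
             "content": "Extract only the sentences relevant to the user's question. Be terse."},
            {"role": "user", "content": f"Question: {query}\n\nDocument:\n{chunk}"},
        ],
        max_tokens=150,
    )
    return response.choices[0].message.content

def build_context(chunks: list[str], query: str) -> str:
    # The compressed chunks are what the expensive model (e.g., GPT-4) actually receives.
    return "\n\n".join(compress_chunk(c, query) for c in chunks)
```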
Strategy B: Model Cascading
Not every query needs a PhD-level model.
- Router Pattern: Classify query complexity.
- Simple ("What is the refund policy?"): Route to Llama 3 8B or GPT-3.5.
- Complex ("Compare the liability clauses in these two contracts"): Route to GPT-4o / Claude 3.5 Sonnet.
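One way to sketch the router pattern: a cheap model labels the query, and the label decides which model answers. The model names, the classification prompt, and the `answer` helper are assumptions; a small fine-tuned classifier or even a keyword heuristic works just as well:

```python
from openai import OpenAI

client = OpenAI()

CHEAP_MODEL = "gpt-3.5-turbo"   # or a self-hosted Llama 3 8B endpoint
STRONG_MODEL = "gpt-4o"

def classify_complexity(query: str) -> str:
    """Cheaply label a query as 'simple' or 'complex'."""
    response = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[
            {"role": "system",
             "content": "Label the query as 'simple' (single factual lookup) or "
                        "'complex' (comparison, multi-step reasoning). Answer with one word."},
            {"role": "user", "content": query},
        ],
        max_tokens=3,
    )
    return response.choices[0].message.content.strip().lower()

def answer(query: str, context: str) -> str:
    # Only pay for the strong model when the query actually needs it.
    model = STRONG_MODEL if "complex" in classify_complexity(query) else CHEAP_MODEL
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{context}\n\n{query}"}],
    )
    return response.choices[0].message.content
```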
Strategy C: Caching
The cheapest request is the one you don't make.
- Exact Match Cache: Redis key-value store for identical queries.
- Semantic Cache: Return cached answers for semantically similar queries (e.g., cosine similarity > 0.95).
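An exact-match cache is only a few lines with Redis. This sketch assumes a local Redis instance and an existing `generate(query)` function for the full RAG pipeline; a semantic cache replaces the hash lookup with a vector-similarity search over past queries:

```python
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumes a local Redis

def cached_answer(query: str, generate) -> str:
    """Exact-match cache: identical queries never hit the LLM twice.
    `generate` stands in for your existing RAG pipeline."""
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit
    answer = generate(query)
    r.set(key, answer, ex=60 * 60 * 24)  # expire after 24h so cached answers don't go stale
    return answer
```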
2. Optimizing Embedding Costs
Strategy A: Open Source Models
For most English-language RAG tasks, open-source models are highly competitive and free to run.
- Switch: Move from OpenAI `text-embedding-3` ($$) → `bge-m3` or `all-MiniLM-L6-v2` (free/cheap).
- Self-Host: Run on a CPU or a small GPU instance; `all-MiniLM` runs incredibly fast on CPU (see the snippet below).
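For example, loading and running `all-MiniLM-L6-v2` with the `sentence-transformers` library:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dim vectors and runs comfortably on CPU.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["Refunds are processed within 14 days.", "Shipping is free on orders over $50."]
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)
```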
Strategy B: Lazy Embedding
Don't re-embed everything on every minor update.
- Hashing: Hash document content. Only re-embed if the hash changes.
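A sketch of the hashing guard; `docs`, `stored_hashes`, and `embed` are placeholders for your own corpus, hash store, and embedding function:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_documents(docs: dict[str, str], stored_hashes: dict[str, str], embed) -> None:
    """Re-embed only documents whose content hash has changed."""
    for doc_id, text in docs.items():
        h = content_hash(text)
        if stored_hashes.get(doc_id) == h:
            continue  # unchanged: skip the embedding call entirely
        vector = embed(text)
        # ... upsert (doc_id, vector) into the vector store ...
        stored_hashes[doc_id] = h
```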
3. Optimizing Vector Storage
Strategy A: Dimensionality Reduction
Smaller vectors = less RAM = cheaper hosting.
- Matryoshka Embeddings: Newer models (like `text-embedding-3`) allow truncating vectors (e.g., 1536 → 256 dims) with minimal performance loss.
- Binary Quantization: Convert 32-bit floats to single bits, reducing storage by 32x (both are sketched below).
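Both techniques are a few lines of NumPy once you have the float vectors. This is a sketch; truncation only preserves quality for models trained with Matryoshka-style objectives (as `text-embedding-3` was):

```python
import numpy as np

def truncate_matryoshka(vectors: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep only the first `dims` dimensions, then re-normalize for cosine search."""
    truncated = vectors[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """1 bit per dimension (32x smaller than float32); packbits turns 8 dims into 1 byte."""
    return np.packbits(vectors > 0, axis=1)
```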
Strategy B: Serverless Vector DBs
Don't pay for idle time.
- Use: LanceDB (runs on S3/Disk), Pinecone Serverless, or Neon (Postgres + pgvector).
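LanceDB, for instance, needs no running server at all. The snippet below writes a table to local disk; an S3 URI works the same way (the bucket name here is illustrative):

```python
import lancedb

# Data lives as files on disk or S3, so there is no always-on cluster to pay for.
db = lancedb.connect("./lancedb")  # or lancedb.connect("s3://my-bucket/rag")
table = db.create_table(
    "docs",
    data=[{"vector": [0.1] * 384, "text": "Refunds are processed within 14 days."}],
)
hits = table.search([0.1] * 384).limit(3).to_list()
```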
Cost Analysis Example
Scenario: 1,000 queries/day. 5,000 tokens context per query.
Unoptimized (GPT-4o):
- Input: 1k * 5k = 5M tokens/day.
- Price: ~$5/1M tokens.
- Daily Cost: $25.
- Yearly Cost: ~$9,000.
Optimized (GPT-4o-mini + Re-ranking):
- Reduce context to 2k tokens per query (via tighter re-ranking).
- Input: 1k * 2k = 2M tokens/day.
- Price: ~$0.15/1M tokens.
- Daily Cost: $0.30.
- Yearly Cost: ~$110.
Savings: ~98% reduction.
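The arithmetic behind these numbers, as a quick sanity check:

```python
QUERIES_PER_DAY = 1_000

def yearly_cost(context_tokens: int, price_per_million_usd: float) -> float:
    daily_tokens = QUERIES_PER_DAY * context_tokens
    return daily_tokens / 1_000_000 * price_per_million_usd * 365

print(yearly_cost(5_000, 5.00))   # unoptimized GPT-4o:    ~$9,125/year
print(yearly_cost(2_000, 0.15))   # optimized GPT-4o-mini: ~$110/year
```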
Checklist for Cost Reduction
- Re-ranking implemented to reduce `top_k`.
- Caching enabled for frequent queries.
- Open Source Embeddings evaluated.
- Model Router set up for simple vs. complex queries.
- Prompt Engineering to reduce verbose system instructions.
Next Steps
- Context Window Management - Technical details on handling tokens.
- Choosing Embedding Models - Performance vs Cost trade-offs.