RAG Cost Optimization

Strategies to cut token usage and infrastructure costs in RAG systems, often by 90% or more.

Overview

RAG systems can be expensive. Costs accrue from three main sources:

  1. LLM Inference: Input tokens (context) + Output tokens (generation).
  2. Embedding: Converting text to vectors.
  3. Vector Storage: Hosting and searching vectors.

This guide outlines strategies to optimize each layer.

1. Optimizing LLM Costs (The Biggest Spender)

Retrieved context is usually the largest line item: sending 10 retrieved chunks (e.g., 5k tokens) with every query adds up fast.

Strategy A: Prompt Compression

Don't send raw chunks. Compress them.

  • Summarization: Use a cheaper, faster model (e.g., GPT-3.5-Turbo, Haiku) to summarize retrieved chunks before sending them to the expensive model (e.g., GPT-4).
  • Selective Context: Use a re-ranker to pick only the top 3 highly relevant chunks instead of the top 10.
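A minimal sketch of selective context via re-ranking, assuming the sentence-transformers package is installed; the cross-encoder model name below is just one common choice, swap in your own:

```python
# Sketch: retrieve 10 chunks as usual, but keep only the best few for the prompt.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def compress_context(query: str, chunks: list[str], keep: int = 3) -> list[str]:
    """Score each retrieved chunk against the query and keep only the top `keep`."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]

# Usage: you still pay for retrieval, but only the top 3 chunks enter the prompt.
# context = "\n\n".join(compress_context(user_query, retrieved_chunks))
```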

Strategy B: Model Cascading

Not every query needs a PhD-level model.

  • Router Pattern: Classify query complexity.
    • Simple ("What is the refund policy?"): Route to Llama 3 8B or GPT-3.5.
    • Complex ("Compare the liability clauses in these two contracts"): Route to GPT-4o / Claude 3.5 Sonnet.
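A minimal sketch of the router pattern, assuming the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment; the model names mirror the examples above, and is_complex() is a deliberately crude placeholder classifier:

```python
from openai import OpenAI

client = OpenAI()
CHEAP_MODEL = "gpt-3.5-turbo"   # simple queries
EXPENSIVE_MODEL = "gpt-4o"      # complex queries

def is_complex(query: str) -> bool:
    """Crude heuristic; a tiny LLM or fine-tuned classifier works better in practice."""
    keywords = ("compare", "analyze", "liability", "multi-step", "trade-off")
    return len(query.split()) > 40 or any(k in query.lower() for k in keywords)

def answer(query: str, context: str) -> str:
    model = EXPENSIVE_MODEL if is_complex(query) else CHEAP_MODEL
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```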

Strategy C: Caching

The cheapest request is the one you don't make.

  • Exact Match Cache: Redis key-value store for identical queries.
  • Semantic Cache: Return cached answers for semantically similar queries (e.g., cosine similarity > 0.95).
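A minimal sketch combining both layers, assuming redis-py and numpy; embed() stands in for whatever embedding function your pipeline already uses, and the semantic cache is kept in memory purely for illustration:

```python
import hashlib
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
semantic_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def cache_key(query: str) -> str:
    return "rag:answer:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()

def lookup(query: str, embed, threshold: float = 0.95) -> str | None:
    # 1) Exact match: identical (normalized) query text.
    if (hit := r.get(cache_key(query))) is not None:
        return hit
    # 2) Semantic match: cosine similarity above the threshold.
    q = embed(query)
    for vec, answer in semantic_cache:
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= threshold:
            return answer
    return None

def store(query: str, answer: str, embed, ttl: int = 86400) -> None:
    r.set(cache_key(query), answer, ex=ttl)          # exact-match layer with expiry
    semantic_cache.append((embed(query), answer))    # semantic layer
```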

2. Optimizing Embedding Costs

Strategy A: Open Source Models

For most English-language RAG tasks, open-source models are highly competitive and free to run.

  • Switch: From OpenAI text-embedding-3 ($$) → bge-m3 or all-MiniLM-L6-v2 (Free/Cheap).
  • Self-Host: Run on CPU or small GPU instance. all-MiniLM runs incredibly fast on CPU.
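A minimal sketch of the switch, assuming the sentence-transformers package; all-MiniLM-L6-v2 produces 384-dimension vectors and runs comfortably on CPU:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]):
    # normalize_embeddings=True lets you use a plain dot product as cosine similarity.
    return model.encode(texts, normalize_embeddings=True)

vectors = embed(["What is the refund policy?", "Shipping takes 3-5 business days."])
print(vectors.shape)  # (2, 384)
```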

Strategy B: Lazy Embedding

Don't re-embed everything on every minor update.

  • Hashing: Hash document content. Only re-embed if the hash changes.
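A minimal sketch of hash-based change detection; in a real pipeline the hash map would live in your metadata store rather than in memory:

```python
import hashlib

seen_hashes: dict[str, str] = {}  # doc_id -> content hash

def needs_embedding(doc_id: str, content: str) -> bool:
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if seen_hashes.get(doc_id) == digest:
        return False            # unchanged: reuse the stored vector
    seen_hashes[doc_id] = digest
    return True                 # new or modified: re-embed this document
```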

3. Optimizing Vector Storage

Strategy A: Dimensionality Reduction

Smaller vectors = less RAM = cheaper hosting.

  • Matryoshka Embeddings: New models (like text-embedding-3) allow truncating vectors (e.g., 1536 → 256 dims) with minimal performance loss.
  • Binary Quantization: Convert each 32-bit float to a single bit. Reduces storage by 32x.
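A minimal sketch of both techniques with numpy; note that truncation only preserves quality if the model was trained Matryoshka-style (e.g., the text-embedding-3 family):

```python
import numpy as np

def truncate(vectors: np.ndarray, dims: int = 256) -> np.ndarray:
    cut = vectors[:, :dims]
    # Re-normalize so cosine similarity still behaves after truncation.
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    # One bit per dimension: positive -> 1, non-positive -> 0 (32x smaller than float32).
    return np.packbits((vectors > 0).astype(np.uint8), axis=1)

vecs = np.random.randn(4, 1536).astype(np.float32)
print(truncate(vecs).shape)         # (4, 256)
print(binary_quantize(vecs).shape)  # (4, 192) -> 1536 bits packed into 192 bytes
```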

Strategy B: Serverless Vector DBs

Don't pay for idle time.

  • Use: LanceDB (runs on S3/Disk), Pinecone Serverless, or Neon (Postgres + pgvector).
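A minimal sketch with LanceDB, which stores data on local disk or object storage instead of a long-running server; the API details below are from memory, so check the LanceDB docs for your version:

```python
import lancedb

db = lancedb.connect("./lancedb")  # could also point at an S3 URI
table = db.create_table(
    "docs",
    data=[
        {"vector": [0.1, 0.3, 0.5], "text": "Refunds are issued within 14 days."},
        {"vector": [0.2, 0.1, 0.9], "text": "Shipping takes 3-5 business days."},
    ],
)

hits = table.search([0.1, 0.3, 0.6]).limit(1).to_list()
print(hits[0]["text"])
```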

Cost Analysis Example

Scenario: 1,000 queries/day. 5,000 tokens context per query.

Unoptimized (GPT-4o):

  • Input: 1k * 5k = 5M tokens/day.
  • Price: ~$5/1M tokens.
  • Daily Cost: $25.
  • Yearly Cost: ~$9,000.

Optimized (GPT-4o-mini + Re-ranking):

  • Reduce context to 2k tokens via re-ranking.
  • Input: 1k * 2k = 2M tokens/day.
  • Price: ~$0.15/1M tokens.
  • Daily Cost: $0.30.
  • Yearly Cost: ~$110.

Savings: ~98% reduction.
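A quick sketch reproducing the back-of-the-envelope numbers above:

```python
def daily_cost(queries_per_day: int, tokens_per_query: int, price_per_1m: float) -> float:
    return queries_per_day * tokens_per_query / 1_000_000 * price_per_1m

unoptimized = daily_cost(1_000, 5_000, 5.00)   # $25.00/day -> ~$9,100/year
optimized = daily_cost(1_000, 2_000, 0.15)     # $0.30/day  -> ~$110/year
print(f"savings: {(1 - optimized / unoptimized):.1%}")  # ~98.8%
```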

Checklist for Cost Reduction

  • Re-ranking implemented to reduce top_k.
  • Caching enabled for frequent queries.
  • Open Source Embeddings evaluated.
  • Model Router set up for simple vs. complex queries.
  • Prompt Engineering to reduce verbose system instructions.

Next Steps