RAG Cost Optimization
Strategies to reduce token usage and infrastructure costs in RAG systems by up to 90%.
Overview
RAG systems can be expensive. Costs accrue from three main sources:
- LLM Inference: Input tokens (context) + Output tokens (generation).
- Embedding: Converting text to vectors.
- Vector Storage: Hosting and searching vectors.
This guide outlines strategies to optimize each layer.
1. Optimizing LLM Costs (The Biggest Spender)
The context window is the most expensive part. Sending 10 retrieved documents (e.g., 5k tokens) for every query adds up fast.
Strategy A: Prompt Compression
Don't send raw chunks. Compress them.
- Summarization: Use a cheaper, faster model (e.g., GPT-3.5-Turbo, Haiku) to summarize retrieved chunks before sending them to the expensive model (e.g., GPT-4).
- Selective Context: Use a re-ranker to pick only the top 3 highly relevant chunks instead of the top 10.
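A minimal sketch of the compression step using the OpenAI Python client; the `compress_chunk` / `build_context` helpers, the prompt, and the model names are illustrative, not a prescribed implementation:

```python
from openai import OpenAI

client = OpenAI()

def compress_chunk(chunk: str, query: str) -> str:
    """Summarize one retrieved chunk with a cheap model,
    keeping only what is relevant to the query."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # cheap summarizer; the expensive model never sees raw chunks
        messages=[
            {"role": "system",
             "content": "Extract only the sentences relevant to the user's question. Be terse."},
            {"role": "user", "content": f"Question: {query}\n\nDocument:\n{chunk}"},
        ],
        max_tokens=150,
    )
    return response.choices[0].message.content

def build_context(chunks: list[str], query: str) -> str:
    # The compressed chunks are what the expensive model (e.g., GPT-4) actually receives.
    return "\n\n".join(compress_chunk(c, query) for c in chunks)
```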
Strategy B: Model Cascading
Not every query needs a PhD-level model.
- Router Pattern: Classify query complexity.
- Simple ("What is the refund policy?"): Route to Llama 3 8B or GPT-3.5.
- Complex ("Compare the liability clauses in these two contracts"): Route to GPT-4o / Claude 3.5 Sonnet.
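One way to sketch the router pattern: a cheap model labels the query, and the label decides which model answers. The model names, the classification prompt, and the `answer` helper are assumptions; a small fine-tuned classifier or even a keyword heuristic works just as well:

```python
from openai import OpenAI

client = OpenAI()

CHEAP_MODEL = "gpt-3.5-turbo"   # or a self-hosted Llama 3 8B endpoint
STRONG_MODEL = "gpt-4o"

def classify_complexity(query: str) -> str:
    """Cheaply label a query as 'simple' or 'complex'."""
    response = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[
            {"role": "system",
             "content": "Label the query as 'simple' (single factual lookup) or "
                        "'complex' (comparison, multi-step reasoning). Answer with one word."},
            {"role": "user", "content": query},
        ],
        max_tokens=3,
    )
    return response.choices[0].message.content.strip().lower()

def answer(query: str, context: str) -> str:
    # Only pay for the strong model when the query actually needs it.
    model = STRONG_MODEL if "complex" in classify_complexity(query) else CHEAP_MODEL
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{context}\n\n{query}"}],
    )
    return response.choices[0].message.content
```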
Strategy C: Caching
The cheapest request is the one you don't make.
- Exact Match Cache: Redis key-value store for identical queries.
- Semantic Cache: Return cached answers for semantically similar queries (e.g., cosine similarity > 0.95).
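An exact-match cache is only a few lines with Redis. This sketch assumes a local Redis instance and an existing `generate(query)` function for the full RAG pipeline; a semantic cache replaces the hash lookup with a vector-similarity search over past queries:

```python
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumes a local Redis

def cached_answer(query: str, generate) -> str:
    """Exact-match cache: identical queries never hit the LLM twice.
    `generate` stands in for your existing RAG pipeline."""
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit
    answer = generate(query)
    r.set(key, answer, ex=60 * 60 * 24)  # expire after 24h so cached answers don't go stale
    return answer
```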
2. Optimizing Embedding Costs
Strategy A: Open Source Models
For most English-language RAG tasks, open-source models are highly competitive and free to run.
- Switch: Move from OpenAI `text-embedding-3` ($$) → `bge-m3` or `all-MiniLM-L6-v2` (free/cheap).
- Self-Host: Run on a CPU or a small GPU instance; `all-MiniLM` runs incredibly fast on CPU (see the snippet below).
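For example, loading and running `all-MiniLM-L6-v2` with the `sentence-transformers` library:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dim vectors and runs comfortably on CPU.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["Refunds are processed within 14 days.", "Shipping is free on orders over $50."]
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)
```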
Strategy B: Lazy Embedding
Don't re-embed everything on every minor update.
- Hashing: Hash document content. Only re-embed if the hash changes.
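A sketch of the hashing guard; `docs`, `stored_hashes`, and `embed` are placeholders for your own corpus, hash store, and embedding function:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_documents(docs: dict[str, str], stored_hashes: dict[str, str], embed) -> None:
    """Re-embed only documents whose content hash has changed."""
    for doc_id, text in docs.items():
        h = content_hash(text)
        if stored_hashes.get(doc_id) == h:
            continue  # unchanged: skip the embedding call entirely
        vector = embed(text)
        # ... upsert (doc_id, vector) into the vector store ...
        stored_hashes[doc_id] = h
```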
3. Optimizing Vector Storage
Strategy A: Dimensionality Reduction
Smaller vectors = less RAM = cheaper hosting.
- Matryoshka Embeddings: Newer models (like `text-embedding-3`) allow truncating vectors (e.g., 1536 → 256 dims) with minimal performance loss.
- Binary Quantization: Convert 32-bit floats to single bits, reducing storage by 32x (both are sketched below).
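Both techniques are a few lines of NumPy once you have the float vectors. This is a sketch; truncation only preserves quality for models trained with Matryoshka-style objectives (as `text-embedding-3` was):

```python
import numpy as np

def truncate_matryoshka(vectors: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep only the first `dims` dimensions, then re-normalize for cosine search."""
    truncated = vectors[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """1 bit per dimension (32x smaller than float32); packbits turns 8 dims into 1 byte."""
    return np.packbits(vectors > 0, axis=1)
```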
Strategy B: Serverless Vector DBs
Don't pay for idle time.
- Use: LanceDB (runs on S3/Disk), Pinecone Serverless, or Neon (Postgres + pgvector).
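LanceDB, for instance, needs no running server at all. The snippet below writes a table to local disk; an S3 URI works the same way (the bucket name here is illustrative):

```python
import lancedb

# Data lives as files on disk or S3, so there is no always-on cluster to pay for.
db = lancedb.connect("./lancedb")  # or lancedb.connect("s3://my-bucket/rag")
table = db.create_table(
    "docs",
    data=[{"vector": [0.1] * 384, "text": "Refunds are processed within 14 days."}],
)
hits = table.search([0.1] * 384).limit(3).to_list()
```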
Cost Analysis Example
Scenario: 1,000 queries/day. 5,000 tokens context per query.
Unoptimized (GPT-4o):
- Input: 1k * 5k = 5M tokens/day.
- Price: ~$5/1M tokens.
- Daily Cost: $25.
- Yearly Cost: ~$9,000.
Optimized (GPT-4o-mini + Re-ranking):
- Reduce context to 2k tokens per query (via tighter re-ranking).
- Input: 1k * 2k = 2M tokens/day.
- Price: ~$0.15/1M tokens.
- Daily Cost: $0.30.
- Yearly Cost: ~$110.
Savings: ~98% reduction.
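The arithmetic behind these numbers, as a quick sanity check:

```python
QUERIES_PER_DAY = 1_000

def yearly_cost(context_tokens: int, price_per_million_usd: float) -> float:
    daily_tokens = QUERIES_PER_DAY * context_tokens
    return daily_tokens / 1_000_000 * price_per_million_usd * 365

print(yearly_cost(5_000, 5.00))   # unoptimized GPT-4o:    ~$9,125/year
print(yearly_cost(2_000, 0.15))   # optimized GPT-4o-mini: ~$110/year
```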
Checklist for Cost Reduction
- Re-ranking implemented to reduce `top_k`.
- Caching enabled for frequent queries.
- Open Source Embeddings evaluated.
- Model Router set up for simple vs. complex queries.
- Prompt Engineering to reduce verbose system instructions.
Next Steps
- Context Window Management - Technical details on handling tokens.
- Choosing Embedding Models - Performance vs Cost trade-offs.