Vector Storage Fundamentals

Understanding vector databases, indexing strategies, and storage optimization for RAG systems.

Overview

Vector databases store embeddings and enable fast similarity search. Understanding storage fundamentals is critical for scaling RAG systems.

Vector Database Basics

What Gets Stored

{
    "id": "doc_123",
    "vector": [0.1, 0.2, ..., 0.768],  # 768-dim embedding
    "metadata": {
        "text": "Original chunk text",
        "source": "document.pdf",
        "page": 5,
        "created_at": "2024-01-01"
    }
}
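
As a concrete sketch, here is how such a record could be upserted with the Qdrant client used later in this guide; the collection name, the integer point id, and the embedding variable are assumptions:

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

client.upsert(
    collection_name="docs",  # hypothetical collection
    points=[PointStruct(
        id=123,  # Qdrant ids must be unsigned ints or UUIDs, so "doc_123" lives in the payload
        vector=embedding,  # hypothetical 768-dim list of floats
        payload={
            "doc_id": "doc_123",
            "text": "Original chunk text",
            "source": "document.pdf",
            "page": 5,
            "created_at": "2024-01-01"
        }
    )]
)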

Similarity Metrics

Cosine Similarity (most common):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Euclidean Distance:

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

Dot Product (for normalized vectors):

def dot_product(a, b):
    return np.dot(a, b)
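
Once vectors are L2-normalized, cosine similarity and dot product return the same value, which is why many databases let you pick either. A quick check with the functions above (random vectors, purely illustrative):

a = np.random.randn(768)
b = np.random.randn(768)

# L2-normalize both vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Identical results for unit-length vectors
print(cosine_similarity(a_unit, b_unit))
print(dot_product(a_unit, b_unit))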

Indexing Strategies

Flat Index (Exact Search)

  • Compares query to every vector
  • 100% accurate
  • O(n) search cost per query
  • Use when: <100k vectors

HNSW (Hierarchical Navigable Small World)

  • Graph-based approximate search
  • Fast queries (~10ms)
  • High recall (>95%)
  • Use when: >100k vectors, need speed

IVF (Inverted File Index)

  • Clusters vectors, searches nearest clusters
  • Memory efficient
  • Good for large datasets
  • Use when: >1M vectors, limited memory
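
A minimal sketch of all three strategies with FAISS (assuming the faiss-cpu package; M=32, nlist=100, and the random data are illustrative choices, not tuned values):

import numpy as np
import faiss

d = 768  # embedding dimensionality
xb = np.random.rand(10000, d).astype('float32')  # toy corpus

# Flat: exact brute-force search
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# HNSW: graph-based approximate search (32 = neighbors per graph node)
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

# IVF: cluster into nlist cells, probe only the nearest few per query
ivf = faiss.IndexIVFFlat(faiss.IndexFlatL2(d), d, 100)
ivf.train(xb)   # IVF needs a training pass to learn the clusters
ivf.add(xb)
ivf.nprobe = 8  # clusters searched per query (recall/speed trade-off)

query = np.random.rand(1, d).astype('float32')
distances, ids = hnsw.search(query, 5)  # top-5 approximate neighbors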

Storage Optimization

Quantization

Reduce vector precision to save space:

import numpy as np

# Original (float32): 768 dims × 4 bytes = 3,072 bytes per vector
# Quantized (uint8):  768 dims × 1 byte  =   768 bytes per vector (4x smaller)

def quantize_vector(vector, bits=8):
    """Scalar-quantize a float vector to unsigned integers in [0, 2^bits - 1]."""
    min_val, max_val = vector.min(), vector.max()
    scaled = (vector - min_val) / (max_val - min_val)
    quantized = (scaled * (2**bits - 1)).astype(np.uint8)
    return quantized, min_val, max_val

def dequantize_vector(quantized, min_val, max_val, bits=8):
    """Approximately invert quantize_vector; some precision is lost irreversibly."""
    scaled = quantized.astype(np.float32) / (2**bits - 1)
    return scaled * (max_val - min_val) + min_val

Savings: 75% storage reduction with minimal accuracy loss
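
A quick round-trip (illustrative random vector) shows both the size win and the bounded reconstruction error:

vec = np.random.randn(768).astype(np.float32)

q, lo, hi = quantize_vector(vec)
restored = dequantize_vector(q, lo, hi)

print(q.nbytes, "vs", vec.nbytes)    # 768 vs 3072 bytes
print(np.abs(vec - restored).max())  # worst-case per-dimension error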

Dimensionality Reduction

from sklearn.decomposition import PCA

# Reduce 768-dim embeddings to 256 dims (~67% storage reduction)
pca = PCA(n_components=256)
reduced_vectors = pca.fit_transform(original_vectors)  # (n, 768) float array

# Queries must be projected with the same fitted PCA before searching
reduced_query = pca.transform(query_vector.reshape(1, -1))

Partitioning Strategies

By Metadata

# Store different document types in separate collections
collections = {
    'legal': vector_db.create_collection('legal_docs'),
    'technical': vector_db.create_collection('technical_docs'),
    'marketing': vector_db.create_collection('marketing_docs')
}

# Query only relevant collection
results = collections['legal'].search(query_vector)
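
The same pattern with the Qdrant client used later in this guide (collection names and the 768-dim, cosine-metric config are assumptions):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# One collection per document type
for name in ('legal_docs', 'technical_docs', 'marketing_docs'):
    client.create_collection(
        collection_name=name,
        vectors_config=VectorParams(size=768, distance=Distance.COSINE)
    )

# Route each query to the single relevant partition
results = client.search(
    collection_name='legal_docs',
    query_vector=query_vector,  # hypothetical 768-dim query embedding
    limit=5
)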

By Time

# Partition by month for time-series data (zero-pad the month so names sort)
collection_name = f"docs_{year}_{month:02d}"
vector_db.create_collection(collection_name)

# Query recent data first
recent_results = vector_db.search('docs_2024_12', query)
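
Falling back through older partitions can be wrapped in a small helper; vector_db.search taking a limit argument is a hypothetical extension of the API sketched above:

from datetime import date

def search_recent_first(query_vector, months=3, limit=5):
    """Search monthly partitions newest-first until enough hits accumulate."""
    hits = []
    total = date.today().year * 12 + date.today().month - 1  # months since year 0
    for i in range(months):
        y, m = divmod(total - i, 12)
        hits += vector_db.search(f"docs_{y}_{m + 1:02d}", query_vector, limit=limit - len(hits))
        if len(hits) >= limit:
            break
    return hits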

Backup & Recovery

import json

def backup_collection(collection_name, output_file):
    """Export vectors and metadata to JSON (records must be JSON-serializable;
    convert numpy arrays with .tolist() first)."""
    records = vector_db.get_all(collection_name)

    with open(output_file, 'w') as f:
        json.dump(records, f)

def restore_collection(collection_name, input_file):
    """Restore a collection from a JSON backup."""
    with open(input_file, 'r') as f:
        records = json.load(f)

    vector_db.upsert(collection_name, records)

Performance Tuning

Batch Operations

# Bad: insert one at a time (1,000 vectors -> 1,000 network calls)
for vector in vectors:
    db.insert(vector)

# Good: batch insert (1,000 vectors / batches of 100 -> 10 network calls)
db.insert_batch(vectors, batch_size=100)
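
If your client lacks a batch_size parameter, the same effect comes from chunking manually (db.insert_batch here is the hypothetical call from above, taking one batch per call):

def insert_in_batches(db, vectors, batch_size=100):
    """Send fixed-size chunks to bound payload size and call count."""
    for i in range(0, len(vectors), batch_size):
        db.insert_batch(vectors[i:i + batch_size])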

Connection Pooling

from qdrant_client import QdrantClient

# Create the client once at startup and reuse it across requests;
# re-creating it per query wastes connection setup time
client = QdrantClient(
    url="localhost:6333",
    timeout=60,
    prefer_grpc=True  # gRPC is typically faster than HTTP for bulk operations
)
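
One way to guarantee reuse across a codebase is a module-level accessor, sketched here with the same settings as above:

_client = None

def get_client():
    """Lazily create one shared QdrantClient for the whole process."""
    global _client
    if _client is None:
        _client = QdrantClient(url="localhost:6333", timeout=60, prefer_grpc=True)
    return _client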

Monitoring

Track these metrics:

# Illustrative client API; substitute your database's own stats calls
metrics = {
    'total_vectors': db.count(),
    'storage_size_gb': db.get_storage_size() / 1e9,
    'avg_query_latency_ms': db.get_avg_latency(),
    'p95_query_latency_ms': db.get_p95_latency(),
    'index_size_gb': db.get_index_size() / 1e9
}
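
If the database does not expose latency stats directly, record them client-side; this sketch assumes the Qdrant client from earlier and uses numpy for the percentile:

import time
import numpy as np

latencies_ms = []

def timed_search(collection, query_vector, limit=5):
    """Record wall-clock latency around each search call."""
    start = time.perf_counter()
    results = client.search(collection_name=collection, query_vector=query_vector, limit=limit)
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return results

# After some traffic, derive the metrics above from the samples
avg_ms = float(np.mean(latencies_ms))
p95_ms = float(np.percentile(latencies_ms, 95))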

Cost Optimization

Vectors    Storage Cost    Query Cost    Total/Month
100K       $5              $10           $15
1M         $50             $50           $100
10M        $500            $200          $700

Optimization tips:

  • Use quantization (75% savings)
  • Partition by metadata (query less data)
  • Use cheaper storage tiers for cold data

Next Steps