Vector Storage Fundamentals

Understanding vector databases, indexing strategies, and storage optimization for RAG systems.

Overview

Vector databases store embeddings and enable fast similarity search. Understanding storage fundamentals is critical for scaling RAG systems.

Vector Database Basics

What Gets Stored

{
    "id": "doc_123",
    "vector": [0.1, 0.2, ..., 0.768],  # 768-dim embedding
    "metadata": {
        "text": "Original chunk text",
        "source": "document.pdf",
        "page": 5,
        "created_at": "2024-01-01"
    }
}
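
As a concrete sketch, here is how such a record could be upserted with the Qdrant client used later in this guide; the collection name, the integer point id, and the embedding variable are assumptions:

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

client.upsert(
    collection_name="docs",  # hypothetical collection
    points=[PointStruct(
        id=123,  # Qdrant ids must be unsigned ints or UUIDs, so "doc_123" lives in the payload
        vector=embedding,  # hypothetical 768-dim list of floats
        payload={
            "doc_id": "doc_123",
            "text": "Original chunk text",
            "source": "document.pdf",
            "page": 5,
            "created_at": "2024-01-01"
        }
    )]
)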

Similarity Metrics

Cosine Similarity (most common):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Euclidean Distance:

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

Dot Product (for normalized vectors):

def dot_product(a, b):
    return np.dot(a, b)
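
Once vectors are L2-normalized, cosine similarity and dot product return the same value, which is why many databases let you pick either. A quick check with the functions above (random vectors, purely illustrative):

a = np.random.randn(768)
b = np.random.randn(768)

# L2-normalize both vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Identical results for unit-length vectors
print(cosine_similarity(a_unit, b_unit))
print(dot_product(a_unit, b_unit))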

Indexing Strategies

Flat Index (Exact Search)

  • Compares query to every vector
  • 100% accurate
  • O(n) search cost per query
  • Use when: <100k vectors

HNSW (Hierarchical Navigable Small World)

  • Graph-based approximate search
  • Fast queries (~10ms)
  • High recall (>95%)
  • Use when: >100k vectors, need speed

IVF (Inverted File Index)

  • Clusters vectors, searches nearest clusters
  • Memory efficient
  • Good for large datasets
  • Use when: >1M vectors, limited memory
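
A minimal sketch of all three strategies with FAISS (assuming the faiss-cpu package; M=32, nlist=100, and the random data are illustrative choices, not tuned values):

import numpy as np
import faiss

d = 768  # embedding dimensionality
xb = np.random.rand(10000, d).astype('float32')  # toy corpus

# Flat: exact brute-force search
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# HNSW: graph-based approximate search (32 = neighbors per graph node)
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

# IVF: cluster into nlist cells, probe only the nearest few per query
ivf = faiss.IndexIVFFlat(faiss.IndexFlatL2(d), d, 100)
ivf.train(xb)   # IVF needs a training pass to learn the clusters
ivf.add(xb)
ivf.nprobe = 8  # clusters searched per query (recall/speed trade-off)

query = np.random.rand(1, d).astype('float32')
distances, ids = hnsw.search(query, 5)  # top-5 approximate neighbors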

Storage Optimization

Quantization

Reduce vector precision to save space:

import numpy as np

# Original (float32): 768 dims × 4 bytes = 3,072 bytes per vector
# Quantized (uint8):  768 dims × 1 byte  =   768 bytes per vector (4x smaller)

def quantize_vector(vector, bits=8):
    """Scalar-quantize a float vector to unsigned integers in [0, 2^bits - 1]."""
    min_val, max_val = vector.min(), vector.max()
    scaled = (vector - min_val) / (max_val - min_val)
    quantized = (scaled * (2**bits - 1)).astype(np.uint8)
    return quantized, min_val, max_val

def dequantize_vector(quantized, min_val, max_val, bits=8):
    """Approximately invert quantize_vector; some precision is lost irreversibly."""
    scaled = quantized.astype(np.float32) / (2**bits - 1)
    return scaled * (max_val - min_val) + min_val

Savings: 75% storage reduction with minimal accuracy loss
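
A quick round-trip (illustrative random vector) shows both the size win and the bounded reconstruction error:

vec = np.random.randn(768).astype(np.float32)

q, lo, hi = quantize_vector(vec)
restored = dequantize_vector(q, lo, hi)

print(q.nbytes, "vs", vec.nbytes)    # 768 vs 3072 bytes
print(np.abs(vec - restored).max())  # worst-case per-dimension error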

Dimensionality Reduction

from sklearn.decomposition import PCA

# Reduce 768-dim embeddings to 256 dims (~67% storage reduction)
pca = PCA(n_components=256)
reduced_vectors = pca.fit_transform(original_vectors)  # (n, 768) float array

# Queries must be projected with the same fitted PCA before searching
reduced_query = pca.transform(query_vector.reshape(1, -1))

Partitioning Strategies

By Metadata

# Store different document types in separate collections
collections = {
    'legal': vector_db.create_collection('legal_docs'),
    'technical': vector_db.create_collection('technical_docs'),
    'marketing': vector_db.create_collection('marketing_docs')
}

# Query only relevant collection
results = collections['legal'].search(query_vector)
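
The same pattern with the Qdrant client used later in this guide (collection names and the 768-dim, cosine-metric config are assumptions):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# One collection per document type
for name in ('legal_docs', 'technical_docs', 'marketing_docs'):
    client.create_collection(
        collection_name=name,
        vectors_config=VectorParams(size=768, distance=Distance.COSINE)
    )

# Route each query to the single relevant partition
results = client.search(
    collection_name='legal_docs',
    query_vector=query_vector,  # hypothetical 768-dim query embedding
    limit=5
)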

By Time

# Partition by month for time-series data (zero-pad the month so names sort)
collection_name = f"docs_{year}_{month:02d}"
vector_db.create_collection(collection_name)

# Query recent data first
recent_results = vector_db.search('docs_2024_12', query)
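
Falling back through older partitions can be wrapped in a small helper; vector_db.search taking a limit argument is a hypothetical extension of the API sketched above:

from datetime import date

def search_recent_first(query_vector, months=3, limit=5):
    """Search monthly partitions newest-first until enough hits accumulate."""
    hits = []
    total = date.today().year * 12 + date.today().month - 1  # months since year 0
    for i in range(months):
        y, m = divmod(total - i, 12)
        hits += vector_db.search(f"docs_{y}_{m + 1:02d}", query_vector, limit=limit - len(hits))
        if len(hits) >= limit:
            break
    return hits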

Backup & Recovery

import json

def backup_collection(collection_name, output_file):
    """Export vectors and metadata to JSON (records must be JSON-serializable;
    convert numpy arrays with .tolist() first)."""
    records = vector_db.get_all(collection_name)

    with open(output_file, 'w') as f:
        json.dump(records, f)

def restore_collection(collection_name, input_file):
    """Restore a collection from a JSON backup."""
    with open(input_file, 'r') as f:
        records = json.load(f)

    vector_db.upsert(collection_name, records)

Performance Tuning

Batch Operations

# Bad: insert one at a time (1,000 vectors -> 1,000 network calls)
for vector in vectors:
    db.insert(vector)

# Good: batch insert (1,000 vectors / batches of 100 -> 10 network calls)
db.insert_batch(vectors, batch_size=100)
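
If your client lacks a batch_size parameter, the same effect comes from chunking manually (db.insert_batch here is the hypothetical call from above, taking one batch per call):

def insert_in_batches(db, vectors, batch_size=100):
    """Send fixed-size chunks to bound payload size and call count."""
    for i in range(0, len(vectors), batch_size):
        db.insert_batch(vectors[i:i + batch_size])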

Connection Pooling

from qdrant_client import QdrantClient

# Create the client once at startup and reuse it across requests;
# re-creating it per query wastes connection setup time
client = QdrantClient(
    url="localhost:6333",
    timeout=60,
    prefer_grpc=True  # gRPC is typically faster than HTTP for bulk operations
)
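
One way to guarantee reuse across a codebase is a module-level accessor, sketched here with the same settings as above:

_client = None

def get_client():
    """Lazily create one shared QdrantClient for the whole process."""
    global _client
    if _client is None:
        _client = QdrantClient(url="localhost:6333", timeout=60, prefer_grpc=True)
    return _client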

Monitoring

Track these metrics:

# Illustrative client API; substitute your database's own stats calls
metrics = {
    'total_vectors': db.count(),
    'storage_size_gb': db.get_storage_size() / 1e9,
    'avg_query_latency_ms': db.get_avg_latency(),
    'p95_query_latency_ms': db.get_p95_latency(),
    'index_size_gb': db.get_index_size() / 1e9
}
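
If the database does not expose latency stats directly, record them client-side; this sketch assumes the Qdrant client from earlier and uses numpy for the percentile:

import time
import numpy as np

latencies_ms = []

def timed_search(collection, query_vector, limit=5):
    """Record wall-clock latency around each search call."""
    start = time.perf_counter()
    results = client.search(collection_name=collection, query_vector=query_vector, limit=limit)
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return results

# After some traffic, derive the metrics above from the samples
avg_ms = float(np.mean(latencies_ms))
p95_ms = float(np.percentile(latencies_ms, 95))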

Cost Optimization

Vectors    Storage Cost    Query Cost    Total/Month
100K       $5              $10           $15
1M         $50             $50           $100
10M        $500            $200          $700

Optimization tips:

  • Use quantization (75% savings)
  • Partition by metadata (query less data)
  • Use cheaper storage tiers for cold data

Next Steps