# Embedding Fundamentals

Introduction to embeddings and vector representations for semantic search.

## Overview

Embeddings are dense vector representations of text that capture semantic meaning. Unlike traditional keyword matching, embeddings enable semantic search by representing similar concepts close together in vector space, even when they use different words.

## What Are Embeddings?
An embedding is a numerical representation (vector) of text where:
- Each piece of text maps to a point in high-dimensional space (typically 384-1536 dimensions)
- Semantically similar texts are positioned close together
- Distance between vectors indicates semantic similarity

Example:

```text
"dog"   → [0.2, 0.8, 0.1, ..., 0.5]   (384 dimensions)
"puppy" → [0.3, 0.7, 0.2, ..., 0.4]   (close to "dog")
"car"   → [0.9, 0.1, 0.8, ..., 0.2]   (far from "dog")
```

## Why Embeddings Matter for RAG
Traditional keyword search fails when:
- Users phrase questions differently than documents
- Synonyms and related concepts aren't matched
- Context and meaning are ignored
Embeddings solve this by understanding semantic similarity:
- "How do I reset my password?" matches "Password recovery instructions"
- "Refund policy" matches "Money back guarantee"
- "ML model training" matches "Machine learning algorithm development"

## How Embeddings Work

### 1. Text Encoding

Text is converted to vectors using neural networks trained on massive text corpora:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode text to vector
text = "How do I reset my password?"
embedding = model.encode(text)
print(embedding.shape)  # (384,) - 384-dimensional vector
```

### 2. Similarity Calculation

Compare embeddings using cosine similarity:

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# Compare query to documents
query_emb = model.encode("password reset")
doc1_emb = model.encode("How to recover your account password")
doc2_emb = model.encode("Shipping policy information")

sim1 = cosine_similarity(query_emb, doc1_emb)  # High: ~0.85
sim2 = cosine_similarity(query_emb, doc2_emb)  # Low: ~0.15
```

### 3. Vector Storage and Search

Store embeddings in vector databases for efficient similarity search:

```python
import lancedb

# Create database
db = lancedb.connect("./my-database")

# Store documents with embeddings
data = [
    {
        "id": 1,
        "text": "How to reset your password",
        "vector": model.encode("How to reset your password").tolist()
    },
    {
        "id": 2,
        "text": "Shipping policy",
        "vector": model.encode("Shipping policy").tolist()
    }
]
table = db.create_table("documents", data)

# Search for similar documents
query = "password recovery"
query_vector = model.encode(query)
results = table.search(query_vector).limit(5).to_list()
```

## Common Embedding Models

### Small & Fast

- all-MiniLM-L6-v2 (384 dim)
  - Speed: Very fast
  - Quality: Good for general use
  - Best for: High-throughput applications

### Balanced

- all-mpnet-base-v2 (768 dim)
  - Speed: Moderate
  - Quality: Better semantic understanding
  - Best for: Most RAG applications

### High Quality

- text-embedding-3-large (OpenAI, 3072 dim)
  - Speed: API-dependent
  - Quality: Excellent
  - Best for: When accuracy is critical
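
Loading these models differs between the local sentence-transformers checkpoints and OpenAI's hosted model. A minimal sketch of both paths (the OpenAI call assumes the `openai` Python client and an `OPENAI_API_KEY` environment variable):

```python
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Local models: downloaded once, then encoded on your own hardware
small_model = SentenceTransformer("all-MiniLM-L6-v2")       # 384-dim vectors
balanced_model = SentenceTransformer("all-mpnet-base-v2")   # 768-dim vectors
print(small_model.get_sentence_embedding_dimension())       # 384

# Hosted model: text is sent to OpenAI's API and billed per token
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do I reset my password?",
)
print(len(response.data[0].embedding))  # 3072
```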

## Embedding Dimensions

More dimensions ≠ always better:

| Dimensions | Pros              | Cons            |
|------------|-------------------|-----------------|
| 384        | Fast, low storage | Less nuanced    |
| 768        | Good balance      | Moderate cost   |
| 1536+      | High quality      | Slow, expensive |

Rule of thumb: Start with 384-768 dimensions, upgrade only if evaluation shows clear benefit.
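
To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope sketch (assuming plain float32 vectors at 4 bytes per value; real indexes add overhead, and many vector stores offer quantization to shrink this):

```python
def raw_vector_storage_mb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Approximate raw storage for float32 embeddings, ignoring index overhead."""
    return num_vectors * dimensions * bytes_per_value / (1024 * 1024)

# Roughly 1.5 GB vs. 3 GB vs. 6 GB for one million vectors
for dims in (384, 768, 1536):
    print(f"{dims:>5} dims, 1M vectors: ~{raw_vector_storage_mb(1_000_000, dims):,.0f} MB")
```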

## Key Concepts

### Semantic Similarity

Embeddings capture meaning, not just words:

```python
# These are semantically similar despite different words
model.encode("automobile")  # Similar to ↓
model.encode("car")         # Similar to ↑
model.encode("vehicle")     # Similar to both
```

### Context Window

Models have maximum input lengths:

- Small models: 256-512 tokens
- Large models: 512-8192 tokens

Exceeding limits: Text gets truncated, losing information.
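
With sentence-transformers you can check the limit on the model object; anything beyond it is silently dropped, so very long inputs should be chunked first. A quick sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256 tokens for this model

long_text = "word " * 5000
embedding = model.encode(long_text)  # only the first max_seq_length tokens are used
```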

### Normalization

Embeddings are typically normalized (length = 1):

```python
import numpy as np

embedding = model.encode("text")
normalized = embedding / np.linalg.norm(embedding)
```

This allows using dot product instead of cosine similarity (faster).
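
For unit-length vectors, the dot product and cosine similarity give the same value; sentence-transformers can also return normalized vectors directly via the `normalize_embeddings` flag. A small sketch:

```python
import numpy as np

# Ask the model for unit-length vectors directly
a = model.encode("password reset", normalize_embeddings=True)
b = model.encode("How to recover your account password", normalize_embeddings=True)

# For unit vectors, dot product == cosine similarity
print(np.dot(a, b))
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # same value
```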

## Practical Implementation

### Basic RAG Pipeline

```python
from sentence_transformers import SentenceTransformer
import lancedb

class SimpleRAG:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.db = lancedb.connect("./rag-db")

    def add_documents(self, documents):
        """Add documents to vector database"""
        data = []
        for i, doc in enumerate(documents):
            data.append({
                "id": i,
                "text": doc,
                "vector": self.model.encode(doc).tolist()
            })
        self.table = self.db.create_table("docs", data, mode="overwrite")

    def search(self, query, k=5):
        """Search for relevant documents"""
        query_vector = self.model.encode(query)
        results = self.table.search(query_vector).limit(k).to_list()
        return [r['text'] for r in results]

# Usage
rag = SimpleRAG()
rag.add_documents([
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Embeddings represent text as vectors"
])
results = rag.search("What is Python?")
print(results[0])  # "Python is a programming language"
```

## Common Pitfalls

### 1. Using the Wrong Model for Your Domain

Generic models struggle with specialized terminology:

- Medical: "MI" (myocardial infarction vs. Michigan)
- Legal: "consideration" (legal term vs. general meaning)

Solution: Use domain-specific models or fine-tune.
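
Swapping in a different checkpoint is a one-line change. In the sketch below, "your-org/biomedical-embeddings" is a hypothetical placeholder; substitute a real domain-tuned sentence-transformers checkpoint or the output directory of your own fine-tune:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical domain-tuned checkpoint; replace with a real model name or local path
domain_model = SentenceTransformer("your-org/biomedical-embeddings")

embedding = domain_model.encode("Patient presented with acute MI")
```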

### 2. Ignoring Chunk Size

Embedding entire documents loses granularity:

```python
# Bad: Embed whole document
doc_embedding = model.encode(entire_document)  # Too broad

# Good: Embed chunks
chunks = split_document(document, chunk_size=512)
chunk_embeddings = [model.encode(chunk) for chunk in chunks]
```
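
`split_document` above is not a library function. A minimal character-based sketch of such a helper is shown here; real pipelines usually split on tokens, sentences, or document structure instead:

```python
def split_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking by characters, with a small overlap between chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```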

### 3. Not Normalizing Queries

Inconsistent query formatting affects results:

```python
# Inconsistent
query1 = "how do i reset password?"
query2 = "How do I reset my password?"

# Better: Normalize
def normalize_query(q):
    return q.lower().strip()
```

## Performance Considerations

### Batch Processing

Process multiple texts together for speed:

```python
# Slow: One at a time
for text in texts:
    embedding = model.encode(text)

# Fast: Batch processing
embeddings = model.encode(texts, batch_size=32)
```

### Caching

Cache embeddings for frequently accessed documents:

```python
import pickle

# Save embeddings
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

# Load embeddings
with open('embeddings.pkl', 'rb') as f:
    embeddings = pickle.load(f)
```
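
Pickling a full array works well for a static corpus. If the same texts are encoded repeatedly at runtime, a keyed cache avoids recomputation entirely; a minimal in-memory sketch (exact match on the raw text):

```python
from sentence_transformers import SentenceTransformer

class CachedEncoder:
    """Wraps an embedding model and reuses vectors for texts already seen."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self._cache = {}

    def encode(self, text):
        if text not in self._cache:
            self._cache[text] = self.model.encode(text)
        return self._cache[text]

encoder = CachedEncoder()
v1 = encoder.encode("How do I reset my password?")  # computed
v2 = encoder.encode("How do I reset my password?")  # served from the cache
```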

## Next Steps

- Choosing Embedding Models - Select the right model for your use case
- Fine-Tuning Embeddings - Improve performance with domain-specific training
- Multilingual Embeddings - Handle multiple languages