# Embedding Fundamentals

Introduction to embeddings and vector representations for semantic search.

## Overview

Embeddings are dense vector representations of text that capture semantic meaning. Unlike traditional keyword matching, embeddings enable semantic search by representing similar concepts close together in vector space, even when they use different words.

## What Are Embeddings?
An embedding is a numerical representation (vector) of text where:
- Each piece of text maps to a point in high-dimensional space (typically 384-1536 dimensions)
- Semantically similar texts are positioned close together
- Distance between vectors indicates semantic similarity

Example:

```text
"dog"   → [0.2, 0.8, 0.1, ..., 0.5]   (384 dimensions)
"puppy" → [0.3, 0.7, 0.2, ..., 0.4]   (close to "dog")
"car"   → [0.9, 0.1, 0.8, ..., 0.2]   (far from "dog")
```

## Why Embeddings Matter for RAG
Traditional keyword search fails when:
- Users phrase questions differently than documents
- Synonyms and related concepts aren't matched
- Context and meaning are ignored
Embeddings solve this by understanding semantic similarity:
- "How do I reset my password?" matches "Password recovery instructions"
- "Refund policy" matches "Money back guarantee"
- "ML model training" matches "Machine learning algorithm development"

## How Embeddings Work

### 1. Text Encoding

Text is converted to vectors using neural networks trained on massive text corpora:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode text to vector
text = "How do I reset my password?"
embedding = model.encode(text)
print(embedding.shape)  # (384,) - 384-dimensional vector
```

### 2. Similarity Calculation

Compare embeddings using cosine similarity:

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# Compare query to documents
query_emb = model.encode("password reset")
doc1_emb = model.encode("How to recover your account password")
doc2_emb = model.encode("Shipping policy information")

sim1 = cosine_similarity(query_emb, doc1_emb)  # High: ~0.85
sim2 = cosine_similarity(query_emb, doc2_emb)  # Low: ~0.15
```

### 3. Vector Storage and Search

Store embeddings in vector databases for efficient similarity search:

```python
import lancedb

# Create database
db = lancedb.connect("./my-database")

# Store documents with embeddings
data = [
    {
        "id": 1,
        "text": "How to reset your password",
        "vector": model.encode("How to reset your password").tolist()
    },
    {
        "id": 2,
        "text": "Shipping policy",
        "vector": model.encode("Shipping policy").tolist()
    }
]
table = db.create_table("documents", data)

# Search for similar documents
query = "password recovery"
query_vector = model.encode(query)
results = table.search(query_vector).limit(5).to_list()
```

## Common Embedding Models

### Small & Fast

- all-MiniLM-L6-v2 (384 dim)
  - Speed: Very fast
  - Quality: Good for general use
  - Best for: High-throughput applications

### Balanced

- all-mpnet-base-v2 (768 dim)
  - Speed: Moderate
  - Quality: Better semantic understanding
  - Best for: Most RAG applications

### High Quality

- text-embedding-3-large (OpenAI, 3072 dim)
  - Speed: API-dependent
  - Quality: Excellent
  - Best for: When accuracy is critical
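
Loading these models differs between the local sentence-transformers checkpoints and OpenAI's hosted model. A minimal sketch of both paths (the OpenAI call assumes the `openai` Python client and an `OPENAI_API_KEY` environment variable):

```python
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Local models: downloaded once, then encoded on your own hardware
small_model = SentenceTransformer("all-MiniLM-L6-v2")       # 384-dim vectors
balanced_model = SentenceTransformer("all-mpnet-base-v2")   # 768-dim vectors
print(small_model.get_sentence_embedding_dimension())       # 384

# Hosted model: text is sent to OpenAI's API and billed per token
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do I reset my password?",
)
print(len(response.data[0].embedding))  # 3072
```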

## Embedding Dimensions

More dimensions ≠ always better:

| Dimensions | Pros              | Cons            |
|------------|-------------------|-----------------|
| 384        | Fast, low storage | Less nuanced    |
| 768        | Good balance      | Moderate cost   |
| 1536+      | High quality      | Slow, expensive |

Rule of thumb: Start with 384-768 dimensions, upgrade only if evaluation shows clear benefit.
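
To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope sketch (assuming plain float32 vectors at 4 bytes per value; real indexes add overhead, and many vector stores offer quantization to shrink this):

```python
def raw_vector_storage_mb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Approximate raw storage for float32 embeddings, ignoring index overhead."""
    return num_vectors * dimensions * bytes_per_value / (1024 * 1024)

# Roughly 1.5 GB vs. 3 GB vs. 6 GB for one million vectors
for dims in (384, 768, 1536):
    print(f"{dims:>5} dims, 1M vectors: ~{raw_vector_storage_mb(1_000_000, dims):,.0f} MB")
```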

## Key Concepts

### Semantic Similarity

Embeddings capture meaning, not just words:

```python
# These are semantically similar despite different words
model.encode("automobile")  # Similar to ↓
model.encode("car")         # Similar to ↑
model.encode("vehicle")     # Similar to both
```

### Context Window

Models have maximum input lengths:

- Small models: 256-512 tokens
- Large models: 512-8192 tokens

Exceeding limits: Text gets truncated, losing information.
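
With sentence-transformers you can check the limit on the model object; anything beyond it is silently dropped, so very long inputs should be chunked first. A quick sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256 tokens for this model

long_text = "word " * 5000
embedding = model.encode(long_text)  # only the first max_seq_length tokens are used
```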

### Normalization

Embeddings are typically normalized (length = 1):

```python
import numpy as np

embedding = model.encode("text")
normalized = embedding / np.linalg.norm(embedding)
```

This allows using dot product instead of cosine similarity (faster).
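
For unit-length vectors, the dot product and cosine similarity give the same value; sentence-transformers can also return normalized vectors directly via the `normalize_embeddings` flag. A small sketch:

```python
import numpy as np

# Ask the model for unit-length vectors directly
a = model.encode("password reset", normalize_embeddings=True)
b = model.encode("How to recover your account password", normalize_embeddings=True)

# For unit vectors, dot product == cosine similarity
print(np.dot(a, b))
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # same value
```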

## Practical Implementation

### Basic RAG Pipeline

```python
from sentence_transformers import SentenceTransformer
import lancedb

class SimpleRAG:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.db = lancedb.connect("./rag-db")

    def add_documents(self, documents):
        """Add documents to vector database"""
        data = []
        for i, doc in enumerate(documents):
            data.append({
                "id": i,
                "text": doc,
                "vector": self.model.encode(doc).tolist()
            })
        self.table = self.db.create_table("docs", data, mode="overwrite")

    def search(self, query, k=5):
        """Search for relevant documents"""
        query_vector = self.model.encode(query)
        results = self.table.search(query_vector).limit(k).to_list()
        return [r['text'] for r in results]

# Usage
rag = SimpleRAG()
rag.add_documents([
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Embeddings represent text as vectors"
])
results = rag.search("What is Python?")
print(results[0])  # "Python is a programming language"
```

## Common Pitfalls

### 1. Using the Wrong Model for Your Domain

Generic models struggle with specialized terminology:

- Medical: "MI" (myocardial infarction vs. Michigan)
- Legal: "consideration" (legal term vs. general meaning)

Solution: Use domain-specific models or fine-tune.
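
Swapping in a different checkpoint is a one-line change. In the sketch below, "your-org/biomedical-embeddings" is a hypothetical placeholder; substitute a real domain-tuned sentence-transformers checkpoint or the output directory of your own fine-tune:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical domain-tuned checkpoint; replace with a real model name or local path
domain_model = SentenceTransformer("your-org/biomedical-embeddings")

embedding = domain_model.encode("Patient presented with acute MI")
```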

### 2. Ignoring Chunk Size

Embedding entire documents loses granularity:

```python
# Bad: Embed whole document
doc_embedding = model.encode(entire_document)  # Too broad

# Good: Embed chunks
chunks = split_document(document, chunk_size=512)
chunk_embeddings = [model.encode(chunk) for chunk in chunks]
```
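
`split_document` above is not a library function. A minimal character-based sketch of such a helper is shown here; real pipelines usually split on tokens, sentences, or document structure instead:

```python
def split_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking by characters, with a small overlap between chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```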

### 3. Not Normalizing Queries

Inconsistent query formatting affects results:

```python
# Inconsistent
query1 = "how do i reset password?"
query2 = "How do I reset my password?"

# Better: Normalize
def normalize_query(q):
    return q.lower().strip()
```

## Performance Considerations

### Batch Processing

Process multiple texts together for speed:

```python
# Slow: One at a time
for text in texts:
    embedding = model.encode(text)

# Fast: Batch processing
embeddings = model.encode(texts, batch_size=32)
```

### Caching

Cache embeddings for frequently accessed documents:

```python
import pickle

# Save embeddings
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

# Load embeddings
with open('embeddings.pkl', 'rb') as f:
    embeddings = pickle.load(f)
```
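
Pickling a full array works well for a static corpus. If the same texts are encoded repeatedly at runtime, a keyed cache avoids recomputation entirely; a minimal in-memory sketch (exact match on the raw text):

```python
from sentence_transformers import SentenceTransformer

class CachedEncoder:
    """Wraps an embedding model and reuses vectors for texts already seen."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self._cache = {}

    def encode(self, text):
        if text not in self._cache:
            self._cache[text] = self.model.encode(text)
        return self._cache[text]

encoder = CachedEncoder()
v1 = encoder.encode("How do I reset my password?")  # computed
v2 = encoder.encode("How do I reset my password?")  # served from the cache
```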

## Next Steps

- Choosing Embedding Models - Select the right model for your use case
- Fine-Tuning Embeddings - Improve performance with domain-specific training
- Multilingual Embeddings - Handle multiple languages