Reranking Fundamentals
Improve retrieval accuracy by re-scoring candidates with a Cross-Encoder.
Overview
Reranking is the second stage of a two-stage retrieval process that significantly improves search accuracy.
- Stage 1 (Retrieval): Retrieve a large set of candidates (e.g., top 50) using a fast method like vector search (Bi-Encoder) or keyword search (BM25).
- Stage 2 (Reranking): Re-score these candidates using a more accurate but slower model (Cross-Encoder) and return the top K (e.g., top 5).
Why Reranking?
Vector search (Bi-Encoders) is fast because it pre-computes document embeddings. However, it compresses a document's entire meaning into a single vector, losing nuance.
Cross-Encoders process the query and document together, allowing the model to pay attention to specific interactions between query terms and document text. This makes them much more accurate but computationally expensive to run on the entire database.
The Solution: Use vector search to filter 1M docs down to 50, then use a Cross-Encoder to find the best 5.
Implementation with Sentence Transformers
We can run a strong reranker locally using the sentence-transformers library and a model like cross-encoder/ms-marco-MiniLM-L-6-v2.
Prerequisites
pip install sentence-transformers
Code Example
from sentence_transformers import CrossEncoder
# 1. Initialize the Cross-Encoder model
# This model was trained on the MS MARCO passage ranking dataset
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# 2. Define your query and candidate documents
# In a real app, these candidates come from your Vector DB (e.g., LanceDB, Pinecone)
query = "How to process organic waste?"
candidates = [
    "Organic waste can be composted to create nutrient-rich soil.",
    "Plastic waste requires recycling facilities.",
    "Industrial waste management involves chemical treatment.",
    "Composting is a biological process for degrading organic matter.",
    "Nuclear waste storage is a complex safety issue."
]
# 3. Prepare pairs for the model
# The model expects a list of [query, document] pairs
pairs = [[query, doc] for doc in candidates]
# 4. Predict scores
# The model outputs a score for each pair (higher is better)
scores = model.predict(pairs)
# 5. Sort and display results
# Combine docs with scores and sort descending
results = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(f"Query: {query}\n")
print("Reranked Results:")
for doc, score in results:
    print(f"[{score:.4f}] {doc}")
Output
Query: How to process organic waste?
Reranked Results:
[8.2145] Composting is a biological process for degrading organic matter.
[7.9321] Organic waste can be composted to create nutrient-rich soil.
[-2.1034] Industrial waste management involves chemical treatment.
[-4.5123] Plastic waste requires recycling facilities.
[-5.6789] Nuclear waste storage is a complex safety issue.
Notice how the model correctly identifies the two relevant documents about composting/organic waste with high positive scores, while irrelevant documents get negative scores.
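These scores are unbounded relevance logits: useful for ranking, but not directly interpretable as probabilities. If you want values in a 0-1 range (for example, to apply an absolute relevance cutoff), one option is to pass them through a sigmoid. A minimal sketch, continuing from the example above and assuming your installed version returns raw logits as shown; the 0.5 threshold is purely illustrative:

import numpy as np

# Map unbounded logits to (0, 1); assumes `scores` holds raw logits as in the output above
probabilities = 1 / (1 + np.exp(-np.asarray(scores, dtype=float)))

# Optional: keep only candidates above an illustrative relevance threshold
confident = [(doc, p) for doc, p in zip(candidates, probabilities) if p > 0.5]
print(confident)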
Choosing a Reranker Model
| Model | Speed | Accuracy | Best For |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | Fast | Good | General purpose, production latency |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | Medium | Better | Slightly better accuracy if latency allows |
| BAAI/bge-reranker-base | Slow | High | High-accuracy requirements |
| BAAI/bge-reranker-large | Very Slow | Very High | Offline processing or complex reasoning |
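All of these models load through the same CrossEncoder interface, so trading speed for accuracy is usually a one-line change. A sketch, assuming the BAAI model ID above is downloadable in your environment:

from sentence_transformers import CrossEncoder

# Heavier reranker: same predict() interface, higher accuracy, higher latency
reranker = CrossEncoder('BAAI/bge-reranker-base')
scores = reranker.predict([[query, doc] for doc in candidates])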
Integration with Vector DB
Here is how to fit it into your RAG pipeline:
def search_with_reranking(query, k=5, fetch_k=50):
    # 1. Fast Retrieval (Bi-Encoder)
    # Get more candidates than you need (fetch_k > k)
    query_vector = embedding_model.encode(query)
    initial_results = vector_db.search(query_vector, k=fetch_k)

    # 2. Reranking (Cross-Encoder)
    pairs = [[query, doc.text] for doc in initial_results]
    scores = reranker_model.predict(pairs)

    # 3. Sort and Slice
    # Attach scores to results
    for i, doc in enumerate(initial_results):
        doc.score = scores[i]

    # Sort by new Cross-Encoder score
    reranked_results = sorted(initial_results, key=lambda x: x.score, reverse=True)

    # Return top k
    return reranked_results[:k]
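The function above assumes an embedding_model, a reranker_model, and a vector_db client already exist. A minimal setup sketch: the two models come from sentence-transformers (model choices are illustrative), while vector_db is a hypothetical stand-in for your actual store's client, which must return objects exposing a .text attribute and accept a .score attribute:

from sentence_transformers import SentenceTransformer, CrossEncoder

# Stage 1 model: fast Bi-Encoder (model choice is illustrative)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Stage 2 model: accurate Cross-Encoder
reranker_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# vector_db = ...  # your LanceDB / Pinecone / other client goes here

top_docs = search_with_reranking("How to process organic waste?", k=5, fetch_k=50)
for doc in top_docs:
    print(f"[{doc.score:.4f}] {doc.text}")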
Performance Insights: What to Expect
From Production: "A Cohere re-ranker typically improves performance by 6-12% while adding about 400-500ms of latency."
Trade-offs:
- Latency: Adds ~0.5s per query. If your SLA is <200ms, you might need to skip reranking or use a smaller model (see the latency-trimming sketch after this list).
- Cost: If using an API (Cohere), costs scale with query volume.
- Value: For most RAG systems, the 10% jump in recall is the difference between "useless" and "magic."
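If latency is the binding constraint, two knobs worth trying before dropping the reranker are shorter input truncation and a smaller candidate set. A sketch, assuming the sentence-transformers CrossEncoder from above; the max_length, candidate count, and batch size are illustrative:

from sentence_transformers import CrossEncoder

# Truncate each query+document pair to fewer tokens; shorter inputs rerank faster
fast_reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=256)

# Rerank fewer candidates and batch the forward passes
scores = fast_reranker.predict(pairs[:20], batch_size=32)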
Common Questions
"What is the difference between a Bi-Encoder and a Cross-Encoder?"
- Bi-Encoder (Vector Search):
  - Converts each document to a vector once (offline).
  - Converts the query to a vector once (online).
  - Compares vectors (fast).
  - Weakness: can't reliably distinguish "I love coffee" from "I hate coffee" (their vectors are similar); see the sketch after this list.
- Cross-Encoder (Reranker):
  - Takes Query + Document A, outputs a score.
  - Takes Query + Document B, outputs a score.
  - Reads the actual sentences together (slow).
  - Strength: captures deep semantic nuance, including negation.
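A small sketch of that negation example, comparing the two approaches side by side; model names are illustrative and exact numbers will vary by model and library version:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "I love coffee"
docs = ["I love coffee", "I hate coffee"]

# Bi-Encoder: cosine similarity of independently computed vectors; the two docs
# often score similarly because they share most of their surface wording
embeddings = bi_encoder.encode([query] + docs)
print("Bi-Encoder similarities:", util.cos_sim(embeddings[0], embeddings[1:]))

# Cross-Encoder: each (query, doc) pair is read jointly, so the contradictory
# document can receive a noticeably lower score
print("Cross-Encoder scores:", cross_encoder.predict([[query, d] for d in docs]))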
"Should I fine-tune my re-ranker?"
Yes, if your domain has specialized language; a minimal fine-tuning sketch follows the list below.
- Data: Use the same dataset you'd use for embedding fine-tuning.
- Strategy: If Recall@100 is good (95%) but Recall@10 is poor (50%), fine-tuning the re-ranker will yield better ROI than fine-tuning the embedding model.
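A minimal fine-tuning sketch using the classic sentence-transformers CrossEncoder training loop (newer releases also offer a trainer-based API); the labeled pairs and hyperparameters below are placeholders for your own data and tuning:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Labeled (query, passage) pairs: 1.0 = relevant, 0.0 = irrelevant (placeholders)
train_examples = [
    InputExample(texts=["How to process organic waste?",
                        "Composting is a biological process for degrading organic matter."], label=1.0),
    InputExample(texts=["How to process organic waste?",
                        "Nuclear waste storage is a complex safety issue."], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Start from the general-purpose checkpoint and adapt it to your domain
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', num_labels=1)
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)
model.save('my-domain-reranker')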
Next Steps
- Hybrid Search - Combine keyword and vector search before reranking.
- Context Window Management - Optimize what you send to the LLM.