Hybrid Search Techniques
Vector search is powerful for semantic understanding, but keyword search excels at exact matches. Hybrid search combines both for superior retrieval. Let's learn how.
The Problem with Pure Vector Search
What Vector Search Misses
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-api-key")
def get_embedding(text):
response = client.embeddings.create(
input=text,
model="text-embedding-3-small"
)
return np.array(response.data[0].embedding)
# Document collection
documents = [
"Python 3.11 introduces new features",
"The Python programming language is versatile",
"JavaScript ES2023 has async improvements",
"Model 3.11 is Tesla's electric vehicle"
]
# Query: Find Python 3.11 information
query = "Python 3.11"
query_emb = get_embedding(query)
doc_embs = [get_embedding(doc) for doc in documents]
# Calculate similarities
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
results = []
for doc, emb in zip(documents, doc_embs):
sim = cosine_similarity(query_emb, emb)
results.append((doc, sim))
results.sort(key=lambda x: x[1], reverse=True)
print("Vector search results:")
for doc, score in results:
print(f"{score:.4f} | {doc}")
# Output (problem!):
# 0.8234 | The Python programming language is versatile
# 0.7892 | Python 3.11 introduces new features ← Should be #1!
# 0.6456 | JavaScript ES2023 has async improvements
# 0.5123 | Model 3.11 is Tesla's electric vehicle
# "Python" (general) scores higher than "Python 3.11" (specific)!
Vector Search Limitation: Embeddings capture semantic meaning but may miss exact matches, version numbers, product codes, or proper nouns. "Python" and "Python 3.11" are semantically similar, but the specific version matters!
BM25: Best Matching 25
BM25 is a ranking function for keyword search and the long-standing gold standard for exact, lexical matching.
BM25 (Best Matching 25): A ranking algorithm for keyword search that scores documents based on term frequency, document frequency, and document length. It excels at finding exact matches and is the industry standard for lexical search.
How BM25 Works
"""
BM25 scores documents based on:
1. Term Frequency (TF): How often does the query term appear?
2. Inverse Document Frequency (IDF): How rare is the term?
3. Document Length Normalization: Penalize very long documents
Formula:
score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D| / avgdl))
Where:
- D: Document
- Q: Query
- qi: Query term i
- f(qi, D): Frequency of qi in D
- k1: Term frequency saturation parameter (default: 1.5)
- b: Length normalization parameter (default: 0.75)
- |D|: Document length
- avgdl: Average document length
"""
Implementing BM25
# pip install rank-bm25
from rank_bm25 import BM25Okapi
import numpy as np
# Documents
documents = [
"Python 3.11 introduces new features",
"The Python programming language is versatile",
"JavaScript ES2023 has async improvements",
"Model 3.11 is Tesla's electric vehicle"
]
# Tokenize documents (simple whitespace)
tokenized_docs = [doc.lower().split() for doc in documents]
# Create BM25 index
bm25 = BM25Okapi(tokenized_docs)
# Query
query = "Python 3.11"
tokenized_query = query.lower().split()
# Get BM25 scores
scores = bm25.get_scores(tokenized_query)
# Rank results
results = [(doc, score) for doc, score in zip(documents, scores)]
results.sort(key=lambda x: x[1], reverse=True)
print("BM25 search results:")
for doc, score in results:
print(f"{score:.4f} | {doc}")
# Output (better!):
# 2.1234 | Python 3.11 introduces new features ← Correct! Exact match
# 0.8456 | The Python programming language is versatile
# 0.3421 | Model 3.11 is Tesla's electric vehicle
# 0.0000 | JavaScript ES2023 has async improvements
BM25 Tuning Parameters
from rank_bm25 import BM25Okapi
# Default parameters
bm25_default = BM25Okapi(tokenized_docs)
# Tuned parameters
bm25_tuned = BM25Okapi(
tokenized_docs,
k1=1.5, # Term frequency saturation (higher = more weight on term frequency)
b=0.75 # Length normalization (0 = no normalization, 1 = full normalization)
)
# k1 examples:
# - k1 = 0.5: Less weight on repeated terms (good for short docs)
# - k1 = 1.5: Balanced (default, works for most cases)
# - k1 = 2.0: More weight on repeated terms (good for long docs)
# b examples:
# - b = 0.0: No length normalization (all docs treated equally)
# - b = 0.75: Standard normalization (default)
# - b = 1.0: Full normalization (heavily penalize long docs)
def compare_parameters():
query = "Python"
tokenized_query = query.lower().split()
configs = [
(BM25Okapi(tokenized_docs, k1=0.5, b=0.75), "k1=0.5, b=0.75"),
(BM25Okapi(tokenized_docs, k1=1.5, b=0.75), "k1=1.5, b=0.75 (default)"),
(BM25Okapi(tokenized_docs, k1=2.0, b=0.75), "k1=2.0, b=0.75"),
]
for bm25, label in configs:
scores = bm25.get_scores(tokenized_query)
print(f"\n{label}:")
for doc, score in zip(documents, scores):
print(f" {score:.4f} | {doc}")
compare_parameters()
Hybrid Search: Combining Vector + BM25
Simple Score Fusion
from rank_bm25 import BM25Okapi
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-api-key")
class HybridSearch:
def __init__(self, documents, alpha=0.5):
"""
Initialize hybrid search
Args:
documents: List of text documents
alpha: Balance between vector (1.0) and BM25 (0.0)
0.0 = pure BM25, 1.0 = pure vector, 0.5 = balanced
"""
self.documents = documents
self.alpha = alpha
# Setup BM25
self.tokenized_docs = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(self.tokenized_docs)
# Setup vector search
print("Creating embeddings...")
response = client.embeddings.create(
input=documents,
model="text-embedding-3-small"
)
self.embeddings = [np.array(item.embedding) for item in response.data]
def search(self, query, top_k=5):
"""
Hybrid search combining BM25 and vector search
Args:
query: Search query
top_k: Number of results to return
Returns:
List of (document, score) tuples
"""
# BM25 scores
tokenized_query = query.lower().split()
bm25_scores = self.bm25.get_scores(tokenized_query)
# Vector scores
query_response = client.embeddings.create(
input=query,
model="text-embedding-3-small"
)
query_embedding = np.array(query_response.data[0].embedding)
vector_scores = []
for emb in self.embeddings:
sim = np.dot(query_embedding, emb) / (
np.linalg.norm(query_embedding) * np.linalg.norm(emb)
)
vector_scores.append(sim)
# Normalize scores to 0-1 range
def normalize(scores):
scores = np.array(scores)
min_s, max_s = scores.min(), scores.max()
if max_s == min_s:
return np.ones_like(scores) * 0.5
return (scores - min_s) / (max_s - min_s)
bm25_scores_norm = normalize(bm25_scores)
vector_scores_norm = normalize(vector_scores)
# Combine scores
hybrid_scores = (
self.alpha * vector_scores_norm +
(1 - self.alpha) * bm25_scores_norm
)
# Rank results
results = [
(doc, score)
for doc, score in zip(self.documents, hybrid_scores)
]
results.sort(key=lambda x: x[1], reverse=True)
return results[:top_k]
# Example
documents = [
"Python 3.11 introduces new features including exception groups",
"The Python programming language is versatile and easy to learn",
"JavaScript ES2023 has async improvements and new array methods",
"Model 3.11 is Tesla's electric vehicle with long range",
"Python is used for web development, data science, and automation"
]
# Test different alpha values
for alpha in [0.0, 0.3, 0.5, 0.7, 1.0]:
print(f"\n{'='*60}")
print(f"Alpha = {alpha} ({'pure BM25' if alpha == 0 else 'pure vector' if alpha == 1 else 'hybrid'})")
print('='*60)
search = HybridSearch(documents, alpha=alpha)
results = search.search("Python 3.11", top_k=3)
for doc, score in results:
print(f"{score:.4f} | {doc}")
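One caveat with the min-max normalization inside HybridSearch: a single outlier score compresses everything else toward zero. Z-score (standard-score) normalization is a common alternative; this sketch is a drop-in replacement for the normalize helper, under the assumption that you only care about relative order after fusion (z-scores are not confined to 0-1).
import numpy as np

def normalize_zscore(scores):
    """Standardize scores to zero mean and unit variance before fusion."""
    scores = np.array(scores, dtype=float)
    std = scores.std()
    if std == 0:
        return np.zeros_like(scores)  # all scores equal: no signal either way
    return (scores - scores.mean()) / std

# An outlier squashes min-max-normalized scores but barely moves the z-scored middle
scores = [12.0, 1.1, 1.0, 0.9]
print(normalize_zscore(scores))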
Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion (RRF): A method for combining multiple search rankings that uses rank positions rather than scores. More robust than weighted fusion because it doesn't require score normalization.
def reciprocal_rank_fusion(rankings_list, k=60):
"""
Fuse multiple rankings using Reciprocal Rank Fusion
Args:
rankings_list: List of rankings, each is a list of (doc_id, score) tuples
k: Constant for RRF formula (default: 60)
Returns:
Fused ranking
"""
# Calculate RRF scores
rrf_scores = {}
for rankings in rankings_list:
for rank, (doc_id, _) in enumerate(rankings):
if doc_id not in rrf_scores:
rrf_scores[doc_id] = 0
            # RRF formula: 1 / (k + rank), with rank counted from 1
            rrf_scores[doc_id] += 1 / (k + rank + 1)
# Sort by RRF score
fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
return fused
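A quick standalone check of reciprocal_rank_fusion with toy rankings (the scores attached to each doc_id are ignored; only rank positions matter). A document near the top of both lists beats one that tops only a single list:
bm25_ranking = [("doc_a", 9.1), ("doc_b", 7.4), ("doc_c", 0.2)]
vector_ranking = [("doc_b", 0.91), ("doc_d", 0.85)]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{score:.5f} | {doc_id}")
# doc_b accumulates 1/62 + 1/61 ≈ 0.0325 and wins;
# doc_a (1/61), doc_d (1/62), and doc_c (1/63) follow in that order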
# Example
class HybridSearchRRF:
def __init__(self, documents):
self.documents = documents
self.doc_ids = list(range(len(documents)))
# Setup BM25
self.tokenized_docs = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(self.tokenized_docs)
# Setup vector search
response = client.embeddings.create(
input=documents,
model="text-embedding-3-small"
)
self.embeddings = [np.array(item.embedding) for item in response.data]
def search(self, query, top_k=5):
"""Search using RRF fusion"""
# BM25 ranking
tokenized_query = query.lower().split()
bm25_scores = self.bm25.get_scores(tokenized_query)
bm25_ranking = [
(doc_id, score)
for doc_id, score in enumerate(bm25_scores)
]
bm25_ranking.sort(key=lambda x: x[1], reverse=True)
# Vector ranking
query_response = client.embeddings.create(
input=query,
model="text-embedding-3-small"
)
query_embedding = np.array(query_response.data[0].embedding)
vector_scores = []
for doc_id, emb in enumerate(self.embeddings):
sim = np.dot(query_embedding, emb) / (
np.linalg.norm(query_embedding) * np.linalg.norm(emb)
)
vector_scores.append((doc_id, sim))
vector_scores.sort(key=lambda x: x[1], reverse=True)
# Fuse rankings
fused = reciprocal_rank_fusion([bm25_ranking, vector_scores])
# Convert to documents
results = [
(self.documents[doc_id], score)
for doc_id, score in fused[:top_k]
]
return results
# Example
search = HybridSearchRRF(documents)
results = search.search("Python 3.11", top_k=3)
print("RRF Hybrid Search Results:")
for doc, score in results:
print(f"{score:.4f} | {doc}")
RRF vs Weighted Fusion: RRF is more robust because it doesn't require score normalization and handles cases where one system has no relevant results. Use RRF when you're unsure about the optimal alpha parameter.
Re-ranking
After initial retrieval, use a more powerful model to re-rank results.
Re-ranking: A two-stage retrieval process where fast initial search retrieves many candidates, then a slower but more accurate model re-scores them. Balances speed and quality in search systems.
Cross-Encoder Re-ranking
Cross-Encoder: A neural model that jointly encodes query and document pairs to produce a relevance score. More accurate than bi-encoders but slower, making them ideal for re-ranking top candidates.
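Before wiring it into the pipeline, here is the cross-encoder API in isolation. Unlike the bi-encoder embeddings above, it sees each query-document pair together and returns one relevance score per pair; the exact values are model-specific logits, so only their ordering matters.
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

pairs = [
    ["Python 3.11 features", "Python 3.11 introduces exception groups"],
    ["Python 3.11 features", "Model 3.11 is a Tesla vehicle"],
]
scores = reranker.predict(pairs)  # one relevance score per (query, doc) pair
for (query, doc), score in zip(pairs, scores):
    print(f"{score:.3f} | {doc}")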
# pip install sentence-transformers
from sentence_transformers import CrossEncoder
class HybridSearchWithReranking:
def __init__(self, documents):
self.documents = documents
# Setup hybrid search (BM25 + vector)
self.tokenized_docs = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(self.tokenized_docs)
response = client.embeddings.create(
input=documents,
model="text-embedding-3-small"
)
self.embeddings = [np.array(item.embedding) for item in response.data]
# Setup cross-encoder for re-ranking
self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def search(self, query, top_k=5, rerank_top_n=20):
"""
Hybrid search with re-ranking
Args:
query: Search query
top_k: Final number of results
rerank_top_n: Number of candidates to re-rank
Returns:
Re-ranked results
"""
# Step 1: Hybrid retrieval (get more candidates)
# BM25 scores
tokenized_query = query.lower().split()
bm25_scores = self.bm25.get_scores(tokenized_query)
# Vector scores
query_response = client.embeddings.create(
input=query,
model="text-embedding-3-small"
)
query_embedding = np.array(query_response.data[0].embedding)
vector_scores = []
for emb in self.embeddings:
sim = np.dot(query_embedding, emb) / (
np.linalg.norm(query_embedding) * np.linalg.norm(emb)
)
vector_scores.append(sim)
# Normalize and combine
def normalize(scores):
scores = np.array(scores)
min_s, max_s = scores.min(), scores.max()
if max_s == min_s:
return np.ones_like(scores) * 0.5
return (scores - min_s) / (max_s - min_s)
hybrid_scores = 0.5 * normalize(bm25_scores) + 0.5 * normalize(vector_scores)
# Get top candidates
candidates = [
(doc, score)
for doc, score in zip(self.documents, hybrid_scores)
]
candidates.sort(key=lambda x: x[1], reverse=True)
candidates = candidates[:rerank_top_n]
# Step 2: Re-rank candidates with cross-encoder
pairs = [[query, doc] for doc, _ in candidates]
rerank_scores = self.reranker.predict(pairs)
# Combine
reranked = [
(doc, score)
for (doc, _), score in zip(candidates, rerank_scores)
]
reranked.sort(key=lambda x: x[1], reverse=True)
return reranked[:top_k]
# Example
documents = [
"Python 3.11 introduces exception groups and performance improvements",
"Learn Python programming from scratch with this comprehensive guide",
"JavaScript and Python are popular programming languages",
"The Python 3.11 release notes detail new features",
"Model 3.11 is a Tesla vehicle with advanced autopilot",
"Python best practices for writing clean, maintainable code",
]
search = HybridSearchWithReranking(documents)
print("Narrow re-ranking pool (top 3 only):")
# With rerank_top_n == top_k the cross-encoder can still reorder the top 3,
# but it can never promote a document the hybrid stage ranked lower
results_no_rerank = search.search("Python 3.11 features", top_k=3, rerank_top_n=3)
for doc, score in results_no_rerank:
print(f"{score:.4f} | {doc[:60]}...")
print("\nWide re-ranking pool (all 6 candidates):")
results_with_rerank = search.search("Python 3.11 features", top_k=3, rerank_top_n=6)
for doc, score in results_with_rerank:
print(f"{score:.4f} | {doc[:60]}...")
LLM-based Re-ranking
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
def llm_rerank(query: str, documents: list[str], top_k: int = 5) -> list[tuple]:
"""
Use LLM to re-rank documents
Args:
query: User query
documents: List of candidate documents
top_k: Number of results to return
Returns:
Re-ranked documents
"""
# Create prompt
docs_text = "\n".join([
f"{i+1}. {doc}"
for i, doc in enumerate(documents)
])
prompt = f"""Given the query: "{query}"
Rank the following documents from most to least relevant.
Return ONLY the numbers in order, comma-separated (e.g., "3,1,4,2").
Documents:
{docs_text}
Ranking (most relevant first):"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
# Parse ranking
ranking_text = response.choices[0].message.content.strip()
try:
ranking = [int(x.strip()) - 1 for x in ranking_text.split(",")] # Convert to 0-indexed
# Reorder documents
reranked = [
(documents[idx], len(documents) - rank) # Higher rank = higher score
for rank, idx in enumerate(ranking)
            if 0 <= idx < len(documents)  # guard against bad indices from the LLM
]
return reranked[:top_k]
except (ValueError, IndexError):
# Fallback: return original order
return [(doc, len(documents) - i) for i, doc in enumerate(documents[:top_k])]
# Example
documents = [
"Python programming basics",
"Python 3.11 new features and improvements",
"JavaScript async programming",
"Python 3.11 release notes",
]
query = "What's new in Python 3.11?"
reranked = llm_rerank(query, documents, top_k=3)
print("LLM Re-ranked results:")
for doc, score in reranked:
print(f"{score} | {doc}")
Re-ranking Strategy: Use fast retrieval (BM25/vector) to get 20-50 candidates, then re-rank with slower but more accurate models (cross-encoder or LLM). This balances speed and quality.
Implementing Hybrid Search with Weaviate
Weaviate has built-in hybrid search:
import weaviate
from weaviate.classes.config import Configure, Property, DataType
# Connect to Weaviate
client = weaviate.connect_to_local()
# Create collection
collection_name = "HybridDocs"
if client.collections.exists(collection_name):
client.collections.delete(collection_name)
collection = client.collections.create(
name=collection_name,
properties=[
Property(name="text", data_type=DataType.TEXT),
Property(name="category", data_type=DataType.TEXT),
],
vectorizer_config=Configure.Vectorizer.text2vec_openai()
)
# Insert documents
documents = [
{"text": "Python 3.11 introduces exception groups", "category": "python"},
{"text": "Python programming is versatile", "category": "python"},
{"text": "JavaScript ES2023 async improvements", "category": "javascript"},
{"text": "Python best practices guide", "category": "python"},
]
collection = client.collections.get(collection_name)
for doc in documents:
collection.data.insert(properties=doc)
print("Documents inserted")
# Hybrid search with different alpha values
for alpha in [0.0, 0.5, 1.0]:
print(f"\n{'='*60}")
print(f"Alpha = {alpha} ({'BM25' if alpha == 0 else 'vector' if alpha == 1 else 'hybrid'})")
print('='*60)
response = collection.query.hybrid(
query="Python 3.11",
alpha=alpha, # 0 = BM25, 1 = vector, 0.5 = balanced
limit=3,
return_properties=["text", "category"]
)
for obj in response.objects:
print(f"- {obj.properties['text']}")
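Weaviate can also fuse with rank positions instead of relative scores. Assuming the v4 Python client used above (defaults have changed across releases, so check yours), this requests RRF-style ranked fusion explicitly:
from weaviate.classes.query import HybridFusion

response = collection.query.hybrid(
    query="Python 3.11",
    alpha=0.5,
    fusion_type=HybridFusion.RANKED,  # rank-based (RRF-style) fusion
    limit=3,
)
for obj in response.objects:
    print(f"- {obj.properties['text']}")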
client.close()
Best Practices
1. Choose the Right Alpha
def tune_alpha(queries, documents, relevance_labels):
"""
Find optimal alpha by testing on labeled data
Args:
queries: List of test queries
documents: List of documents
relevance_labels: Dict mapping (query_idx, doc_idx) -> relevance score
Returns:
Optimal alpha value
"""
best_alpha = 0.5
best_score = 0
for alpha in np.arange(0, 1.1, 0.1):
search = HybridSearch(documents, alpha=alpha)
total_score = 0
for q_idx, query in enumerate(queries):
results = search.search(query, top_k=5)
            # Rank-discounted relevance sum (a simplified proxy for NDCG)
for rank, (doc, _) in enumerate(results):
doc_idx = documents.index(doc)
relevance = relevance_labels.get((q_idx, doc_idx), 0)
total_score += relevance / (rank + 1)
if total_score > best_score:
best_score = total_score
best_alpha = alpha
return best_alpha
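A toy invocation, assuming the HybridSearch class from earlier; the documents and relevance labels below are made up for illustration (higher grade = more relevant). Note that each alpha value rebuilds the index, so the sweep re-embeds the corpus eleven times; cache embeddings before tuning on anything sizable.
docs = [
    "Python 3.11 introduces new features",
    "The Python programming language is versatile",
    "JavaScript ES2023 has async improvements",
]
queries = ["Python 3.11", "JavaScript updates"]
relevance_labels = {(0, 0): 2, (0, 1): 1, (1, 2): 2}  # (query_idx, doc_idx) -> grade
best = tune_alpha(queries, docs, relevance_labels)
print(f"Best alpha: {best:.1f}")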
2. Use Re-ranking Selectively
# Re-ranking is expensive - use only when needed
def smart_search(query, documents, use_reranking=None):
"""
Automatically decide whether to use re-ranking
Args:
query: Search query
documents: Document collection
use_reranking: None (auto), True, or False
"""
# Get initial hybrid results
search = HybridSearch(documents)
results = search.search(query, top_k=10)
# Auto-decide re-ranking
if use_reranking is None:
# If top scores are close, use re-ranking
scores = [score for _, score in results[:5]]
score_variance = np.var(scores)
use_reranking = score_variance < 0.01 # Low variance = unclear ranking
if use_reranking:
# Apply re-ranking
reranked_search = HybridSearchWithReranking(documents)
return reranked_search.search(query, top_k=5)
else:
return results[:5]
3. Pre-process for Better BM25
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Download stopwords: nltk.download('stopwords')
def preprocess_for_bm25(text: str) -> list[str]:
"""
Preprocess text for better BM25 matching
Steps:
1. Lowercase
2. Remove special characters
3. Remove stopwords (optional)
4. Stem words (optional)
Args:
text: Input text
Returns:
List of processed tokens
"""
# Lowercase
text = text.lower()
# Remove special characters, keep alphanumeric and spaces
text = re.sub(r'[^a-z0-9\s]', '', text)
# Tokenize
tokens = text.split()
# Remove stopwords (optional - can hurt for short queries)
# stop_words = set(stopwords.words('english'))
# tokens = [t for t in tokens if t not in stop_words]
# Stem (optional - can help with variants like "running" vs "run")
# stemmer = PorterStemmer()
# tokens = [stemmer.stem(t) for t in tokens]
return tokens
# Example
text = "Python 3.11 introduces amazing new features!"
processed = preprocess_for_bm25(text)
print(processed)
# Output: ['python', '311', 'introduces', 'amazing', 'new', 'features']
Preprocessing Trade-off: Heavy preprocessing (stemming, stopword removal) can improve BM25 but may hurt exact matching. Test with your specific use case. For technical content (version numbers, code), minimal preprocessing is often best.
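To see both sides of the trade-off concretely: stemming collapses morphological variants (good for recall), while the same punctuation stripping collapses distinct version numbers (bad for precision). A small check, assuming NLTK is installed and using preprocess_for_bm25 from above:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"), stemmer.stem("runs"))  # run run -- variants unify
print(preprocess_for_bm25("Python 3.1.1"))  # ['python', '311']
print(preprocess_for_bm25("Python 3.11"))   # ['python', '311'] -- collision!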
Summary
Hybrid search combines the best of both worlds:
| Method | Strengths | Use For |
|---|---|---|
| BM25 | Exact matches, version numbers, proper nouns | Precise queries |
| Vector | Semantic meaning, paraphrasing, concepts | Conceptual queries |
| Hybrid | Both precision and recall | Most applications |
| Re-ranking | Highest quality, slower | Top candidates only |
Recommended Pipeline:
- Retrieval: Hybrid search (BM25 + vector) → Get 20-50 candidates
- Re-ranking: Cross-encoder or LLM → Refine to top 5-10
- Generation: Pass the top results to an LLM for the final response (sketched below)
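A minimal sketch of the final generation step, assuming the HybridSearchWithReranking class and OpenAI client from earlier; the prompt wording and model choice are illustrative, not prescriptive.
def rag_answer(query: str, search) -> str:
    """Retrieve with hybrid search + re-ranking, then ground the answer."""
    results = search.search(query, top_k=3)
    context = "\n".join(doc for doc, _ in results)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}"
        }],
        temperature=0
    )
    return response.choices[0].message.content

# answer = rag_answer("What's new in Python 3.11?", HybridSearchWithReranking(documents))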
Key parameters:
- alpha: 0.3-0.7 works for most cases (tune on your data)
- Re-rank top-N: 20-50 candidates
- Final top-K: 3-10 results
Hybrid search significantly improves RAG retrieval quality; for most applications, prefer it over pure vector search.