Hybrid Search Techniques
Vector search is powerful for semantic understanding, but keyword search excels at exact matches. Hybrid search combines both for superior retrieval. Let's learn how.
The Problem with Pure Vector Search
What Vector Search Misses
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-api-key")
def get_embedding(text):
response = client.embeddings.create(
input=text,
model="text-embedding-3-small"
)
return np.array(response.data[0].embedding)
# Document collection
documents = [
"Python 3.11 introduces new features",
"The Python programming language is versatile",
"JavaScript ES2023 has async improvements",
"Model 3.11 is Tesla's electric vehicle"
]
# Query: Find Python 3.11 information
query = "Python 3.11"
query_emb = get_embedding(query)
doc_embs = [get_embedding(doc) for doc in documents]
# Calculate similarities
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
results = []
for doc, emb in zip(documents, doc_embs):
sim = cosine_similarity(query_emb, emb)
results.append((doc, sim))
results.sort(key=lambda x: x[1], reverse=True)
print("Vector search results:")
for doc, score in results:
print(f"{score:.4f} | {doc}")
# Output (problem!):
# 0.8234 | The Python programming language is versatile
# 0.7892 | Python 3.11 introduces new features ← Should be #1!
# 0.6456 | JavaScript ES2023 has async improvements
# 0.5123 | Model 3.11 is Tesla's electric vehicle
# "Python" (general) scores higher than "Python 3.11" (specific)!
Vector Search Limitation: Embeddings capture semantic meaning but may miss exact matches, version numbers, product codes, or proper nouns. "Python" and "Python 3.11" are semantically similar, but the specific version matters!
BM25: Best Matching 25
BM25 is a ranking function for keyword search and the long-standing gold standard for exact, lexical matching.
BM25 (Best Matching 25): A ranking algorithm for keyword search that scores documents based on term frequency, document frequency, and document length. It excels at finding exact matches and is the industry standard for lexical search.
How BM25 Works
"""
BM25 scores documents based on:
1. Term Frequency (TF): How often does the query term appear?
2. Inverse Document Frequency (IDF): How rare is the term?
3. Document Length Normalization: Penalize very long documents
Formula:
score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D| / avgdl))
Where:
- D: Document
- Q: Query
- qi: Query term i
- f(qi, D): Frequency of qi in D
- k1: Term frequency saturation parameter (default: 1.5)
- b: Length normalization parameter (default: 0.75)
- |D|: Document length
- avgdl: Average document length
"""
Implementing BM25
# pip install rank-bm25
from rank_bm25 import BM25Okapi
import numpy as np
# Documents
documents = [
"Python 3.11 introduces new features",
"The Python programming language is versatile",
"JavaScript ES2023 has async improvements",
"Model 3.11 is Tesla's electric vehicle"
]
# Tokenize documents (simple whitespace)
tokenized_docs = [doc.lower().split() for doc in documents]
# Create BM25 index
bm25 = BM25Okapi(tokenized_docs)
# Query
query = "Python 3.11"
tokenized_query = query.lower().split()
# Get BM25 scores
scores = bm25.get_scores(tokenized_query)
# Rank results
results = [(doc, score) for doc, score in zip(documents, scores)]
results.sort(key=lambda x: x[1], reverse=True)
print("BM25 search results:")
for doc, score in results:
print(f"{score:.4f} | {doc}")
# Output (better!):
# 2.1234 | Python 3.11 introduces new features ← Correct! Exact match
# 0.8456 | The Python programming language is versatile
# 0.3421 | Model 3.11 is Tesla's electric vehicle
# 0.0000 | JavaScript ES2023 has async improvements
BM25 Tuning Parameters
from rank_bm25 import BM25Okapi
# Default parameters
bm25_default = BM25Okapi(tokenized_docs)
# Tuned parameters
bm25_tuned = BM25Okapi(
tokenized_docs,
k1=1.5, # Term frequency saturation (higher = more weight on term frequency)
b=0.75 # Length normalization (0 = no normalization, 1 = full normalization)
)
# k1 examples:
# - k1 = 0.5: Less weight on repeated terms (good for short docs)
# - k1 = 1.5: Balanced (default, works for most cases)
# - k1 = 2.0: More weight on repeated terms (good for long docs)
# b examples:
# - b = 0.0: No length normalization (all docs treated equally)
# - b = 0.75: Standard normalization (default)
# - b = 1.0: Full normalization (heavily penalize long docs)
def compare_parameters():
query = "Python"
tokenized_query = query.lower().split()
configs = [
(BM25Okapi(tokenized_docs, k1=0.5, b=0.75), "k1=0.5, b=0.75"),
(BM25Okapi(tokenized_docs, k1=1.5, b=0.75), "k1=1.5, b=0.75 (default)"),
(BM25Okapi(tokenized_docs, k1=2.0, b=0.75), "k1=2.0, b=0.75"),
]
for bm25, label in configs:
scores = bm25.get_scores(tokenized_query)
print(f"\n{label}:")
for doc, score in zip(documents, scores):
print(f" {score:.4f} | {doc}")
compare_parameters()
Hybrid Search: Combining Vector + BM25
Simple Score Fusion
from rank_bm25 import BM25Okapi
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-api-key")
class HybridSearch:
def __init__(self, documents, alpha=0.5):
"""
Initialize hybrid search
Args:
documents: List of text documents
alpha: Balance between vector (1.0) and BM25 (0.0)
0.0 = pure BM25, 1.0 = pure vector, 0.5 = balanced
"""
self.documents = documents
self.alpha = alpha
# Setup BM25
self.tokenized_docs = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(self.tokenized_docs)
# Setup vector search
print("Creating embeddings...")
response = client.embeddings.create(
input=documents,
model="text-embedding-3-small"
)
self.embeddings = [np.array(item.embedding) for item in response.data]
def search(self, query, top_k=5):
"""
Hybrid search combining BM25 and vector search
Args:
query: Search query
top_k: Number of results to return
Returns:
List of (document, score) tuples
"""
# BM25 scores
tokenized_query = query.lower().split()
bm25_scores = self.bm25.get_scores(tokenized_query)
# Vector scores
query_response = client.embeddings.create(
input=query,
model="text-embedding-3-small"
)
query_embedding = np.array(query_response.data[0].embedding)
vector_scores = []
for emb in self.embeddings:
sim = np.dot(query_embedding, emb) / (
np.linalg.norm(query_embedding) * np.linalg.norm(emb)
)
vector_scores.append(sim)
# Normalize scores to 0-1 range
def normalize(scores):
scores = np.array(scores)
min_s, max_s = scores.min(), scores.max()
if max_s == min_s:
return np.ones_like(scores) * 0.5
return (scores - min_s) / (max_s - min_s)
bm25_scores_norm = normalize(bm25_scores)
vector_scores_norm = normalize(vector_scores)
# Combine scores
hybrid_scores = (
self.alpha * vector_scores_norm +
(1 - self.alpha) * bm25_scores_norm
)
# Rank results
results = [
(doc, score)
for doc, score in zip(self.documents, hybrid_scores)
]
results.sort(key=lambda x: x[1], reverse=True)
return results[:top_k]
# Example
documents = [
"Python 3.11 introduces new features including exception groups",
"The Python programming language is versatile and easy to learn",
"JavaScript ES2023 has async improvements and new array methods",
"Model 3.11 is Tesla's electric vehicle with long range",
"Python is used for web development, data science, and automation"
]
# Test different alpha values
for alpha in [0.0, 0.3, 0.5, 0.7, 1.0]:
print(f"\n{'='*60}")
print(f"Alpha = {alpha} ({'pure BM25' if alpha == 0 else 'pure vector' if alpha == 1 else 'hybrid'})")
print('='*60)
search = HybridSearch(documents, alpha=alpha)
results = search.search("Python 3.11", top_k=3)
for doc, score in results:
print(f"{score:.4f} | {doc}")
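One caveat with the min-max normalization inside HybridSearch: a single outlier score compresses everything else toward zero. Z-score (standard-score) normalization is a common alternative; this sketch is a drop-in replacement for the normalize helper, under the assumption that you only care about relative order after fusion (z-scores are not confined to 0-1).
import numpy as np

def normalize_zscore(scores):
    """Standardize scores to zero mean and unit variance before fusion."""
    scores = np.array(scores, dtype=float)
    std = scores.std()
    if std == 0:
        return np.zeros_like(scores)  # all scores equal: no signal either way
    return (scores - scores.mean()) / std

# An outlier squashes min-max-normalized scores but barely moves the z-scored middle
scores = [12.0, 1.1, 1.0, 0.9]
print(normalize_zscore(scores))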
Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion (RRF): A method for combining multiple search rankings that uses rank positions rather than scores. More robust than weighted fusion because it doesn't require score normalization.
def reciprocal_rank_fusion(rankings_list, k=60):
"""
Fuse multiple rankings using Reciprocal Rank Fusion
Args:
rankings_list: List of rankings, each is a list of (doc_id, score) tuples
k: Constant for RRF formula (default: 60)
Returns:
Fused ranking
"""
# Calculate RRF scores
rrf_scores = {}
for rankings in rankings_list:
for rank, (doc_id, _) in enumerate(rankings):
if doc_id not in rrf_scores:
rrf_scores[doc_id] = 0
            # RRF formula: 1 / (k + rank), with rank counted from 1
            rrf_scores[doc_id] += 1 / (k + rank + 1)
# Sort by RRF score
fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
return fused
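A quick standalone check of reciprocal_rank_fusion with toy rankings (the scores attached to each doc_id are ignored; only rank positions matter). A document near the top of both lists beats one that tops only a single list:
bm25_ranking = [("doc_a", 9.1), ("doc_b", 7.4), ("doc_c", 0.2)]
vector_ranking = [("doc_b", 0.91), ("doc_d", 0.85)]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{score:.5f} | {doc_id}")
# doc_b accumulates 1/62 + 1/61 ≈ 0.0325 and wins;
# doc_a (1/61), doc_d (1/62), and doc_c (1/63) follow in that order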
# Example
class HybridSearchRRF:
def __init__(self, documents):
self.documents = documents
self.doc_ids = list(range(len(documents)))
# Setup BM25
self.tokenized_docs = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(self.tokenized_docs)
# Setup vector search
response = client.embeddings.create(
input=documents,
model="text-embedding-3-small"
)
self.embeddings = [np.array(item.embedding) for item in response.data]
def search(self, query, top_k=5):
"""Search using RRF fusion"""
# BM25 ranking
tokenized_query = query.lower().split()
bm25_scores = self.bm25.get_scores(tokenized_query)
bm25_ranking = [
(doc_id, score)
for doc_id, score in enumerate(bm25_scores)
]
bm25_ranking.sort(key=lambda x: x[1], reverse=True)
# Vector ranking
query_response = client.embeddings.create(
input=query,
model="text-embedding-3-small"
)
query_embedding = np.array(query_response.data[0].embedding)
vector_scores = []
for doc_id, emb in enumerate(self.embeddings):
sim = np.dot(query_embedding, emb) / (
np.linalg.norm(query_embedding) * np.linalg.norm(emb)
)
vector_scores.append((doc_id, sim))
vector_scores.sort(key=lambda x: x[1], reverse=True)
# Fuse rankings
fused = reciprocal_rank_fusion([bm25_ranking, vector_scores])
# Convert to documents
results = [
(self.documents[doc_id], score)
for doc_id, score in fused[:top_k]
]
return results
# Example
search = HybridSearchRRF(documents)
results = search.search("Python 3.11", top_k=3)
print("RRF Hybrid Search Results:")
for doc, score in results:
print(f"{score:.4f} | {doc}")
RRF vs Weighted Fusion: RRF is more robust because it doesn't require score normalization and handles cases where one system has no relevant results. Use RRF when you're unsure about the optimal alpha parameter.
Re-ranking
After initial retrieval, use a more powerful model to re-rank results.
Re-ranking: A two-stage retrieval process where fast initial search retrieves many candidates, then a slower but more accurate model re-scores them. Balances speed and quality in search systems.
Cross-Encoder Re-ranking
Cross-Encoder: A neural model that jointly encodes query and document pairs to produce a relevance score. More accurate than bi-encoders but slower, making them ideal for re-ranking top candidates.
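Before wiring it into the pipeline, here is the cross-encoder API in isolation. Unlike the bi-encoder embeddings above, it sees each query-document pair together and returns one relevance score per pair; the exact values are model-specific logits, so only their ordering matters.
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

pairs = [
    ["Python 3.11 features", "Python 3.11 introduces exception groups"],
    ["Python 3.11 features", "Model 3.11 is a Tesla vehicle"],
]
scores = reranker.predict(pairs)  # one relevance score per (query, doc) pair
for (query, doc), score in zip(pairs, scores):
    print(f"{score:.3f} | {doc}")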
# pip install sentence-transformers
from sentence_transformers import CrossEncoder
class HybridSearchWithReranking:
def __init__(self, documents):
self.documents = documents
# Setup hybrid search (BM25 + vector)
self.tokenized_docs = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(self.tokenized_docs)
response = client.embeddings.create(
input=documents,
model="text-embedding-3-small"
)
self.embeddings = [np.array(item.embedding) for item in response.data]
# Setup cross-encoder for re-ranking
self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def search(self, query, top_k=5, rerank_top_n=20):
"""
Hybrid search with re-ranking
Args:
query: Search query
top_k: Final number of results
rerank_top_n: Number of candidates to re-rank
Returns:
Re-ranked results
"""
# Step 1: Hybrid retrieval (get more candidates)
# BM25 scores
tokenized_query = query.lower().split()
bm25_scores = self.bm25.get_scores(tokenized_query)
# Vector scores
query_response = client.embeddings.create(
input=query,
model="text-embedding-3-small"
)
query_embedding = np.array(query_response.data[0].embedding)
vector_scores = []
for emb in self.embeddings:
sim = np.dot(query_embedding, emb) / (
np.linalg.norm(query_embedding) * np.linalg.norm(emb)
)
vector_scores.append(sim)
# Normalize and combine
def normalize(scores):
scores = np.array(scores)
min_s, max_s = scores.min(), scores.max()
if max_s == min_s:
return np.ones_like(scores) * 0.5
return (scores - min_s) / (max_s - min_s)
hybrid_scores = 0.5 * normalize(bm25_scores) + 0.5 * normalize(vector_scores)
# Get top candidates
candidates = [
(doc, score)
for doc, score in zip(self.documents, hybrid_scores)
]
candidates.sort(key=lambda x: x[1], reverse=True)
candidates = candidates[:rerank_top_n]
# Step 2: Re-rank candidates with cross-encoder
pairs = [[query, doc] for doc, _ in candidates]
rerank_scores = self.reranker.predict(pairs)
# Combine
reranked = [
(doc, score)
for (doc, _), score in zip(candidates, rerank_scores)
]
reranked.sort(key=lambda x: x[1], reverse=True)
return reranked[:top_k]
# Example
documents = [
"Python 3.11 introduces exception groups and performance improvements",
"Learn Python programming from scratch with this comprehensive guide",
"JavaScript and Python are popular programming languages",
"The Python 3.11 release notes detail new features",
"Model 3.11 is a Tesla vehicle with advanced autopilot",
"Python best practices for writing clean, maintainable code",
]
search = HybridSearchWithReranking(documents)
print("Narrow re-ranking pool (top 3 only):")
# With rerank_top_n == top_k the cross-encoder can still reorder the top 3,
# but it can never promote a document the hybrid stage ranked lower
results_no_rerank = search.search("Python 3.11 features", top_k=3, rerank_top_n=3)
for doc, score in results_no_rerank:
print(f"{score:.4f} | {doc[:60]}...")
print("\nWide re-ranking pool (all 6 candidates):")
results_with_rerank = search.search("Python 3.11 features", top_k=3, rerank_top_n=6)
for doc, score in results_with_rerank:
print(f"{score:.4f} | {doc[:60]}...")
LLM-based Re-ranking
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
def llm_rerank(query: str, documents: list[str], top_k: int = 5) -> list[tuple]:
"""
Use LLM to re-rank documents
Args:
query: User query
documents: List of candidate documents
top_k: Number of results to return
Returns:
Re-ranked documents
"""
# Create prompt
docs_text = "\n".join([
f"{i+1}. {doc}"
for i, doc in enumerate(documents)
])
prompt = f"""Given the query: "{query}"
Rank the following documents from most to least relevant.
Return ONLY the numbers in order, comma-separated (e.g., "3,1,4,2").
Documents:
{docs_text}
Ranking (most relevant first):"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
# Parse ranking
ranking_text = response.choices[0].message.content.strip()
try:
ranking = [int(x.strip()) - 1 for x in ranking_text.split(",")] # Convert to 0-indexed
# Reorder documents
reranked = [
(documents[idx], len(documents) - rank) # Higher rank = higher score
for rank, idx in enumerate(ranking)
            if 0 <= idx < len(documents)  # guard against bad indices from the LLM
]
return reranked[:top_k]
except (ValueError, IndexError):
# Fallback: return original order
return [(doc, len(documents) - i) for i, doc in enumerate(documents[:top_k])]
# Example
documents = [
"Python programming basics",
"Python 3.11 new features and improvements",
"JavaScript async programming",
"Python 3.11 release notes",
]
query = "What's new in Python 3.11?"
reranked = llm_rerank(query, documents, top_k=3)
print("LLM Re-ranked results:")
for doc, score in reranked:
print(f"{score} | {doc}")
Re-ranking Strategy: Use fast retrieval (BM25/vector) to get 20-50 candidates, then re-rank with slower but more accurate models (cross-encoder or LLM). This balances speed and quality.
Implementing Hybrid Search with Weaviate
Weaviate has built-in hybrid search:
import weaviate
from weaviate.classes.config import Configure, Property, DataType
# Connect to Weaviate
client = weaviate.connect_to_local()
# Create collection
collection_name = "HybridDocs"
if client.collections.exists(collection_name):
client.collections.delete(collection_name)
collection = client.collections.create(
name=collection_name,
properties=[
Property(name="text", data_type=DataType.TEXT),
Property(name="category", data_type=DataType.TEXT),
],
vectorizer_config=Configure.Vectorizer.text2vec_openai()
)
# Insert documents
documents = [
{"text": "Python 3.11 introduces exception groups", "category": "python"},
{"text": "Python programming is versatile", "category": "python"},
{"text": "JavaScript ES2023 async improvements", "category": "javascript"},
{"text": "Python best practices guide", "category": "python"},
]
collection = client.collections.get(collection_name)
for doc in documents:
collection.data.insert(properties=doc)
print("Documents inserted")
# Hybrid search with different alpha values
for alpha in [0.0, 0.5, 1.0]:
print(f"\n{'='*60}")
print(f"Alpha = {alpha} ({'BM25' if alpha == 0 else 'vector' if alpha == 1 else 'hybrid'})")
print('='*60)
response = collection.query.hybrid(
query="Python 3.11",
alpha=alpha, # 0 = BM25, 1 = vector, 0.5 = balanced
limit=3,
return_properties=["text", "category"]
)
for obj in response.objects:
print(f"- {obj.properties['text']}")
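Weaviate can also fuse with rank positions instead of relative scores. Assuming the v4 Python client used above (defaults have changed across releases, so check yours), this requests RRF-style ranked fusion explicitly:
from weaviate.classes.query import HybridFusion

response = collection.query.hybrid(
    query="Python 3.11",
    alpha=0.5,
    fusion_type=HybridFusion.RANKED,  # rank-based (RRF-style) fusion
    limit=3,
)
for obj in response.objects:
    print(f"- {obj.properties['text']}")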
client.close()
Best Practices
1. Choose the Right Alpha
def tune_alpha(queries, documents, relevance_labels):
"""
Find optimal alpha by testing on labeled data
Args:
queries: List of test queries
documents: List of documents
relevance_labels: Dict mapping (query_idx, doc_idx) -> relevance score
Returns:
Optimal alpha value
"""
best_alpha = 0.5
best_score = 0
for alpha in np.arange(0, 1.1, 0.1):
search = HybridSearch(documents, alpha=alpha)
total_score = 0
for q_idx, query in enumerate(queries):
results = search.search(query, top_k=5)
            # Rank-discounted relevance sum (a simplified proxy for NDCG)
for rank, (doc, _) in enumerate(results):
doc_idx = documents.index(doc)
relevance = relevance_labels.get((q_idx, doc_idx), 0)
total_score += relevance / (rank + 1)
if total_score > best_score:
best_score = total_score
best_alpha = alpha
return best_alpha
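A toy invocation, assuming the HybridSearch class from earlier; the documents and relevance labels below are made up for illustration (higher grade = more relevant). Note that each alpha value rebuilds the index, so the sweep re-embeds the corpus eleven times; cache embeddings before tuning on anything sizable.
docs = [
    "Python 3.11 introduces new features",
    "The Python programming language is versatile",
    "JavaScript ES2023 has async improvements",
]
queries = ["Python 3.11", "JavaScript updates"]
relevance_labels = {(0, 0): 2, (0, 1): 1, (1, 2): 2}  # (query_idx, doc_idx) -> grade
best = tune_alpha(queries, docs, relevance_labels)
print(f"Best alpha: {best:.1f}")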
2. Use Re-ranking Selectively
# Re-ranking is expensive - use only when needed
def smart_search(query, documents, use_reranking=None):
"""
Automatically decide whether to use re-ranking
Args:
query: Search query
documents: Document collection
use_reranking: None (auto), True, or False
"""
# Get initial hybrid results
search = HybridSearch(documents)
results = search.search(query, top_k=10)
# Auto-decide re-ranking
if use_reranking is None:
# If top scores are close, use re-ranking
scores = [score for _, score in results[:5]]
score_variance = np.var(scores)
use_reranking = score_variance < 0.01 # Low variance = unclear ranking
if use_reranking:
# Apply re-ranking
reranked_search = HybridSearchWithReranking(documents)
return reranked_search.search(query, top_k=5)
else:
return results[:5]
3. Pre-process for Better BM25
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Download stopwords: nltk.download('stopwords')
def preprocess_for_bm25(text: str) -> list[str]:
"""
Preprocess text for better BM25 matching
Steps:
1. Lowercase
2. Remove special characters
3. Remove stopwords (optional)
4. Stem words (optional)
Args:
text: Input text
Returns:
List of processed tokens
"""
# Lowercase
text = text.lower()
# Remove special characters, keep alphanumeric and spaces
text = re.sub(r'[^a-z0-9\s]', '', text)
# Tokenize
tokens = text.split()
# Remove stopwords (optional - can hurt for short queries)
# stop_words = set(stopwords.words('english'))
# tokens = [t for t in tokens if t not in stop_words]
# Stem (optional - can help with variants like "running" vs "run")
# stemmer = PorterStemmer()
# tokens = [stemmer.stem(t) for t in tokens]
return tokens
# Example
text = "Python 3.11 introduces amazing new features!"
processed = preprocess_for_bm25(text)
print(processed)
# Output: ['python', '311', 'introduces', 'amazing', 'new', 'features']
Preprocessing Trade-off: Heavy preprocessing (stemming, stopword removal) can improve BM25 but may hurt exact matching. Test with your specific use case. For technical content (version numbers, code), minimal preprocessing is often best.
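To see both sides of the trade-off concretely: stemming collapses morphological variants (good for recall), while the same punctuation stripping collapses distinct version numbers (bad for precision). A small check, assuming NLTK is installed and using preprocess_for_bm25 from above:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"), stemmer.stem("runs"))  # run run -- variants unify
print(preprocess_for_bm25("Python 3.1.1"))  # ['python', '311']
print(preprocess_for_bm25("Python 3.11"))   # ['python', '311'] -- collision!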
Summary
Hybrid search combines the best of both worlds:
| Method | Strengths | Use For |
|---|---|---|
| BM25 | Exact matches, version numbers, proper nouns | Precise queries |
| Vector | Semantic meaning, paraphrasing, concepts | Conceptual queries |
| Hybrid | Both precision and recall | Most applications |
| Re-ranking | Highest quality, slower | Top candidates only |
Recommended Pipeline:
- Retrieval: Hybrid search (BM25 + vector) → Get 20-50 candidates
- Re-ranking: Cross-encoder or LLM → Refine to top 5-10
- Generation: Pass the top results to an LLM for the final response (sketched below)
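A minimal sketch of the final generation step, assuming the HybridSearchWithReranking class and OpenAI client from earlier; the prompt wording and model choice are illustrative, not prescriptive.
def rag_answer(query: str, search) -> str:
    """Retrieve with hybrid search + re-ranking, then ground the answer."""
    results = search.search(query, top_k=3)
    context = "\n".join(doc for doc, _ in results)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}"
        }],
        temperature=0
    )
    return response.choices[0].message.content

# answer = rag_answer("What's new in Python 3.11?", HybridSearchWithReranking(documents))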
Key parameters:
- alpha: 0.3-0.7 works for most cases (tune on your data)
- Re-rank top-N: 20-50 candidates
- Final top-K: 3-10 results
Hybrid search significantly improves RAG retrieval quality; for most applications, prefer it over pure vector search.