In this lesson we compare classic BM25 retrieval with embedding-based semantic search (using sentence-transformers), then build a simple hybrid ranker that blends both signals. The example is compact and runs in a Jupyter notebook.

BM25 is a token-statistics method that excels at matching exact terms and weighting rare, informative tokens, but it can miss synonyms, paraphrases, and deeper meaning. A bi-encoder (sentence-transformers) maps queries and documents to vectors and ranks documents by cosine similarity, so it can retrieve semantically similar items even when they share no words with the query. A hybrid approach combines both strengths and often yields more robust retrieval in real applications.

Installation

Install the required Python packages (Jupyter-friendly):
!pip install -q sentence-transformers rank-bm25 numpy

Overview: how the demo works

  1. Prepare a small document corpus and example queries.
  2. Tokenize documents for BM25 and initialize the BM25 index.
  3. Encode documents with a sentence-transformer to produce normalized embeddings.
  4. For each query:
    • Get BM25 scores (token overlap / importance).
    • Get semantic scores (cosine similarity with embeddings).
    • Normalize both score vectors to [0,1] and combine them with a weighted linear blend: hybrid = alpha * semantic + (1-alpha) * bm25.
  5. Compare the top-k results for BM25, semantic, and hybrid.
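
The normalize-and-blend step can be sketched in isolation before any retrieval library is involved. This is a minimal illustration using plain Python lists; the helper names `min_max` and `blend` and the score values are made up for this sketch and are not part of the lesson's code.

```python
def min_max(scores, eps=1e-9):
    """Rescale scores to [0, 1]; a flat list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) for s in scores]

def blend(semantic, bm25, alpha):
    """hybrid = alpha * semantic + (1 - alpha) * bm25, after min-max scaling."""
    sem_n, bm_n = min_max(semantic), min_max(bm25)
    return [alpha * s + (1.0 - alpha) * b for s, b in zip(sem_n, bm_n)]

# Made-up scores for three documents (illustrative only).
semantic = [0.70, 0.55, 0.10]  # cosine similarities
bm25 = [0.0, 0.0, 1.5]         # raw BM25 scores

for alpha in (0.2, 0.8):
    hybrid = blend(semantic, bm25, alpha)
    best = max(range(len(hybrid)), key=hybrid.__getitem__)
    print(f"alpha={alpha}: best doc index = {best}")
```

With these numbers, a low alpha lets the BM25-favored document (index 2) win, while a high alpha lets the semantically closest document (index 0) win.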

Corpus, queries, and imports

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

docs = [
    "Enable two-factor authentication (2FA) in your account settings to add an extra security step.",
    "HbA1c measures long-term glucose; talk to your physician about tests for glycated hemoglobin.",
    "Our PTO policy covers paid time off for vacations and sick leave.",
    "How to fix engine misfires caused by bad spark plugs.",
    "Kubernetes Ingress configuration for path-based routing.",
    "Configure MFA with authenticator apps.",
    "Doctor appointment scheduling policy."
]

queries = ["How do I set up 2FA?", "What does HbA1c mean?", "sick leave policy?"]

Prepare BM25 and SentenceTransformer embeddings

# Prepare BM25
tokenized_corpus = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized_corpus)

# Choose a model (fast general-purpose)
model = SentenceTransformer('all-MiniLM-L6-v2')
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
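
The whitespace split used for BM25 above keeps punctuation attached to tokens, so "(2FA)" in a document and "2FA?" in a query become different strings. As a hedged alternative, the sketch below shows a simple regex tokenizer; the helper name `tokenize` is hypothetical and this is one possible choice, not the lesson's method.

```python
import re

def tokenize(text):
    """Lowercase and keep runs of letters/digits, dropping punctuation.
    Note: hyphenated words such as "two-factor" split into two tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

doc = "Enable two-factor authentication (2FA) in your account settings."
query = "How do I set up 2FA?"

# Shared tokens under each tokenization scheme.
naive_overlap = set(doc.lower().split()) & set(query.lower().split())
regex_overlap = set(tokenize(doc)) & set(tokenize(query))

print("whitespace split:", naive_overlap)  # no shared tokens
print("regex tokenizer:", regex_overlap)   # shares "2fa"
```

To try it in the demo, apply the same function to both the corpus (when building the BM25 index) and the query. BM25 rankings will then differ from the sample output shown in this lesson.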

Helper: show results for a query (BM25, semantic, and weighted hybrid)

def show_results(query, k=3, alpha=0.8):
    """
    Show BM25 top-k, semantic top-k, and hybrid top-k for `query`.
    alpha: semantic weight in [0,1] for hybrid. hybrid = alpha*semantic + (1-alpha)*bm25
    """
    print(f"\nQUERY: {query}\n" + "="*60)
    # BM25 scores
    bm25_scores = np.array(bm25.get_scores(query.lower().split()))
    top_bm25 = np.argsort(-bm25_scores)[:k]
    print("BM25 top-k:")
    for i in top_bm25:
        print(f"  [{i}] {bm25_scores[i]:.3f}  {docs[i]}")

    # Semantic (SentenceTransformer) scores (cosine similarity)
    q_emb = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
    cos = util.cos_sim(q_emb, doc_emb)[0].cpu().numpy()
    top_st = np.argsort(-cos)[:k]
    print("\nSemantic (SentenceTransformer) top-k:")
    for i in top_st:
        print(f"  [{i}] {cos[i]:.3f}  {docs[i]}")

    # Normalize both scores into [0,1] to combine them
    bm25_min, bm25_max = bm25_scores.min(), bm25_scores.max()
    bm25_norm = (bm25_scores - bm25_min) / (bm25_max - bm25_min + 1e-9)

    st_min, st_max = cos.min(), cos.max()
    st_norm = (cos - st_min) / (st_max - st_min + 1e-9)

    # Hybrid: semantic-weighted combination
    hybrid = alpha * st_norm + (1.0 - alpha) * bm25_norm
    top_h = np.argsort(-hybrid)[:k]
    print(f"\nHybrid (BM25 + Semantic) top-k (alpha={alpha}):")
    for i in top_h:
        print(f"  [{i}] {hybrid[i]:.3f}  {docs[i]}")
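
Because both the documents and the query are encoded with normalize_embeddings=True, each embedding has unit length, and cosine similarity reduces to a plain dot product. The sketch below checks that identity on toy 3-dimensional vectors; the vectors and helper names are made up for illustration and stand in for real model embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = [1.0, 2.0, 2.0]  # toy "query embedding"
d = [2.0, 0.0, 1.0]  # toy "document embedding"

# Both expressions give the same value (up to floating-point error).
print(cosine(q, d))
print(dot(l2_normalize(q), l2_normalize(d)))
```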

Run the demo for all queries (default semantic weight alpha=0.8)

for q in queries:
    show_results(q, k=3, alpha=0.8)

Discussion and example behavior

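  • BM25 relies on token overlap and term importance. It may favor documents that share surface words with the query even if the meaning differs, and it scores a document 0 when none of the query tokens appear in it. Because this demo tokenizes with a plain whitespace split, punctuation stays attached to tokens, so the query token "2fa?" does not match "(2fa)" in the corpus.
  • SentenceTransformer returns semantically similar results by embedding meaning, so it better handles synonyms and paraphrases (e.g., matching “2FA” with “two-factor authentication”).
  • The hybrid approach blends both signals using alpha:
    • alpha > 0.5 favors semantic matching.
    • alpha < 0.5 favors BM25.
    • alpha = 0.5 weights both signals equally.
  • Normalizing both score arrays to [0,1] before combining puts them on a common scale, so a simple weighted linear blend is not dominated by whichever signal has the larger raw range. The small epsilon (1e-9) prevents division by zero when all scores are identical; in that case the normalized array is all zeros.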
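
To make "token overlap and term importance" concrete, here is a minimal pure-Python sketch of Okapi-style BM25 scoring. It uses the non-negative IDF variant log(1 + (N - n + 0.5) / (n + 0.5)) and common parameter values k1=1.5, b=0.75; these are assumptions for illustration, and the IDF details may differ from what rank_bm25 computes internally. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the tokenized query."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs
    # Document frequency: in how many documents each token appears.
    df = Counter(t for d in corpus_tokens for t in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue  # tokens absent from the document contribute nothing
            idf = math.log(1.0 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = k1 * (1.0 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1.0) / (tf[t] + length_norm)
        scores.append(score)
    return scores

corpus = [
    "our pto policy covers paid time off and sick leave".split(),
    "doctor appointment scheduling policy".split(),
    "kubernetes ingress configuration".split(),
]
scores = bm25_scores("sick leave policy".split(), corpus)
print([round(s, 3) for s in scores])
```

The first document matches all three query tokens and ranks highest, the second matches only the common token "policy", and the third shares no tokens and scores exactly 0.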

[Image: Jupyter notebook output comparing BM25, semantic (SentenceTransformer), and hybrid top-k results, with scores, for the HbA1c and sick-leave queries.]

Sample (cleaned) output for illustration

QUERY: How do I set up 2FA?
============================================================
BM25 top-k:
  [3] 1.487  How to fix engine misfires caused by bad spark plugs.
  [0] 0.000  Enable two-factor authentication (2FA) in your account settings to add an extra security step.
  [5] 0.000  Configure MFA with authenticator apps.

Semantic (SentenceTransformer) top-k:
  [0] 0.689  Enable two-factor authentication (2FA) in your account settings to add an extra security step.
  [5] 0.583  Configure MFA with authenticator apps.
  [3] 0.064  How to fix engine misfires caused by bad spark plugs.

Hybrid (BM25 + Semantic) top-k (alpha=0.8):
  [0] 0.720  Enable two-factor authentication (2FA) in your account settings to add an extra security step.
  [5] 0.670  Configure MFA with authenticator apps.
  [3] 0.574  How to fix engine misfires caused by bad spark plugs.

Tuning guidance

Tune alpha and experiment with different sentence-transformer models (for example, all-MiniLM-L6-v2 for speed or multi-qa-MiniLM-L6-cos-v1 for QA-style retrieval). Use a labeled validation set (queries with known correct docs) to measure precision/recall and choose alpha and model for your dataset. If BM25 returns many identical scores (common in small corpora), the semantic signal typically helps; if semantic matching over-generalizes in your domain, increase BM25 weight.
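
One way to choose alpha is a small grid sweep over a labeled validation set. The sketch below assumes scores that are already scaled to [0,1] and one known-correct document per query; the `validation` data and the `top1_accuracy` helper are hypothetical placeholders to be replaced with scores from your own pipeline.

```python
# Hypothetical validation data: per query, scaled scores for three documents
# and the index of the known-correct document.
validation = [
    {"semantic": [1.0, 0.6, 0.1], "bm25": [0.0, 0.0, 1.0], "relevant": 0},
    {"semantic": [0.2, 1.0, 0.0], "bm25": [1.0, 0.9, 0.0], "relevant": 1},
    {"semantic": [0.9, 0.0, 1.0], "bm25": [1.0, 0.0, 0.1], "relevant": 0},
]

def top1_accuracy(alpha):
    """Fraction of queries whose top hybrid result is the relevant doc."""
    hits = 0
    for ex in validation:
        hybrid = [alpha * s + (1.0 - alpha) * b
                  for s, b in zip(ex["semantic"], ex["bm25"])]
        best = max(range(len(hybrid)), key=hybrid.__getitem__)
        hits += int(best == ex["relevant"])
    return hits / len(validation)

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
results = {a: top1_accuracy(a) for a in alphas}
best_alpha = max(results, key=results.get)
print(results)
print("best alpha:", best_alpha)
```

On this toy data neither pure BM25 (alpha=0) nor pure semantic search (alpha=1) gets every query right, while an intermediate blend does. Real validation sets are rarely this clean, so treat the sweep as a procedure rather than a result.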

Switching the embedding model (example)

# Example: use a QA-tuned model
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

# Re-run show_results with the new embeddings:
for q in queries:
    show_results(q, k=3, alpha=0.8)

Quick comparison

| Method | Strengths | When to use |
|---|---|---|
| BM25 | Fast, interpretable, strong on exact matches and term importance | Short queries, domain with consistent terminology |
| Semantic (bi-encoder) | Handles synonyms, paraphrase, and deeper meaning | Natural-language queries, varied vocabulary |
| Hybrid (weighted) | Combines both signals, tunable with alpha | Real-world retrieval where both surface form and meaning matter |

Final notes

  • This demo uses a very small toy corpus for illustration. On larger corpora you’ll get more stable BM25 distributions and richer semantic matches.
  • Keep the retrieval pipeline configurable: alpha, k, and model selection should be part of your evaluation loop.
  • Validate hybrid weighting with representative queries and metrics (e.g., recall@k, MRR) before deploying to production.
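
The metrics named above are short to implement. The following is a minimal sketch of recall@k and MRR over ranked lists of document indices; the function names and the example `runs` are hypothetical and only illustrate the definitions.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant doc ids that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    """Mean of 1/rank of the first relevant doc per query (0 if none found)."""
    total = 0.0
    for ranked, relevant in runs:
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Hypothetical rankings (doc ids, best first) paired with relevant ids.
runs = [
    ([0, 5, 3], {0, 5}),  # first relevant doc at rank 1
    ([3, 1, 2], {2}),     # first relevant doc at rank 3
    ([4, 6, 1], {0}),     # no relevant doc retrieved
]

print(recall_at_k(runs[0][0], runs[0][1], k=1))  # 0.5: one of two relevant docs in top-1
print(mean_reciprocal_rank(runs))                # (1 + 1/3 + 0) / 3
```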