Demonstration comparing BM25 token matching with sentence-transformer semantic retrieval and a weighted hybrid ranker, with runnable Jupyter notebook, code, and tuning guidance.
In this lesson we compare classic BM25 retrieval with true semantic search (using sentence-transformers), then build a simple hybrid ranker that blends both signals. This example is compact, corrected, and runnable in a Jupyter-friendly notebook.
BM25 is a token-statistics method that excels at matching exact terms and weighting important tokens, but it can miss synonyms, paraphrases, and deeper meaning. A bi-encoder (sentence-transformers) maps queries and documents to vectors and uses cosine similarity to retrieve semantically similar items. A hybrid approach combines both strengths and often yields more robust retrieval for real applications.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

docs = [
    "Enable two-factor authentication (2FA) in your account settings to add an extra security step.",
    "HbA1c measures long-term glucose; talk to your physician about tests for glycated hemoglobin.",
    "Our PTO policy covers paid time off for vacations and sick leave.",
    "How to fix engine misfires caused by bad spark plugs.",
    "Kubernetes Ingress configuration for path-based routing.",
    "Configure MFA with authenticator apps.",
    "Doctor appointment scheduling policy.",
]
queries = ["How do I set up 2FA?", "What does HbA1c mean?", "sick leave policy?"]

# Prepare BM25 (lowercase, whitespace tokenization)
tokenized_corpus = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized_corpus)

# Choose a model (fast general-purpose)
model = SentenceTransformer('all-MiniLM-L6-v2')
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
BM25 relies on token overlap and term importance. It may favor documents that share surface words with the query (even if the meaning differs).
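To make the token-overlap limitation concrete, here is a minimal sketch that compares lowercase whitespace tokenization (the scheme used for the corpus above) with a simple regex tokenizer. The `regex_tokens` helper is a hypothetical alternative introduced only for illustration; it is not part of rank_bm25. The sketch assumes the query is tokenized the same way as the documents.

```python
import re

query = "How do I set up 2FA?"
doc = "Enable two-factor authentication (2FA) in your account settings to add an extra security step."

def whitespace_tokens(text):
    # Same scheme as the corpus preparation: lowercase, split on whitespace.
    return text.lower().split()

def regex_tokens(text):
    # Hypothetical alternative: keep alphanumeric runs only, dropping punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

ws_overlap = set(whitespace_tokens(query)) & set(whitespace_tokens(doc))
rx_overlap = set(regex_tokens(query)) & set(regex_tokens(doc))

print(sorted(ws_overlap))  # [] because "2fa?" and "(2fa)" are different tokens
print(sorted(rx_overlap))  # ['2fa']
```

With whitespace splitting, punctuation stays attached to the token, so the query and the most relevant document share no tokens at all. Stripping punctuation is one possible mitigation, but it changes BM25 scores and should be validated on your own data.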
SentenceTransformer returns semantically similar results by embedding meaning, so it better handles synonyms and paraphrases (e.g., mapping “2FA” to “two-factor authentication”).
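The ranking step behind embedding retrieval can be sketched without any model: normalize the vectors, then rank documents by dot product, which equals cosine similarity for unit-length vectors. The 3-dimensional vectors and the document names below are made up for illustration; real sentence-transformer embeddings have hundreds of dimensions and come from `model.encode`.

```python
import math

def l2_normalize(vec):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    # For unit-length vectors, cosine similarity is just the dot product.
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

# Made-up toy embeddings (assumption: not produced by any real model).
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "2fa_doc": [0.8, 0.2, 0.1],
    "engine_doc": [0.0, 0.1, 0.9],
}

ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
print(ranked)  # ['2fa_doc', 'engine_doc']
```

This is why passing `normalize_embeddings=True` to `encode` is convenient: once the embeddings are unit length, a plain dot product gives the cosine score.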
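The hybrid approach blends both signals with a weight alpha between 0 and 1: the semantic score is multiplied by alpha and the BM25 score by (1 - alpha). Values:
alpha > 0.5 favors semantic matching.
alpha < 0.5 favors BM25.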
Normalizing both score arrays to [0,1] before combining allows a simple weighted linear blend that is robust to differing score scales. The small epsilon (1e-9) prevents division-by-zero for degenerate score distributions.
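The normalization and blend described here can be sketched as follows. This is one plausible form under stated assumptions, not necessarily the notebook's exact code: the helper names `minmax` and `hybrid_scores` are hypothetical, the scores are made-up numbers rather than model output, and plain lists are used so the sketch runs without dependencies (the same arithmetic applies to NumPy arrays).

```python
def minmax(scores, eps=1e-9):
    # Rescale scores to roughly [0, 1]; eps avoids division by zero
    # when every score is identical.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) for s in scores]

def hybrid_scores(bm25_scores, semantic_scores, alpha=0.8):
    # alpha weights the semantic signal, (1 - alpha) weights BM25.
    b = minmax(bm25_scores)
    s = minmax(semantic_scores)
    return [alpha * si + (1 - alpha) * bi for bi, si in zip(b, s)]

# Made-up scores for three documents (assumption: illustrative only).
bm25_raw = [0.0, 2.0, 0.0]
semantic_raw = [0.7, 0.1, 0.6]

blended = hybrid_scores(bm25_raw, semantic_raw, alpha=0.8)
ranking = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
print(ranking)  # [0, 2, 1]

# Degenerate case: identical scores normalize to all zeros instead of failing.
print(minmax([3.0, 3.0, 3.0]))  # [0.0, 0.0, 0.0]
```

Note that min-max scaling is relative to the candidate set: the top document for each signal always receives a normalized score near 1, even when its raw score is weak.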
QUERY: How do I set up 2FA?
============================================================
BM25 top-k:
  [3] 1.487 How to fix engine misfires caused by bad spark plugs.
  [0] 0.000 Enable two-factor authentication (2FA) in your account settings to add an extra security step.
  [5] 0.000 Configure MFA with authenticator apps.
Semantic (SentenceTransformer) top-k:
  [0] 0.689 Enable two-factor authentication (2FA) in your account settings to add an extra security step.
  [5] 0.583 Configure MFA with authenticator apps.
  [3] 0.064 How to fix engine misfires caused by bad spark plugs.
Hybrid (BM25 + Semantic) top-k (alpha=0.8):
  [0] 0.720 Enable two-factor authentication (2FA) in your account settings to add an extra security step.
  [5] 0.670 Configure MFA with authenticator apps.
  [3] 0.574 How to fix engine misfires caused by bad spark plugs.
Tune alpha and experiment with different sentence-transformer models (for example, all-MiniLM-L6-v2 for speed or multi-qa-MiniLM-L6-cos-v1 for QA-style retrieval). Use a labeled validation set (queries with known correct docs) to measure precision/recall and choose alpha and model for your dataset. If BM25 returns many identical scores (common in small corpora), the semantic signal typically helps; if semantic matching over-generalizes in your domain, increase BM25 weight.
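A minimal sketch of choosing alpha on a labeled validation set is shown below. Everything in it is an assumption for illustration: the two validation entries use made-up scores, the metric is recall@1, and the helper names are hypothetical. In practice the score lists would come from BM25 and the embedding model, and the validation set would be much larger.

```python
def minmax(scores, eps=1e-9):
    # Rescale to roughly [0, 1]; eps guards against identical scores.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) for s in scores]

def top_k(scores, k):
    # Indices of the k highest-scoring documents.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def recall_at_k(ranked, relevant, k):
    # Fraction of relevant documents that appear in the top k.
    return len(set(ranked[:k]) & relevant) / len(relevant)

# Made-up validation data: per-query scores and known relevant doc indices.
validation = [
    {"bm25": [0.0, 1.5, 0.0], "semantic": [0.7, 0.1, 0.6], "relevant": {0}},
    {"bm25": [2.0, 0.5, 0.0], "semantic": [0.4, 0.5, 0.1], "relevant": {0}},
]

def mean_recall(alpha, k=1):
    total = 0.0
    for item in validation:
        b, s = minmax(item["bm25"]), minmax(item["semantic"])
        blended = [alpha * si + (1 - alpha) * bi for bi, si in zip(b, s)]
        total += recall_at_k(top_k(blended, k), item["relevant"], k)
    return total / len(validation)

results = {alpha: mean_recall(alpha) for alpha in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)}
best_alpha = max(results, key=results.get)
print(results)
print(best_alpha)  # 0.6 on this toy data
```

On this toy data, BM25 alone misses the first query and the semantic signal alone misses the second, so only an intermediate alpha retrieves both. Real datasets will not be this clean, which is why the sweep should be run on your own labeled queries.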
# Example: use a QA-tuned model
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

# Re-run show_results with the new embeddings:
for q in queries:
    show_results(q, k=3, alpha=0.8)