Compares keyword lexical search and sentence-transformer semantic search with code examples, illustrating differences, evaluation, and recommending hybrid retrieval with semantic re-ranking for better recall and precision.
This tutorial compares a classic lexical retriever (TF-IDF / BM25 via Whoosh) with a lightweight semantic retriever built on Sentence Transformers. The aim is to show how semantic search can surface conceptually related documents (for example, documents about “motor diagnostics”) even when the query uses different wording (for example, “engine troubleshooting”).

What you’ll learn:
How to run a simple Whoosh keyword search.
How to embed text with a Sentence Transformers model and rank by cosine similarity.
How the two approaches differ in practice and how to compare them side-by-side.
We use a tiny corpus of six short documents and titles to make the difference between keyword and semantic retrieval obvious:
docs = [
    "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
    "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Valve timing issues often masquerade as rough idle—inspect the timing belt and camshaft alignment.",
    "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
    "Understanding airflow: clogged filters, intake leaks, and MAF sensor failures reduce performance.",
    "Basic oil system checks: pressure light warnings, pump failures, and choosing the right viscosity.",
]

titles = [
    "Engine Troubleshooting 101",
    "Piston Wear & Compression",
    "Valve Timing Problems",
    "Motor Diagnostics Checklist",
    "Airflow & Intake Issues",
    "Oil System Basics",
]
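The snippets that follow expect a query string `query_text` and a list `kw_hits` of (title, score) pairs from the keyword step. As a rough stand-in for the Whoosh index, here is a minimal standard-library sketch of keyword retrieval. It is not Whoosh's API or its scoring: `tokenize`, `keyword_search`, and the `sample_*` names are hypothetical, the IDF formula is one common variant, and the AND-matching rule (a document must contain every query term) is an assumption, so the scores will differ from Whoosh's.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(query, docs, titles):
    """Return (title, score) pairs for documents containing ALL query terms.

    Scoring is a simple TF-IDF sum; this is an illustrative assumption,
    not the scoring used by Whoosh.
    """
    query_terms = set(tokenize(query))
    if not query_terms:
        return []

    doc_counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    hits = []
    for title, counts in zip(titles, doc_counts):
        # AND semantics: skip documents that miss any query term
        if not all(term in counts for term in query_terms):
            continue
        score = 0.0
        for term in query_terms:
            doc_freq = sum(1 for c in doc_counts if term in c)
            idf = math.log(n_docs / doc_freq) + 1.0
            score += counts[term] * idf
        hits.append((title, score))

    # Highest score first; ties keep corpus order
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Three documents taken from the corpus above, so the block runs on its own
sample_docs = [
    "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
    "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
]
sample_titles = [
    "Engine Troubleshooting 101",
    "Piston Wear & Compression",
    "Motor Diagnostics Checklist",
]

query_text = "engine troubleshooting"
kw_hits = keyword_search(query_text, sample_docs, sample_titles)
print(kw_hits)
```

In the notebook, calling `keyword_search(query_text, docs, titles)` on the full corpus would play the same role as the Whoosh search.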
Explanation: The lexical retriever returns the exact-match document containing the words “engine troubleshooting”. Documents that convey the same concept but use different words (for example, “motor diagnostics”) do not appear because they lack lexical overlap with the query.
Embed the documents and the query using a Sentence Transformers model (all-MiniLM-L6-v2), then rank documents by cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load a small, fast sentence-transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Embed documents and the query
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode([query_text], normalize_embeddings=True)

# Cosine similarity scores
sims = cosine_similarity(query_embedding, doc_embeddings).flatten()

# Rank by semantic similarity (descending)
sem_hits_idx = np.argsort(-sims)
sem_hits = [(titles[i], float(sims[i])) for i in sem_hits_idx]

# Show top 5 semantic hits
sem_hits[:5]
Explanation: Semantic search returns several related documents beyond the exact lexical match. Because embeddings capture conceptual similarity, “Motor Diagnostics Checklist” and “Valve Timing Problems” appear as relevant even though they do not share exact wording with the query.
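Because the embeddings are requested with `normalize_embeddings=True`, every vector has unit length, and cosine similarity then reduces to a plain dot product. The sketch below illustrates that equivalence with the standard library only. The 3-dimensional vectors are made up for illustration and are not real model embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors); the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (assumed values, not real embeddings)
toy_query = [0.9, 0.1, 0.3]
toy_docs = {
    "doc_a": [0.8, 0.2, 0.4],
    "doc_b": [0.1, 0.9, 0.2],
}

query_unit = l2_normalize(toy_query)
for name, vec in toy_docs.items():
    full = cosine(toy_query, vec)               # full cosine formula
    shortcut = dot(query_unit, l2_normalize(vec))  # dot product of unit vectors
    print(name, round(full, 6), round(shortcut, 6))

# Ranking by the dot product of unit vectors gives the cosine ordering
ranking = sorted(
    toy_docs,
    key=lambda name: dot(query_unit, l2_normalize(toy_docs[name])),
    reverse=True,
)
print(ranking)
```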
Normalize the results into DataFrames and merge them to compare ranks and scores across both methods. The Whoosh/lexical search may return only exact matches, while the semantic search gives a ranked list for all documents.
import pandas as pd

# Normalize/pretty print both lists with rank
kw_df = pd.DataFrame(kw_hits, columns=["Title", "TFIDF_Score"])
kw_df["KW_Rank"] = range(1, len(kw_df) + 1)

sem_df = pd.DataFrame(sem_hits, columns=["Title", "CosineSim"])
sem_df["SEM_Rank"] = range(1, len(sem_df) + 1)

# Merge on title to show both ranks together
comparison = pd.merge(sem_df, kw_df, on="Title", how="outer")

# Sort by semantic rank to highlight the "meaning" ordering
comparison_sorted = comparison.sort_values(by="SEM_Rank", na_position="last")
comparison_sorted[["Title", "SEM_Rank", "CosineSim", "KW_Rank", "TFIDF_Score"]]
Example merged result (conceptual):
Title                        SEM_Rank  CosineSim  KW_Rank  TFIDF_Score
Engine Troubleshooting 101   1         0.677818   1        3.688410
Valve Timing Problems        2         0.454809   NaN      NaN
Motor Diagnostics Checklist  3         0.450508   NaN      NaN
Oil System Basics            4         0.321720   NaN      NaN
Piston Wear & Compression    5         0.239381   NaN      NaN
Airflow & Intake Issues      6         0.087xxx   NaN      NaN
The NaNs indicate documents not returned by the lexical keyword search.
Practical pattern: use a hybrid pipeline. First retrieve a broad candidate set quickly (lexical methods like TF-IDF/BM25 or a fast ANN index), then re-rank that subset with a semantic model for better precision. This balances speed, recall, and semantic coverage.
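The two-stage pattern can be sketched with the standard library alone. Everything here is illustrative: `lexical_candidates`, `hybrid_search`, and `toy_semantic_score` are hypothetical names, the first stage uses simple term overlap rather than TF-IDF/BM25, and the "semantic" scorer is a tiny hand-written concept map standing in for embedding similarity. In a real pipeline, the `semantic_score` callable would compute cosine similarity between model embeddings instead.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_candidates(query, docs, top_k):
    """Stage 1: indices of up to top_k docs sharing at least one query term."""
    query_terms = set(tokenize(query))
    scored = []
    for idx, doc in enumerate(docs):
        overlap = len(query_terms & set(tokenize(doc)))
        if overlap > 0:
            scored.append((idx, overlap))
    # Stable sort: ties keep corpus order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in scored[:top_k]]


def hybrid_search(query, docs, titles, semantic_score, top_k=10):
    """Stage 2: re-rank the lexical candidates with a semantic scorer."""
    candidates = lexical_candidates(query, docs, top_k)
    reranked = sorted(
        candidates,
        key=lambda i: semantic_score(query, docs[i]),
        reverse=True,
    )
    return [titles[i] for i in reranked]


# Toy stand-in for embedding similarity (assumption, for illustration only):
# words mapped to the same concept count as a match.
TOY_CONCEPTS = {
    "engine": "powertrain",
    "motor": "powertrain",
    "troubleshooting": "diagnosis",
    "diagnostics": "diagnosis",
}


def toy_semantic_score(query, doc):
    """Jaccard overlap of the concepts mentioned in the query and the doc."""
    q = {TOY_CONCEPTS[t] for t in tokenize(query) if t in TOY_CONCEPTS}
    d = {TOY_CONCEPTS[t] for t in tokenize(doc) if t in TOY_CONCEPTS}
    union = q | d
    return len(q & d) / len(union) if union else 0.0


sample_docs = [
    "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
    "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
]
sample_titles = [
    "Engine Troubleshooting 101",
    "Piston Wear & Compression",
    "Motor Diagnostics Checklist",
]

query = "engine diagnostics"
lexical_order = [sample_titles[i] for i in lexical_candidates(query, sample_docs, top_k=3)]
hybrid_order = hybrid_search(query, sample_docs, sample_titles, toy_semantic_score, top_k=3)
print(lexical_order)
print(hybrid_order)
```

Each sample document shares exactly one term with the query, so the lexical stage cannot separate them, while the re-ranking stage moves "Motor Diagnostics Checklist" above "Piston Wear & Compression". Note that the re-ranker can only reorder what the first stage retrieved, which is why the candidate set should be broad.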
Lexical retrievers like Whoosh excel at exact lexical matches and are low-latency and interpretable.
Semantic retrieval using sentence embeddings recovers conceptually related documents even with different surface wording.
A hybrid system (lexical retrieval for recall + semantic re-ranking for precision) is a practical production pattern that often yields the best results.
You can reuse the notebook snippets above to experiment with different models, retrieval thresholds, or corpora to see how lexical and semantic methods compare in your domain.
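As one way to experiment with retrieval thresholds, the sketch below keeps only semantic hits whose cosine similarity meets a minimum score. The function name `filter_by_threshold` and the cutoff of 0.40 are assumptions chosen for illustration; the scores are the fully reported values from the example table above, and a suitable cutoff would need tuning for your own model and corpus.

```python
def filter_by_threshold(hits, min_score):
    """Keep (title, score) pairs whose score is at least min_score."""
    return [(title, score) for title, score in hits if score >= min_score]


# Scores copied from the example comparison table
example_hits = [
    ("Engine Troubleshooting 101", 0.677818),
    ("Valve Timing Problems", 0.454809),
    ("Motor Diagnostics Checklist", 0.450508),
    ("Oil System Basics", 0.321720),
    ("Piston Wear & Compression", 0.239381),
]

# 0.40 is an arbitrary illustrative cutoff, not a recommended value
kept = filter_by_threshold(example_hits, min_score=0.40)
for title, score in kept:
    print(f"{title}: {score:.3f}")
```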