This tutorial compares a classic lexical retriever (TF-IDF / BM25 via Whoosh) with a lightweight semantic retriever using Sentence Transformers. The aim is to demonstrate how semantic search can surface conceptually related documents (for example, documents about “motor diagnostics”) even when the query uses different wording (for example, “engine troubleshooting”).

What you’ll learn:
  • How to run a simple Whoosh keyword search.
  • How to embed text with a Sentence Transformers model and rank by cosine similarity.
  • How the two approaches differ in practice and how to compare them side-by-side.

Prerequisites

Install the required Python packages (run in a notebook or virtual environment):
%pip install whoosh sentence-transformers scikit-learn pandas
If you plan to run this in a production-like environment, use a dedicated virtual environment and pin package versions to ensure reproducibility.

Corpus: small, focused, intentionally non-overlapping

We use a tiny corpus of six short documents and titles to make the difference between keyword and semantic retrieval obvious:
docs = [
    "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
    "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Valve timing issues often masquerade as rough idle—inspect the timing belt and camshaft alignment.",
    "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
    "Understanding airflow: clogged filters, intake leaks, and MAF sensor failures reduce performance.",
    "Basic oil system checks: pressure light warnings, pump failures, and choosing the right viscosity."
]

titles = [
    "Engine Troubleshooting 101",
    "Piston Wear & Compression",
    "Valve Timing Problems",
    "Motor Diagnostics Checklist",
    "Airflow & Intake Issues",
    "Oil System Basics"
]

1) Keyword search (Whoosh — lexical retrieval)

Build a temporary Whoosh index storing title and content fields, then run a keyword query. The example query is "engine troubleshooting".
import tempfile
# Whoosh imports
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Create temp index
index_dir = tempfile.mkdtemp()
schema = Schema(title=ID(stored=True), content=TEXT(stored=True))
ix = index.create_in(index_dir, schema)

# Add documents
writer = ix.writer()
for t, d in zip(titles, docs):
    writer.add_document(title=t, content=d)
writer.commit()

# Run a keyword query (lexical retrieval)
query_text = "engine troubleshooting"

with ix.searcher() as searcher:
    parser = QueryParser("content", ix.schema)
    q = parser.parse(query_text)
    results = searcher.search(q, limit=10)
    kw_hits = [(r['title'], float(r.score)) for r in results]

# (Optional) clean up the temporary index directory when done:
# import shutil
# shutil.rmtree(index_dir)

kw_hits
Example lexical output:
[('Engine Troubleshooting 101', 3.688410483089193)]
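Explanation: Whoosh's QueryParser combines query terms with AND by default, so the lexical retriever returns only the document that contains both “engine” and “troubleshooting”. “Piston Wear & Compression” mentions “engine” but not “troubleshooting”, so it is excluded. Documents that convey the same concept in different words (for example, “motor diagnostics”) share no terms with the query at all and cannot appear.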
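The matching behaviour can be reproduced without Whoosh. The sketch below is a simplified stand-in, not Whoosh's analyzer or scoring: it tokenizes three of the corpus documents with a regular expression and compares requiring every query term (as Whoosh's default parser does) with requiring any term. The names `corpus`, `tokenize`, and `match` are illustrative only.

```python
import re

# Three documents copied from the corpus above.
corpus = {
    "Engine Troubleshooting 101": "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
    "Piston Wear & Compression": "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Motor Diagnostics Checklist": "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
}

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification of a real analyzer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match(query, require_all=True):
    """Return titles whose text contains all (AND) or any (OR) of the query terms."""
    terms = tokenize(query)
    combine = all if require_all else any
    return [
        title
        for title, text in corpus.items()
        if combine(term in tokenize(text) for term in terms)
    ]

and_hits = match("engine troubleshooting", require_all=True)
or_hits = match("engine troubleshooting", require_all=False)
print("AND:", and_hits)
print("OR: ", or_hits)
```

Relaxing the match to OR brings in the piston document because it contains “engine”, but the motor diagnostics document is still missing: no amount of query relaxation helps when the vocabulary does not overlap.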

2) Semantic retrieval (Sentence Transformers + cosine similarity)

Embed the documents and the query using a Sentence Transformers model (all-MiniLM-L6-v2), then rank documents by cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load a small, fast sentence-transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Embed documents and the query
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode([query_text], normalize_embeddings=True)

# Cosine similarity scores
sims = cosine_similarity(query_embedding, doc_embeddings).flatten()

# Rank by semantic similarity (descending)
sem_hits_idx = np.argsort(-sims)
sem_hits = [(titles[i], float(sims[i])) for i in sem_hits_idx]

# Show top 5 semantic hits
sem_hits[:5]
Example semantic-ranking output:
[('Engine Troubleshooting 101', 0.6778184771537781),
 ('Valve Timing Problems', 0.4548088024031136),
 ('Motor Diagnostics Checklist', 0.4505075285693745),
 ('Oil System Basics', 0.32172026519309644),
 ('Piston Wear & Compression', 0.23938072054235867)]
Explanation: Semantic search returns several related documents beyond the exact lexical match. Because embeddings capture conceptual similarity, “Motor Diagnostics Checklist” and “Valve Timing Problems” appear as relevant even though they do not share exact wording with the query.
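To make the ranking step concrete, here is a minimal sketch of cosine similarity computed by hand. The three-dimensional vectors are hypothetical toy values chosen for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and `cosine`, `toy_query`, and `toy_docs` are names introduced here, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; only the direction of each vector matters.
toy_query = [1.0, 1.0, 0.0]
toy_docs = {
    "close in meaning": [2.0, 2.0, 0.0],  # same direction as the query
    "partly related": [1.0, 0.0, 0.0],    # 45 degrees away from the query
    "unrelated": [0.0, 0.0, 3.0],         # orthogonal to the query
}

# Score every document and sort by similarity, highest first.
ranked = sorted(
    ((name, cosine(toy_query, vec)) for name, vec in toy_docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Every document receives a score, which is why the semantic retriever returns a full ranking rather than a filtered list of matches.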

3) Side-by-side comparison (TF-IDF vs Semantic)

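Load both result lists into DataFrames and merge them on title to compare ranks and scores across both methods. Note that the column named TFIDF_Score holds the Whoosh score, which comes from Whoosh's default BM25F weighting rather than plain TF-IDF. The lexical search returns only documents that match the query terms, while the semantic search produces a ranked list of all documents.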
import pandas as pd

# Normalize/pretty print both lists with rank
kw_df = pd.DataFrame(kw_hits, columns=["Title", "TFIDF_Score"])
kw_df["KW_Rank"] = range(1, len(kw_df) + 1)

sem_df = pd.DataFrame(sem_hits, columns=["Title", "CosineSim"])
sem_df["SEM_Rank"] = range(1, len(sem_df) + 1)

# Merge on title to show both ranks together
comparison = pd.merge(sem_df, kw_df, on="Title", how="outer")

# Sort by semantic rank to highlight the "meaning" ordering
comparison_sorted = comparison.sort_values(by="SEM_Rank", na_position="last")

comparison_sorted[["Title", "SEM_Rank", "CosineSim", "KW_Rank", "TFIDF_Score"]]
Example merged result (conceptual):
Title                        SEM_Rank  CosineSim    KW_Rank  TFIDF_Score
Engine Troubleshooting 101   1         0.677818     1        3.688410
Valve Timing Problems        2         0.454809     NaN      NaN
Motor Diagnostics Checklist  3         0.450508     NaN      NaN
Oil System Basics            4         0.321720     NaN      NaN
Piston Wear & Compression    5         0.239381     NaN      NaN
Airflow & Intake Issues      6         0.087xxx     NaN      NaN
The NaNs indicate documents not returned by the lexical keyword search.
Practical pattern: use a hybrid pipeline. First retrieve a broad candidate set quickly (lexical methods like TF-IDF/BM25 or a fast ANN index), then re-rank that subset with a semantic model for better precision. This balances speed, recall, and semantic coverage.
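A minimal sketch of the two-stage pattern, under stated assumptions: stage 1 is a pure-Python term-overlap filter standing in for Whoosh, and stage 2 looks up the cosine similarities from the example output above (rounded) instead of calling a model. The function names `lexical_candidates` and `rerank` are hypothetical, and the corpus order is shuffled on purpose so the effect of re-ranking is visible.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_candidates(query, corpus):
    """Stage 1: keep titles whose text shares at least one term with the query (corpus order)."""
    terms = tokenize(query)
    return [title for title, text in corpus.items() if terms & tokenize(text)]

def rerank(candidates, semantic_score):
    """Stage 2: reorder the candidates by a semantic score, highest first."""
    return sorted(candidates, key=semantic_score, reverse=True)

corpus = {
    "Piston Wear & Compression": "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Motor Diagnostics Checklist": "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
    "Engine Troubleshooting 101": "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
}

# Stand-in for a model call: similarities taken from the example semantic output.
example_sims = {
    "Engine Troubleshooting 101": 0.6778,
    "Motor Diagnostics Checklist": 0.4505,
    "Piston Wear & Compression": 0.2394,
}

candidates = lexical_candidates("engine troubleshooting", corpus)
final = rerank(candidates, lambda title: example_sims[title])
print("candidates:", candidates)
print("re-ranked: ", final)
```

The sketch also exposes a limit of this design: re-ranking can only reorder what stage 1 retrieved, so “Motor Diagnostics Checklist” never reaches the re-ranker here. If paraphrased matches matter, the candidate stage needs to be broad enough (or embedding-based) to include them.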

Quick comparison: Lexical vs Semantic vs Hybrid

Retriever Type                                        Strengths                                                 Typical Use Case
Lexical (TF-IDF / BM25, e.g., Whoosh)                 Fast, interpretable, exact match ranking                  Keyword-based search UIs, faceted search, queries where exact terms matter
Semantic (Sentence Transformers + cosine similarity)  Finds conceptually related content, robust to paraphrase  QA reranking, semantic discovery, content recommendation
Hybrid (lexical retrieval -> semantic re-rank)        High recall + high precision, scalable                    Large-scale production search pipelines, conversational agents, enterprise search

Tips for experimentation and scaling

  • Try larger Sentence Transformers models for improved semantic quality at the cost of latency.
  • For bigger corpora, use an approximate nearest neighbor (ANN) index (e.g., FAISS, Annoy, HNSW) to retrieve candidate embeddings efficiently.
  • Consider normalization strategies (L2-normalization vs. no normalization) depending on your similarity metric and model outputs.
  • When using Whoosh in production, evaluate BM25 configuration and tokenization for your domain language.
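The normalization tip can be illustrated with a small sketch. The two-dimensional vectors below are hypothetical toy values, and `l2_normalize` and `dot` are helper names introduced here. The point is that a raw dot product rewards vector length, whereas after L2-normalization the dot product equals cosine similarity and depends only on direction.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0]
long_off_topic = [3.0, 4.0]  # large magnitude, points mostly away from the query
short_on_topic = [0.9, 0.1]  # small magnitude, nearly parallel to the query

# Raw dot products: the longer vector wins despite pointing elsewhere.
raw = (dot(query, long_off_topic), dot(query, short_on_topic))

# Dot products after normalization: equivalent to cosine similarity.
normed = (
    dot(l2_normalize(query), l2_normalize(long_off_topic)),
    dot(l2_normalize(query), l2_normalize(short_on_topic)),
)

print("raw dot products:       ", raw)
print("normalized dot products:", normed)
```

This is why the embedding step earlier passes `normalize_embeddings=True`: with unit-length vectors, dot product and cosine similarity produce the same ranking, which matters when an ANN index is configured for inner-product search.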

Summary

  • Lexical retrievers like Whoosh excel at exact lexical matches and are low-latency and interpretable.
  • Semantic retrieval using sentence embeddings recovers conceptually related documents even with different surface wording.
  • A hybrid system (lexical retrieval for recall + semantic re-ranking for precision) is a practical production pattern that often yields the best results.
You can reuse the notebook snippets above to experiment with different models, retrieval thresholds, or corpora to see how lexical and semantic methods compare in your domain.
