In this lesson we demonstrate two common document retrieval approaches using a small automotive troubleshooting corpus:
  • Keyword search (TF-IDF) with Whoosh
  • Semantic retrieval using SentenceTransformers + cosine similarity
We’ll build a tiny corpus, run both methods on the same query, and compare the ranked results side by side so you can see how exact-match ranking (TF-IDF) and meaning-based retrieval (embeddings) differ.
Keywords: TF-IDF, semantic search, embeddings, Whoosh, SentenceTransformers, cosine similarity
Prerequisites (run once):
pip install whoosh sentence-transformers scikit-learn pandas
Corpus and titles used in this lesson:
docs = [
    "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
    "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Valve timing issues often masquerade as rough idle—inspect the timing belt and camshaft alignment.",
    "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
    "Understanding airflow: clogged filters, intake leaks, and MAF sensor failures reduce performance.",
    "Basic oil system checks: pressure light warnings, pump failures, and choosing the right viscosity."
]

titles = [
    "Engine Troubleshooting 101",
    "Piston Wear & Compression",
    "Valve Timing Problems",
    "Motor Diagnostics Checklist",
    "Airflow & Intake Issues",
    "Oil System Basics"
]
SentenceTransformers will download model weights on first use. If you’re running this in a restricted environment, pre-download models or set SENTENCE_TRANSFORMERS_HOME to a writable cache folder. For larger corpora consider batching embeddings to avoid memory spikes.
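The batching advice above can be sketched roughly as follows. This is a minimal illustration, not part of the lesson's code: iter_batches, encode_in_batches, and fake_encode are hypothetical names, and fake_encode is a stand-in so the snippet runs without downloading a model. With a real model you would pass something like model.encode as encode_fn, or simply use the batch_size argument that model.encode already accepts.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    texts: Sequence[str],
    encode_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Encode `texts` one batch at a time so only one batch is processed per call."""
    vectors: List[List[float]] = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(encode_fn(batch))
    return vectors


def fake_encode(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in encoder: maps each text to a 1-d vector (its length)."""
    return [[float(len(text))] for text in batch]


sample_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
batches = list(iter_batches(sample_texts, 2))
vectors = encode_in_batches(sample_texts, fake_encode, batch_size=2)
print(len(batches), vectors)
```

Note that this sketch still keeps every vector in memory at the end; for a very large corpus you would more likely write each batch of vectors to disk or to a vector store instead of extending a list.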

1) Keyword search with Whoosh (TF-IDF)

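Whoosh is a pure-Python search library that indexes documents and scores matches with a TF-IDF-family ranking function. Its default weighting is BM25F; pass weighting=scoring.TF_IDF() (from whoosh import scoring) to ix.searcher() if you want classic TF-IDF. The example below creates a temporary Whoosh index, adds our documents, and runs a simple keyword query. Whoosh matches query terms against the indexed tokens and ranks the matching documents by how often each term occurs in the document and how rare it is across the corpus.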
# whoosh_keyword_search.py
import shutil
import tempfile
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Create temp index
index_dir = tempfile.mkdtemp()
schema = Schema(title=ID(stored=True), content=TEXT(stored=True))
ix = index.create_in(index_dir, schema)

# Add documents to index
writer = ix.writer()
for t, d in zip(titles, docs):
    writer.add_document(title=t, content=d)
writer.commit()

# Helper to run a keyword query
def whoosh_query(query_text, top_k=10):
    with ix.searcher() as searcher:
        parser = QueryParser("content", ix.schema)
        q = parser.parse(query_text)
        results = searcher.search(q, limit=top_k)
        return [(r["title"], r.score) for r in results]

# Examples
print("Whoosh results for 'engine troubleshooting':", whoosh_query("engine troubleshooting"))
print("Whoosh results for 'piston':", whoosh_query("piston"))

# Cleanup when finished (uncomment to remove index)
# shutil.rmtree(index_dir)
Example Whoosh outputs (your scores may vary slightly):
Whoosh results for 'engine troubleshooting': [('Engine Troubleshooting 101', 3.688410483089193)]
Whoosh results for 'piston': [('Piston Wear & Compression', 2.110439158172188)]
Notes on Whoosh behavior:
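  • Whoosh returns only documents that contain the query terms and ranks them with a TF-IDF-like score. The default QueryParser requires every term to match (AND), which is why “Piston Wear & Compression”, whose text mentions “engine” but not “troubleshooting”, is missing from the first result. Pass group=OrGroup (from whoosh.qparser) to QueryParser to match any term instead.
  • A conceptually relevant document that shares no tokens with the query (e.g., “Motor Diagnostics Checklist” for the query “engine troubleshooting”) is never returned, however closely related it is.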
This example stores the index in a temporary directory. In production or repeated runs, persist the index directory or rebuild as needed. Remember to clean up temporary files to avoid disk bloat (shutil.rmtree(index_dir)).
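To make the scoring idea concrete, here is a minimal pure-Python sketch of TF-IDF ranking. It is an illustration under stated assumptions, not Whoosh's implementation: the tokenize and tfidf_scores helpers and the toy_docs corpus are hypothetical, the IDF formula (log(N / df) + 1) is one common variant, and a document scores on any matching term (OR semantics) rather than requiring all of them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(query, documents):
    """Score each document as the sum of tf * idf over the query terms it contains."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in doc_tokens:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term_freq[term]:
                # Assumed IDF variant; rarer terms get a larger weight.
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += term_freq[term] * idf
        scores.append(score)
    return scores


# Hypothetical shortened corpus, loosely based on the lesson's documents.
toy_docs = [
    "engine troubleshooting: check fuel lines and spark",
    "piston wear shortens engine life",
    "motor diagnostics for intermittent power loss",
]
scores = tfidf_scores("engine troubleshooting", toy_docs)
print(scores)
```

The first document matches both terms and scores highest, the second matches only the common term “engine”, and the third shares no tokens with the query, so its score is zero no matter how related it is.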

2) Semantic search with SentenceTransformers (cosine similarity)

Semantic search embeds documents and queries into a vector space and ranks by vector similarity (cosine). This approach captures conceptual relationships beyond exact token overlap. We use the paraphrase-MiniLM-L6-v2 model for compact, fast embeddings.
# semantic_search.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Query used for comparison
query_text = "engine troubleshooting"

# Embed documents and query (normalized embeddings)
doc_embeddings = model.encode(docs, convert_to_numpy=True, normalize_embeddings=True)
query_embedding = model.encode([query_text], convert_to_numpy=True, normalize_embeddings=True)

# Compute cosine similarity and rank (descending)
sims = cosine_similarity(query_embedding, doc_embeddings).flatten()
sem_hits_idx = np.argsort(-sims)
sem_hits = [(titles[i], float(sims[i])) for i in sem_hits_idx]

# Top semantic hits
sem_hits[:6]
Example semantic ranking (scores will vary by model version and environment):
[
  ('Engine Troubleshooting 101', 0.84),
  ('Valve Timing Problems', 0.63),
  ('Motor Diagnostics Checklist', 0.61),
  ('Piston Wear & Compression', 0.46),
  ('Airflow & Intake Issues', 0.42),
  ('Oil System Basics', 0.35)
]
Why semantic search differs:
  • The embedding model captures conceptual similarity, so queries like “engine troubleshooting” will surface “motor diagnostics” and “valve timing” even without exact word overlap.
  • Semantic retrieval improves recall for related documents; TF-IDF provides more precision for literal matches.
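The ranking step itself is just vector arithmetic. The sketch below uses hand-written toy vectors (hypothetical values, not real embeddings) to show cosine similarity and why normalizing embeddings first, as normalize_embeddings=True does above, turns cosine similarity into a plain dot product.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-d "embeddings" (made-up numbers for illustration only).
query_vec = [1.0, 2.0, 0.0]
doc_vecs = {
    "close": [2.0, 4.0, 0.1],       # points in nearly the same direction
    "orthogonal": [0.0, 0.0, 3.0],  # shares no direction with the query
}

# Rank documents by cosine similarity to the query, highest first.
ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)

# For unit-length vectors, the dot product equals the cosine similarity.
q_unit = l2_normalize(query_vec)
d_unit = l2_normalize(doc_vecs["close"])
dot_of_units = sum(x * y for x, y in zip(q_unit, d_unit))
print(ranked, dot_of_units)
```

Because cosine similarity ignores vector length, the “close” vector ranks first even though its magnitude is roughly twice that of the query.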

3) Compare TF-IDF and semantic rankings side by side

We can combine Whoosh (TF-IDF) hits and SentenceTransformers (cosine similarity) hits into a pandas DataFrame to compare ranks and scores. This lets you directly inspect differences in ordering and the presence/absence of documents in each result set.
# compare_rankings.py
import pandas as pd

# kw_hits: (title, score) pairs from whoosh_query in the Whoosh example above
kw_hits = whoosh_query(query_text)
# sem_hits: (title, cosine similarity) pairs from the semantic example above, reused as-is

# Build one DataFrame per method and add a 1-based rank column
kw_df = pd.DataFrame(kw_hits, columns=["Title", "TFIDF_Score"])
kw_df["KW_Rank"] = range(1, len(kw_df) + 1)

sem_df = pd.DataFrame(sem_hits, columns=["Title", "CosineSim"])
sem_df["SEM_Rank"] = range(1, len(sem_df) + 1)

# Merge on title to show both ranks together
comparison = pd.merge(sem_df, kw_df, on="Title", how="outer")

# Sort by semantic rank to highlight semantic ordering
comparison_sorted = comparison.sort_values(by="SEM_Rank", na_position="last")

comparison_sorted[["Title", "SEM_Rank", "CosineSim", "KW_Rank", "TFIDF_Score"]]
Sample comparison table output:
                         Title  SEM_Rank  CosineSim  KW_Rank  TFIDF_Score
0   Engine Troubleshooting 101         1       0.84      1.0     3.688410
2        Valve Timing Problems         2       0.63      NaN          NaN
3  Motor Diagnostics Checklist         3       0.61      NaN          NaN
1    Piston Wear & Compression         4       0.46      NaN          NaN
4      Airflow & Intake Issues         5       0.42      NaN          NaN
5            Oil System Basics         6       0.35      NaN          NaN
Performance and practical considerations:
| Method | Strengths | Typical Use Cases | Example behavior on “engine troubleshooting” |
|---|---|---|---|
| TF-IDF (Whoosh) | Fast, interpretable, precise for exact token matches | Keyword search, filtering, small to medium corpora | Finds documents that contain the exact words “engine” and “troubleshooting” (high score) |
| Semantic (Embeddings) | Captures conceptual similarity, robust to paraphrases | QA retrieval, recommendation, broader recall | Returns “Motor Diagnostics Checklist” and “Valve Timing Problems” even without exact term overlap |
Interpretation:
  • Whoosh (TF-IDF) excels at precision for literal queries and is simple to run locally.
  • Semantic search returns documents ranked by conceptual relevance and can surface related material that lacks exact tokens from the query.
  • A hybrid approach often works best: use TF-IDF for exact matches and embeddings to expand recall, or rerank TF-IDF candidates with embeddings for a balance of speed and semantic quality.
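One way to sketch the hybrid idea is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions. RRF is a swapped-in technique for illustration; the lesson does not prescribe a specific fusion method. The function name, the constant k=60 (a commonly used default), and the hard-coded title lists (copied from the example outputs above) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked title lists: each list adds 1 / (k + rank) to a title's score."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, title in enumerate(ranking, start=1):
            fused[title] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Titles in rank order, taken from the example outputs earlier in the lesson.
kw_titles = ["Engine Troubleshooting 101"]
sem_titles = [
    "Engine Troubleshooting 101",
    "Valve Timing Problems",
    "Motor Diagnostics Checklist",
    "Piston Wear & Compression",
    "Airflow & Intake Issues",
    "Oil System Basics",
]

hybrid = reciprocal_rank_fusion([kw_titles, sem_titles])
for title, score in hybrid:
    print(f"{score:.4f}  {title}")
```

A document found by both methods accumulates score from each list, so exact matches rise to the top while semantic-only hits still appear further down. Because RRF uses ranks rather than raw scores, it avoids having to put TF-IDF scores and cosine similarities on a common scale.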
Conclusion
  • This lesson illustrated the differences between keyword (TF-IDF) search and semantic retrieval using a small corpus.
  • You can run the provided notebook-style code to experiment with queries, model choice, and ranking strategies.