Demonstrates and compares TF-IDF keyword search with semantic embeddings for document retrieval using Whoosh and SentenceTransformers, highlighting differences and tradeoffs.
In this lesson we demonstrate two common document retrieval approaches using a small automotive troubleshooting corpus:
Keyword search (TF-IDF) with Whoosh
Semantic retrieval using SentenceTransformers + cosine similarity
We’ll build a tiny corpus, run both methods on the same query, and compare the ranked results side-by-side so you can see how exact-match ranking (TF-IDF) and meaning-based retrieval (embeddings) differ.
Keywords: TF-IDF, semantic search, embeddings, Whoosh, SentenceTransformers, cosine similarity
Prerequisites (run once): install the packages imported by the examples below (whoosh, sentence-transformers, scikit-learn, numpy, pandas), then define the shared corpus:
```python
docs = [
    "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
    "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Valve timing issues often masquerade as rough idle—inspect the timing belt and camshaft alignment.",
    "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
    "Understanding airflow: clogged filters, intake leaks, and MAF sensor failures reduce performance.",
    "Basic oil system checks: pressure light warnings, pump failures, and choosing the right viscosity.",
]

titles = [
    "Engine Troubleshooting 101",
    "Piston Wear & Compression",
    "Valve Timing Problems",
    "Motor Diagnostics Checklist",
    "Airflow & Intake Issues",
    "Oil System Basics",
]
```
SentenceTransformers will download model weights on first use. If you’re running this in a restricted environment, pre-download models or set SENTENCE_TRANSFORMERS_HOME to a writable cache folder. For larger corpora consider batching embeddings to avoid memory spikes.
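The batching advice above can be sketched as follows. This is a minimal illustration, not the lesson's own code: `batched`, `embed_in_batches`, and `fake_embed` are hypothetical helper names, and `fake_embed` only stands in for a real call such as `model.encode(chunk)` so the sketch runs without downloading a model.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
) -> Iterator[List[Vector]]:
    """Embed one chunk at a time so only one batch of vectors is built per step."""
    for chunk in batched(texts, batch_size):
        yield embed_fn(chunk)


def fake_embed(chunk: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for model.encode(chunk): one 1-d vector per text."""
    return [[float(len(text))] for text in chunk]


sample = ["a", "bb", "ccc", "dddd", "eeeee"]
batch_sizes = [len(vectors) for vectors in embed_in_batches(sample, fake_embed, batch_size=2)]
print(batch_sizes)  # [2, 2, 1]
```

Because the helper is a generator, each batch of vectors can be written to disk or added to an index before the next one is computed. For a corpus that fits in memory, passing `batch_size` directly to `model.encode` is usually enough.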
1) Keyword search (TF-IDF) with Whoosh
Whoosh is a pure-Python search library that indexes documents and scores matches with a TF-IDF-family ranking function (its default scorer is BM25F). The example below creates a temporary Whoosh index, adds our documents, and runs a simple keyword query. Whoosh matches query terms against the indexed tokens and ranks the matching documents using term frequency and inverse document frequency.
```python
# whoosh_keyword_search.py
import os
import shutil
import tempfile

from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Create temp index
index_dir = tempfile.mkdtemp()
schema = Schema(title=ID(stored=True), content=TEXT(stored=True))
ix = index.create_in(index_dir, schema)

# Add documents to index
writer = ix.writer()
for t, d in zip(titles, docs):
    writer.add_document(title=t, content=d)
writer.commit()


# Helper to run a keyword query
def whoosh_query(query_text, top_k=10):
    with ix.searcher() as searcher:
        parser = QueryParser("content", ix.schema)
        q = parser.parse(query_text)
        results = searcher.search(q, limit=top_k)
        return [(r["title"], r.score) for r in results]


# Examples
print("Whoosh results for 'engine troubleshooting':", whoosh_query("engine troubleshooting"))
print("Whoosh results for 'piston':", whoosh_query("piston"))

# Cleanup when finished (uncomment to remove index)
# shutil.rmtree(index_dir)
```
Example Whoosh outputs (your scores may vary slightly):
```
Whoosh results for 'engine troubleshooting': [('Engine Troubleshooting 101', 3.688410483089193)]
Whoosh results for 'piston': [('Piston Wear & Compression', 2.110439158172188)]
```
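To make the ranking idea behind these scores concrete, here is a small standard-library sketch of plain TF-IDF scoring over three of the corpus documents. It is an illustration under stated assumptions, not Whoosh's implementation: Whoosh's default scorer is BM25F, so the numbers will not match the output above. The helper names (`tokenize`, `idf`, `keyword_search`) are hypothetical, and the sketch requires every query term to be present, which is consistent with the single hit Whoosh returned for "engine troubleshooting".

```python
import math
import re
from collections import Counter

# Three documents taken from the lesson corpus
corpus = {
    "Engine Troubleshooting 101": "A beginner's guide to engine troubleshooting: check fuel lines, spark, and air intake before replacing parts.",
    "Piston Wear & Compression": "Piston wear can cause loss of compression; regular maintenance and proper lubrication extend engine life.",
    "Motor Diagnostics Checklist": "Motor diagnostics for intermittent power loss: scan error codes, inspect sensors, and test the ignition coil.",
}


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokenized = {title: tokenize(text) for title, text in corpus.items()}
n_docs = len(tokenized)


def idf(term):
    """Inverse document frequency: rarer terms get a larger weight."""
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(n_docs / df) if df else 0.0


def keyword_search(query):
    """Return (title, score) for documents containing every query term."""
    terms = tokenize(query)
    hits = []
    for title, tokens in tokenized.items():
        if all(term in tokens for term in terms):
            counts = Counter(tokens)
            score = sum(counts[term] * idf(term) for term in terms)
            hits.append((title, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


both_terms = keyword_search("engine troubleshooting")
one_term = keyword_search("engine")
print([title for title, _ in both_terms])
print([title for title, _ in one_term])
```

In this sketch "troubleshooting" contributes more to the score than "engine" because it appears in fewer documents. "Motor Diagnostics Checklist" never appears in either result because it shares no tokens with the queries.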
Notes on Whoosh behavior:
Whoosh returns documents where the query terms appear and ranks by TF-IDF-like importance.
If a conceptually relevant document doesn’t contain the exact query terms (e.g., “Motor Diagnostics Checklist” for the query “engine troubleshooting”), Whoosh will not return it unless the text contains matching tokens.
This example stores the index in a temporary directory. In production or repeated runs, persist the index directory or rebuild as needed. Remember to clean up temporary files to avoid disk bloat (shutil.rmtree(index_dir)).
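One way to make the cleanup automatic is the standard library's `tempfile.TemporaryDirectory`, which removes the directory when its `with` block exits. The sketch below uses a placeholder file instead of a real index so it runs without Whoosh installed. In a real run you would call `index.create_in(index_dir, schema)` inside the block and finish all searches before leaving it; that usage pattern is an assumption here, not something taken from the lesson.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as index_dir:
    # Placeholder for index.create_in(index_dir, schema) and the queries
    marker = os.path.join(index_dir, "placeholder.txt")
    with open(marker, "w", encoding="utf-8") as fh:
        fh.write("index files would live here")
    existed_inside = os.path.exists(marker)

# The directory and everything in it is gone once the block exits
removed_after = not os.path.exists(index_dir)
print(existed_inside, removed_after)  # True True
```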
2) Semantic search with SentenceTransformers (cosine similarity)
Semantic search embeds documents and queries into a vector space and ranks by vector similarity (cosine). This approach captures conceptual relationships beyond exact token overlap. We use the paraphrase-MiniLM-L6-v2 model for compact, fast embeddings.
```python
# semantic_search.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Query used for comparison
query_text = "engine troubleshooting"

# Embed documents and query (normalized embeddings)
doc_embeddings = model.encode(docs, convert_to_numpy=True, normalize_embeddings=True)
query_embedding = model.encode([query_text], convert_to_numpy=True, normalize_embeddings=True)

# Compute cosine similarity and rank (descending)
sims = cosine_similarity(query_embedding, doc_embeddings).flatten()
sem_hits_idx = np.argsort(-sims)
sem_hits = [(titles[i], float(sims[i])) for i in sem_hits_idx]

# Top semantic hits
sem_hits[:6]
```
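Example semantic ranking (scores will vary by model version and environment; these are the sample values from the comparison table in section 3):
1. Engine Troubleshooting 101 (0.84)
2. Valve Timing Problems (0.63)
3. Motor Diagnostics Checklist (0.61)
4. Piston Wear & Compression (0.46)
5. Airflow & Intake Issues (0.42)
6. Oil System Basics (0.35)
Notes on semantic behavior: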
The embedding model captures conceptual similarity, so queries like “engine troubleshooting” will surface “motor diagnostics” and “valve timing” even without exact word overlap.
Semantic retrieval improves recall for related documents; TF-IDF provides more precision for literal matches.
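The ranking step itself is small enough to sketch with the standard library. The vectors below are made-up three-dimensional toy values, not model output, and the names `doc_a`, `doc_b`, and `doc_c` are hypothetical. The sketch also shows that once vectors are normalized to unit length, cosine similarity reduces to a plain dot product.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy "embeddings" (invented numbers for illustration only)
query_vec = [1.0, 2.0, 0.0]
doc_vecs = {
    "doc_a": [2.0, 4.0, 0.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 1.0],  # partially aligned
    "doc_c": [0.0, 0.0, 3.0],  # orthogonal to the query
}

q = normalize(query_vec)
ranked = sorted(
    ((name, dot(q, normalize(vec))) for name, vec in doc_vecs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print([name for name, _ in ranked])  # ['doc_a', 'doc_b', 'doc_c']
```

Vector length does not affect the score: `doc_a` is twice as long as the query but points in the same direction, so its similarity is 1.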
3) Compare TF-IDF and Semantic rankings side-by-side
We can combine Whoosh (TF-IDF) hits and SentenceTransformers (cosine similarity) hits into a pandas DataFrame to compare ranks and scores. This lets you directly inspect differences in ordering and the presence/absence of documents in each result set.
```python
# compare_rankings.py
import pandas as pd

# kw_hits: list returned by whoosh_query for the same query (Whoosh example above)
kw_hits = whoosh_query(query_text)
# sem_hits: ranked (title, similarity) list from the semantic example above

# Build one DataFrame per method and add a 1-based rank column
kw_df = pd.DataFrame(kw_hits, columns=["Title", "TFIDF_Score"])
kw_df["KW_Rank"] = range(1, len(kw_df) + 1)

sem_df = pd.DataFrame(sem_hits, columns=["Title", "CosineSim"])
sem_df["SEM_Rank"] = range(1, len(sem_df) + 1)

# Outer merge on title keeps documents found by only one method (NaN in the other columns)
comparison = pd.merge(sem_df, kw_df, on="Title", how="outer")

# Sort by semantic rank to highlight semantic ordering
comparison_sorted = comparison.sort_values(by="SEM_Rank", na_position="last")
comparison_sorted[["Title", "SEM_Rank", "CosineSim", "KW_Rank", "TFIDF_Score"]]
```
Sample comparison table output:
```
                         Title  SEM_Rank  CosineSim  KW_Rank  TFIDF_Score
0   Engine Troubleshooting 101         1       0.84      1.0     3.688410
2        Valve Timing Problems         2       0.63      NaN          NaN
3  Motor Diagnostics Checklist         3       0.61      NaN          NaN
1    Piston Wear & Compression         4       0.46      NaN          NaN
4      Airflow & Intake Issues         5       0.42      NaN          NaN
5            Oil System Basics         6       0.35      NaN          NaN
```
Performance and practical considerations:
| Method | Strengths | Typical Use Cases | Example behavior on “engine troubleshooting” |
| --- | --- | --- | --- |
| TF-IDF (Whoosh) | Fast, interpretable, precise for exact token matches | Keyword search, filtering, small to medium corpora | Finds documents that contain the exact words “engine” and “troubleshooting” (high score) |
| Semantic (Embeddings) | Captures conceptual similarity, robust to paraphrases | QA retrieval, recommendation, broader recall | Returns “Motor Diagnostics Checklist” and “Valve Timing Problems” even without exact term overlap |
Interpretation:
Whoosh (TF-IDF) excels at precision for literal queries and is simple to run locally.
Semantic search returns documents ranked by conceptual relevance and can surface related material that lacks exact tokens from the query.
A hybrid approach often works best: use TF-IDF for exact matches and embeddings to expand recall, or rerank TF-IDF candidates with embeddings for a balance of speed and semantic quality.
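The first hybrid strategy can be sketched as below: exact keyword matches come first, then semantic-only hits are appended to expand recall. This is one possible way to combine the lists, not a method taken from the lesson. `hybrid_rank` and the `min_sim` cutoff of 0.5 are hypothetical choices, and the input values are the sample scores from the comparison table.

```python
from typing import List, Tuple

Hit = Tuple[str, float]


def hybrid_rank(kw_hits: List[Hit], sem_hits: List[Hit], min_sim: float = 0.5) -> List[Tuple[str, str]]:
    """Keyword matches first, then semantic-only hits with similarity >= min_sim."""
    ranked = [
        (title, "keyword")
        for title, _ in sorted(kw_hits, key=lambda hit: hit[1], reverse=True)
    ]
    seen = {title for title, _ in ranked}
    for title, sim in sorted(sem_hits, key=lambda hit: hit[1], reverse=True):
        if title not in seen and sim >= min_sim:
            ranked.append((title, "semantic"))
            seen.add(title)
    return ranked


# Sample values copied from the comparison table
kw_hits = [("Engine Troubleshooting 101", 3.688410)]
sem_hits = [
    ("Engine Troubleshooting 101", 0.84),
    ("Valve Timing Problems", 0.63),
    ("Motor Diagnostics Checklist", 0.61),
    ("Piston Wear & Compression", 0.46),
    ("Airflow & Intake Issues", 0.42),
    ("Oil System Basics", 0.35),
]

hybrid = hybrid_rank(kw_hits, sem_hits)
for title, source in hybrid:
    print(f"{source:8s} {title}")
```

The two score scales are never mixed: keyword scores order the exact matches, and cosine similarities only order and filter the additions. The cutoff trades recall for precision and would need tuning on real queries.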
Conclusion
This lesson illustrated the differences between keyword (TF-IDF) search and semantic retrieval using a small corpus.
You can run the provided notebook-style code to experiment with queries, model choice, and ranking strategies.