Limitations of Keyword Search

In this lesson we explain the limits of keyword search and why Retrieval‑Augmented Generation (RAG) benefits from semantic retrieval. You’ll learn how keyword search works, four common failure modes, how semantic retrieval addresses them, and a practical checklist for building a robust RAG pipeline.

Keyword search remains useful and widely used — this lesson explains where it excels and where semantic retrieval is necessary.

How keyword search works

At a high level, keyword search answers questions like “Which documents mention ‘engine’?” or “Which documents contain both ‘engine’ and ‘piston’?” Scanning every document at query time is slow, so search engines precompute a data structure called an inverted index, which maps each token (term) to the list of documents that contain it (its posting list). At query time the engine looks up each query term and combines the posting lists: it intersects them when every term is required and takes the union when any term suffices. This precomputation is what makes keyword search fast.

Because the inverted index maps term → document, exact token overlap is the strongest retrieval signal. Lightweight preprocessing (tokenization, stemming, stop‑word removal) is applied to both queries and documents so that surface variants such as “Pistons” and “piston” normalize to the same token.
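
The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not how a production engine stores its index: the stop-word list, the sample documents, and the function names are all assumptions made for the example, and stemming is deliberately left out.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical sample corpus.
docs = {
    1: "The camshaft controls valve timing in the engine",
    2: "Engine pistons and the crankshaft",
    3: "Brake pads and rotors",
}
index = build_inverted_index(docs)

engine_docs = search_all_terms(index, "engine")           # documents 1 and 2
both_terms = search_all_terms(index, "engine pistons")    # document 2 only
no_stemming = search_all_terms(index, "engine piston")    # empty: "piston" != "pistons"
```

The last query returns nothing because the sketch has no stemmer, which is exactly the gap that the preprocessing step is meant to close.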

The image illustrates how keyword search works using an inverted index, linking words like "camshaft" and "engine" to corresponding documents.

Historically, even major search engines like Google started from this inverted‑index model rather than neural ranking. The inverted index remains a core building block for fast retrieval.

Ranking: which matches should appear first?

When many documents match a query (e.g., “piston valve timing”), ranking determines the order. Ranking combines signals like exact matches, near-exact matches, term frequency, and document frequency.

- Exact match: query tokens appear verbatim (high weight).
- Near-exact match: variations produced by stemming, pluralization, or synonym expansion.
- Partial match: documents containing only a subset of query terms (lower weight).

Example: for the query “intake valve”, a document titled “Intake Valve Design” is an exact match. A document titled “Valve Timing System” shares only the term “valve” with the query, so it is related but scores lower unless synonym expansion or other boosts raise it.
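
The three match types can be told apart with a small classifier. The sketch below is only illustrative: `crude_stem` is a hypothetical stand-in for a real stemmer (it just strips a trailing "s"), and real engines score these cases with weights rather than returning labels.

```python
def crude_stem(token):
    """Very rough plural stripping; a placeholder for a real stemmer."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def match_type(query, title):
    """Label a title as an exact, near-exact, partial, or non-match for a query."""
    q_tokens = query.lower().split()
    t_tokens = title.lower().split()
    # Exact: every query token appears verbatim in the title.
    if all(tok in t_tokens for tok in q_tokens):
        return "exact"
    q_stems = {crude_stem(tok) for tok in q_tokens}
    t_stems = {crude_stem(tok) for tok in t_tokens}
    # Near-exact: every query token matches once both sides are stemmed.
    if q_stems <= t_stems:
        return "near-exact"
    # Partial: only some of the query tokens match.
    if q_stems & t_stems:
        return "partial"
    return "none"

labels = {
    title: match_type("intake valve", title)
    for title in [
        "Intake Valve Design",
        "Intake Valves Overview",
        "Valve Timing System",
        "Brake Pads",
    ]
}
```

A ranking function would then assign the highest weight to "exact", a lower weight to "near-exact", and the lowest to "partial".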

The image illustrates the concept of exact match search, showing a query for "Intake Valve" and highlighting "Document 1" as a relevant match with "Intake Valve Design."

Systems can surface near-exact and partial matches through configurable features such as synonym maps and fuzzy matching; these matches are typically scored lower than exact matches.

The image illustrates a user interface for "Visualizing Match Types – Near-Exact Match," showing search results for "Intake Valve" with documents related to "Valve Timing System" and "Engine Pistons and Valves."

Scoring: TF, IDF, TF‑IDF, and BM25

Common lexical scoring components:

- Term Frequency (TF): how often a term appears in a document.
- Inverse Document Frequency (IDF): downweights terms that appear in many documents.
- TF‑IDF: multiplies TF by IDF to favor documents where a term is frequent but not ubiquitous.
- BM25: improves TF‑IDF by applying a saturation function to TF and normalizing for document length, reducing reward for keyword stuffing.
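
For reference, the components above are commonly written as follows. This is one standard textbook form, not necessarily the exact variant a given engine uses: here N is the number of documents, df(t) the number of documents containing term t, |d| the length of document d, and avgdl the average document length. The parameters k_1 and b are tunable (values near 1.2 and 0.75 are common defaults), and the IDF used inside BM25 differs slightly between implementations.

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
\]

\[
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}_{\mathrm{BM25}}(t) \cdot
  \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}
       {\mathrm{tf}(t, d) + k_1 \left( 1 - b + b \, \frac{|d|}{\mathrm{avgdl}} \right)}
\]
```

As tf(t, d) grows, the BM25 fraction approaches k_1 + 1 instead of growing without bound (saturation), and the |d| / avgdl term shrinks the contribution of documents that are longer than average (length normalization).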

The image illustrates TF-IDF weighting for the term "piston" across three documents, indicating high, low, and medium term frequencies with 10, 2, and 5 mentions respectively.

Plain TF‑IDF can favor long documents simply because they contain more occurrences of a term. BM25 compensates with length normalization and a saturation curve, so each additional occurrence contributes less than the one before.

The image compares two documents that BM25 scores differently: a very long document that mentions "piston" 12 times among many topics, and a short document that mentions "piston" twice but focuses solely on pistons.

A typical shape comparison shows TF‑IDF scores growing roughly linearly with term occurrences, while BM25 flattens after a point due to saturation.
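
Both effects can be checked numerically with BM25's term-frequency component alone. The sketch below assumes the common defaults k1 = 1.2 and b = 0.75, omits the IDF factor (it is identical for both documents because the term is the same), and uses made-up document lengths for the long and short "piston" documents.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component; the IDF factor is intentionally omitted."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Length normalization: the lengths below are illustrative assumptions.
long_doc = bm25_tf(tf=12, doc_len=5000, avg_doc_len=1000)   # many topics, 12 mentions
short_doc = bm25_tf(tf=2, doc_len=100, avg_doc_len=1000)    # focused, 2 mentions

# Saturation: hold the document at average length and vary the occurrence count.
occurrences = [1, 2, 5, 10, 20, 50]
saturation = [bm25_tf(tf, 1000, 1000) for tf in occurrences]

if __name__ == "__main__":
    print(f"long document:  {long_doc:.3f}")
    print(f"short document: {short_doc:.3f}")
    for tf, score in zip(occurrences, saturation):
        print(f"tf={tf:>2}  raw count={tf:>2}  bm25 component={score:.3f}")
```

With these assumed lengths the short, focused document scores higher despite having fewer mentions, and the BM25 component stays below k1 + 1 = 2.2 no matter how often the term is repeated, whereas the raw count keeps growing linearly.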

The image is a line graph comparing TF-IDF and BM25 scores against term occurrence, illustrating the saturation effect: the TF-IDF curve keeps rising steeply, while the BM25 curve flattens as occurrences increase.

These components — inverted indexes, TF‑IDF/BM25 scoring, and ranking heuristics — form the backbone of modern keyword retrieval.

The image displays the title "The Invisible Librarian at Work" and a labeled section about "Organization," describing the process of building inverted indexes that map every word to its documents.

Four failure modes of keyword search

Even with BM25 and query expansion, keyword search fails in predictable ways. Below are four common failure modes and how semantic retrieval (dense embeddings + re-ranking) addresses them.

| Failure mode | Why it fails | How semantic retrieval helps |
| --- | --- | --- |
| Lexical mismatch (synonyms, acronyms) | Tokens differ (e.g., `2FA`, `two-factor authentication`, `MFA`), so recall is low unless synonyms are added manually. | Embeddings capture semantic equivalence; hybrid search plus synonym maps improve recall. |
| Polysemy & context blindness | Bag-of-words ignores context (e.g., “Jaguar speed” could be animal or car). | Contextual embeddings encode surrounding words, enabling disambiguation and better re-ranking. |
| Long documents & passage granularity | The answer may be a paragraph inside a long PDF; document-level scoring buries passages. | Chunk/passage-level embeddings and retrieval return focused passages for the LLM to consume. |
| Noisy language (typos, variants, cross‑language) | Typos, spelling variants, or cross-language queries create out-of-vocabulary (OOV) tokens and tokenization issues. | Multilingual or typo-tolerant embeddings, fuzzy matching, and better tokenization reduce sensitivity to noise. |

Lexical mismatch (synonyms, acronyms)
- Example: 2FA, two-factor authentication, and MFA are lexically different but equivalent. Keyword search needs synonym maps or query expansion to achieve good recall.
- Query expansion can help, but it requires manual curation or domain rules.
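
A curated synonym map can be applied at query time roughly as follows. The synonym group, the sample document, and the function names are hypothetical, and the substring matching is a simplification; the point is only to show how expansion recovers a match that the literal query misses.

```python
# Hypothetical, manually curated synonym groups for one domain.
SYNONYM_GROUPS = [
    {"2fa", "two-factor authentication", "mfa"},
]

def expand_query(query):
    """Return the query plus every variant produced by swapping in a known synonym."""
    q = query.lower()
    variants = {q}
    for group in SYNONYM_GROUPS:
        for term in group:
            if term in q:
                variants.update(q.replace(term, other) for other in group)
    return variants

def matches(query, doc):
    """True if any expanded variant of the query occurs as a phrase in the document."""
    text = doc.lower()
    return any(variant in text for variant in expand_query(query))

doc = "How to enable MFA for your account"
literal_hit = "enable 2fa" in doc.lower()   # False: no token overlap on the acronym
expanded_hit = matches("enable 2FA", doc)   # True: "enable mfa" is one of the variants
```

Every new equivalence has to be added to `SYNONYM_GROUPS` by hand, which is the maintenance cost the bullets above refer to.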

The image presents a topic, "Failure #1: Lexical Mismatch," with reasons why keywords struggle, such as "No overlap, low recall" and "Synonyms must be added manually."

Polysemy and context blindness
- Example: “Jaguar speed” could refer to the animal or the car; “Java memory model” could mean the programming language or the island. Bag‑of‑words lacks context to disambiguate.
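- Mitigations on the keyword side: phrase queries, field boosts, or manual disambiguation. Semantic retrieval in a RAG pipeline uses contextual embeddings that encode the surrounding words, which helps disambiguate word senses, and a re-ranker can then promote the passages that match the intended meaning. The same embeddings also support semantic matches, such as relating 2FA to two-factor authentication.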
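
The context blindness is easy to reproduce with a bare token-overlap score. The two sample sentences below are invented for illustration, and the scoring function is a deliberately naive bag-of-words stand-in rather than a real ranking function.

```python
def overlap_score(query, doc):
    """Count distinct query tokens that also appear in the document (bag of words)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Hypothetical documents about the two senses of "jaguar".
docs = {
    "animal": "the jaguar reaches high speed in short bursts through the rainforest",
    "car": "the jaguar sedan reaches high speed on the motorway",
}

ambiguous = {name: overlap_score("jaguar speed", text) for name, text in docs.items()}
with_context = {name: overlap_score("jaguar speed motorway", text) for name, text in docs.items()}
```

Both documents receive the same score for "jaguar speed", so the keyword ranker has no basis for preferring one sense. Adding a context word to the query breaks the tie, but only because that exact token happens to appear in one document.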

The image explains polysemy and context blindness in search, highlighting challenges with keyword searches and how RAG (Retrieval-Augmented Generation) addresses these issues with context and re-ranking.

Note: very short or ambiguous queries remain challenging even for semantic systems — adding context, session history, or query expansion often helps.

Long documents and passage granularity
- The relevant answer may be a short passage inside a long document. Document-level indexing treats the whole file as one item and can bury the passage.
- Best practice: chunk documents into passages and index passages as first-class retrieval units. RAG systems commonly use passage-level embeddings so the LLM receives concise, relevant content.
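
A sliding-window chunker with overlap might look like the sketch below. It assumes the text has already been split into tokens; whitespace-separated words are used here as a stand-in, whereas real token counts depend on the embedding model's tokenizer. The defaults of 400 tokens with a 60-token overlap are illustrative choices, not recommendations from a specific library.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=60):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Stand-in for a tokenized 1,000-token document.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
chunk_lengths = [len(chunk) for chunk in chunks]
```

Each chunk is then embedded and indexed as its own retrieval unit, and the overlap reduces the chance that an answer is split across a chunk boundary.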

Noisy language (typos, variants, cross-language)
- Typos (e.g., “inrgess” for “ingress”), spelling variants (“licence” vs. “license”), or queries written in a different language from the documents produce out-of-vocabulary (OOV) tokens and tokenization problems, so the inverted index has nothing to match.
- Mitigations: use multilingual or typo-tolerant embeddings, improve tokenization, and normalize text to reduce the impact of noise.
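
On the lexical side, normalization and fuzzy matching can be sketched with the standard library alone. The vocabulary and the distance threshold of 2 are assumptions for the example, and the linear scan over the vocabulary is for clarity only; search engines use more efficient structures for fuzzy lookup.

```python
import unicodedata

def normalize(text):
    """Lowercase and strip accents so 'Café' and 'cafe' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def fuzzy_lookup(term, vocabulary, max_distance=2):
    """Return vocabulary terms within `max_distance` edits of the normalized term."""
    term = normalize(term)
    return sorted(v for v in vocabulary if edit_distance(term, normalize(v)) <= max_distance)

vocabulary = ["ingress", "egress", "license"]   # hypothetical index terms
typo_hits = fuzzy_lookup("inrgess", vocabulary)
variant_hits = fuzzy_lookup("licence", vocabulary)
```

Fuzzy matching of this kind handles small typos and spelling variants, but it cannot help with a query in another language, which is where multilingual embeddings come in.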

The image explains "Failure #4: Noisy Language," detailing why keywords struggle with OOV words, small fuzzy distance, and language-specific tokenization, and how RAG addresses these with language-specific tokenization and cross-language semantic matching.

Implementation checklist for RAG

When combining keyword and semantic retrieval, follow this practical checklist:

- Use hybrid retrieval (BM25 + dense vectors) to combine precise keyword matching with semantic recall.
- Optimize chunking: common chunk sizes are 300–600 tokens with 10–20% overlap; tune based on document type.
- Select embeddings that match your domain and language requirements (domain‑specific or multilingual models).
- Apply cross‑encoder re‑ranking: re-rank the top‑K hits from BM25 + dense retrieval with a cross‑encoder for higher precision.
- Use query expansion techniques such as hypothetical document embeddings or curated synonym maps for very short or ambiguous queries.
- Maintain data hygiene: dedupe, strip boilerplate/navigation, normalize text, and correct common spelling errors.
- Maintain synonym/acronym maps for known domain equivalences (e.g., 2FA ↔ two-factor authentication).
- Evaluate with realistic test sets and metrics: recall@K, MRR, and “answer found in top K.”
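
The evaluation metrics in the last item are simple to compute once each test query has a ranked result list and a set of known relevant documents. The sketch below uses hypothetical document ids and function names.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)

# Two hypothetical test queries with ranked results and ground-truth sets.
runs = [
    (["d3", "d1", "d7"], {"d1", "d9"}),
    (["d2", "d5"], {"d2"}),
]
recall_q1 = recall_at_k(runs[0][0], runs[0][1], k=3)
mrr = mean_reciprocal_rank(runs)
```

The "answer found in top K" metric corresponds to checking whether `recall_at_k` is greater than zero for each query and averaging the result.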

Hybrid search preserves the speed and determinism of inverted indexes while extending recall with embeddings and re‑ranking.
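
One common way to merge a BM25 ranking with a dense-vector ranking is reciprocal rank fusion (RRF). The lesson does not prescribe a fusion method, so treat this as one possible sketch: the result lists are hypothetical, and the constant k = 60 is the value commonly used with RRF rather than something tuned for a particular corpus.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists by summing 1 / (k + rank) for each document across lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top results from each retriever.
bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against embedding similarities, which live on different scales. The fused list can then be passed to a cross-encoder for re-ranking.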


When to keep using keyword search

Keyword search is still the right tool for many cases:

- Exact lookups: IDs, SKUs, and log lines.
- Compliance and audit workflows where deterministic operator semantics (AND/OR/NOT) matter.
- Power users familiar with precise query syntax and domain jargon.
- Very large-scale, low-latency lookups that depend on inverted indexes for performance.

Keyword search is fast, well understood, and robust; semantic retrieval in a RAG pipeline augments it rather than replacing it.

This lesson summarized how keyword search works, common limitations that motivate semantic retrieval, and practical steps to combine lexical and dense retrieval effectively in RAG systems.

