This lesson explains retrieval methods and the critical design considerations for building retrieval-augmented generation (RAG) systems. Retrieval is the foundation of RAG quality: an LLM generates grounded answers only when the retrieved context is relevant and correct. Even a strong LLM will produce incorrect output if retrieval returns irrelevant or wrong documents.
The image explains the importance of retrieval in Retrieval-Augmented Generation, highlighting that if retrieval is incorrect, the resulting answer will be wrong despite a strong model, showing a combination of LLM and a knowledge base.
Key system design dials
  • Latency — users expect fast responses; the retrieval path must be optimized for P95/P99 latency.
  • Cost — compute (re-rankers, LLM tokens) and storage (indexes, embeddings) grow with scale.
  • Trust — freshness, provenance, and cited answers increase user confidence.
The image explains why retrieval matters, highlighting two points: latency issues because people won’t wait, and cost concerns as compute and storage add up.
Common failure modes
  • Wrong chunking: retrieving tangential fragments instead of the exact answer.
  • Stale index: returning outdated information because the index wasn’t refreshed.
  • ACL leaks: returning documents a user shouldn’t see.
Plan mitigations early (refresh schedules, fine-grained ACLs, chunking rules) and add observability so you can detect these failure modes quickly.
The image explains why retrieval matters, highlighting three issues: wrong chunk retrieval resulting in irrelevant content, stale index causing outdated information, and ACL leak leading to unauthorized data exposure.
Retrieval pipeline (high level)
  • Ingest: split documents into chunks, create embeddings, and index them.
  • Query: convert user query to keywords + vector, perform filtered searches, and re-rank candidates.
  • Generation: pack and send the assembled context + query to the LLM to produce a grounded answer.
Each stage is an opportunity to measure, optimize, and secure data flows.
The image outlines a retrieval pipeline consisting of three phases: Ingest Phase for data indexing, Query Phase for retrieval and filtering, and LLM Generation for producing responses.
Cross-cutting concerns
  • Security: enforce who can see what (ACLs, filtering, query-time authorization).
  • Caching: accelerate the fast path to reduce latency and cost.
  • Observability: instrument ingest, retrieval, re-ranking, and generation for latency, error rates, and data correctness.
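As a minimal sketch of query-time authorization, the snippet below drops any retrieved chunk the user is not allowed to read before it reaches the LLM. The `Chunk` shape and the `allowed_groups` field are assumptions made for illustration; a real system would usually push this filter into the search engine itself rather than apply it afterwards.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    # Hypothetical ACL field: groups permitted to read this chunk.
    allowed_groups: frozenset = field(default_factory=frozenset)


def filter_by_acl(candidates, user_groups):
    """Keep only chunks that share at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [c for c in candidates if c.allowed_groups & user_groups]


# Illustrative sample data, not taken from the lesson.
candidates = [
    Chunk("hr-1", "Salary bands", frozenset({"hr"})),
    Chunk("eng-1", "Deployment runbook", frozenset({"eng", "sre"})),
    Chunk("pub-1", "Public holiday calendar", frozenset({"everyone"})),
]
visible = filter_by_acl(candidates, {"eng", "everyone"})
print([c.chunk_id for c in visible])  # ['eng-1', 'pub-1']
```

A chunk with no `allowed_groups` is never returned in this sketch, so missing metadata fails closed rather than open.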
The image is a diagram titled "Retrieval Pipeline at a Glance," featuring three key components: Security, Caching, and Observability, each with a brief description and an icon.
Retrieval methods — quick overview
  • Sparse search (keyword-based): precise for exact tokens, code snippets, error strings, and legal text.
  • Dense retrieval (embeddings): finds semantic matches and paraphrases; good for natural-language questions and cross-lingual queries.
  • Hybrid search: fuses sparse + dense signals (often via score fusion). A safe default because it captures both exact matches and paraphrases.
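One common way to fuse sparse and dense results is reciprocal rank fusion (RRF), which combines ranks rather than raw scores. The lesson does not say which fusion method it has in mind, so treat the sketch below as one option under that assumption; the document IDs and the constant `k=60` (a frequently used default) are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for deterministic ties.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from a keyword search
dense_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from a vector search
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents found by both retrievers rise to the top, while documents found by only one are still kept. Weighted score fusion is an alternative when the two score scales can be normalized reliably.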
Recommendation: start with hybrid search for broad coverage, monitor failure modes, and then tune the fusion weights and other parameters based on real traffic data.
Chunking basics
Chunking breaks documents into retrieval units. Chunk size and overlap strongly affect recall and precision:
  • Smaller chunks → higher precision, but can lose global context.
  • Larger chunks → preserve context, but may decrease specificity and increase token costs.
The image illustrates the concept of chunking with the emphasis on "High Recall Wins" and explains that smaller chunks provide better precision but may lose context.
Common chunking approaches
  • Fixed-size chunks (tokens or characters) with a small overlap (10–20%) to preserve continuity.
  • Semantic chunks aligned to natural boundaries (sections, paragraphs).
  • Structure-aware chunking that preserves headings, titles, and document hierarchy.
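The fixed-size approach can be sketched as a sliding window over tokens. This example assumes whitespace splitting as a stand-in for a real tokenizer, and the default of 200 tokens with a 30-token overlap (15%) is only an illustrative starting point.

```python
def chunk_fixed(tokens, size=200, overlap=30):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# ['the', 'quick', 'brown', 'fox']
# ['fox', 'jumps', 'over', 'the']
# ['the', 'lazy', 'dog', 'again']
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.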
Preserve document structure and attach rich metadata (titles, section headers, source IDs). Metadata helps re-rankers and the LLM understand relationships between chunks.
The image explains "Chunking Basics: High Recall Wins" with two types of text chunking: fixed-size chunks and semantic chunks.
Preserve structure and metadata
  1. Keep titles and section headers with their content.
  2. Maintain logical document hierarchy when chunking.
  3. Add metadata fields (source, section ID, published timestamp) to each chunk.
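A rough sketch of these three steps, assuming the input uses Markdown-style `#` headings (an assumption; other formats need their own parser). The `Chunk` fields, the file name, and the timestamp are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    section_id: str
    heading_path: list
    published: str  # ISO 8601 timestamp, kept as a string for simplicity


def chunk_by_heading(lines, source, published):
    """Group lines under '#'-style headings; each chunk keeps its heading path."""
    chunks, path, body = [], [], []

    def flush():
        if body:
            section_id = f"{source}#{len(chunks) + 1}"
            text = " > ".join(path) + "\n" + "\n".join(body)
            chunks.append(Chunk(text, source, section_id, list(path), published))
            body.clear()

    for line in lines:
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]  # trim back to the parent level
            path.append(line.lstrip("# ").strip())
        elif line.strip():
            body.append(line.strip())
    flush()
    return chunks


doc = ["# Install", "Run the installer.", "## Linux", "Use the tarball.",
       "# Usage", "Start the service."]
chunks = chunk_by_heading(doc, source="guide.md", published="2024-01-01T00:00:00Z")
for c in chunks:
    print(c.section_id, c.heading_path)
```

Prefixing each chunk's text with its heading path keeps the title with the content, so a chunk such as "Use the tarball." still carries the "Install > Linux" context when retrieved on its own.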
The image outlines "Chunking Basics" for achieving high recall, emphasizing the importance of preserving structure through three steps: keeping titles and section headers with content, maintaining logical document hierarchy, and adding rich metadata.
Special cases
Tables, code blocks, images, and scanned documents often need dedicated extractors, OCR, or syntax-aware chunkers. Source code should be treated differently from prose: preserve imports, function boundaries, and call graphs. Consider language-aware chunking for code and format-aware handling for binary data.
Vector search
Two common ANN index families to consider:
Index family | When to use | Notes
HNSW (Hierarchical Navigable Small World) | Default for many production setups where recall and latency matter | Excellent recall and low-latency queries; higher memory footprint
IVF / IVF-PQ | Very large corpora where memory is constrained | More memory-efficient, can be faster at scale if tuned; higher tuning complexity
Important tuning knobs
  • K (candidates): number of nearest neighbors returned by the ANN search. Typical starting points: K=50 for re-ranking pipelines, K=10 for direct results.
  • efSearch / nprobe: controls internal search breadth. Higher values increase recall at the cost of latency (efSearch for HNSW; nprobe for IVF/IVF-PQ).
Start with HNSW defaults, measure P95 latency and Precision@K on real traffic, then tune K and efSearch/nprobe against your SLAs.
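To compare settings while tuning, it helps to compute the two metrics the same way every time. The helpers below are a small sketch: `precision_at_k` assumes you have a labeled set of relevant document IDs per query, and `percentile` uses the nearest-rank definition (other definitions interpolate and give slightly different values). The sample IDs and latencies are made up for illustration.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4"}
latencies_ms = [12, 15, 11, 40, 13, 14, 16, 12, 90, 13]

print(precision_at_k(retrieved, relevant, k=5))  # 0.6
print(percentile(latencies_ms, 50))              # 13
print(percentile(latencies_ms, 95))              # 90
```

Running the same queries at several `efSearch` or `nprobe` values and recording both numbers per setting makes the recall-versus-latency trade-off visible before you commit to a configuration.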
The image is a slide titled "Vector Search in One Slide," highlighting two parameters: "K (candidates)" for retrieving nearest neighbors and "efSearch/nprobe" for controlling the recall-latency tradeoff, with a suggestion to start with HNSW defaults and monitor performance metrics.
Re-ranking and context packing
Typical flow
  1. Initial retrieval (hybrid) returns top N candidates (e.g., top 50).
  2. Cross-encoder re-ranker scores query-document pairs with a more expensive model and selects a smaller, high-precision set (e.g., top 5).
  3. Context packing: dedupe, order by relevance and document structure, and assemble the context with citations for transparency.
This “cheap broad retrieval → expensive targeted re-ranking” pattern converts good retrieval into great retrieval.
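The context-packing step might look like the sketch below. It assumes each candidate is a dict with `text`, `source`, and `score` keys (a hypothetical shape), uses a character budget as a stand-in for a token budget, and orders by score only; ordering by document structure is left out to keep the example short.

```python
def pack_context(candidates, max_chars=1200):
    """Dedupe, order by score, and assemble a cited context string."""
    seen, parts, citations, used = set(), [], [], 0
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        # Normalize case and whitespace so near-identical copies collapse.
        key = " ".join(cand["text"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        entry = f"[{len(citations) + 1}] {cand['text']}"
        if used + len(entry) > max_chars:
            break  # stay inside the context budget
        parts.append(entry)
        citations.append(cand["source"])
        used += len(entry)
    return "\n".join(parts), citations


# Illustrative candidates; paths and scores are invented for the example.
candidates = [
    {"text": "Firmware updates require a reboot.", "source": "kb/firmware.md", "score": 0.75},
    {"text": "Reset the router by holding the button.", "source": "kb/router.md", "score": 0.92},
    {"text": "reset the router by  holding the button.", "source": "kb/router-copy.md", "score": 0.90},
]
context, citations = pack_context(candidates)
print(context)
print(citations)  # ['kb/router.md', 'kb/firmware.md']
```

The bracketed numbers in the context line up with positions in `citations`, so the generated answer can refer to `[1]` or `[2]` and the application can map those markers back to sources.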
The image outlines a process for reranking and context packing, including initial retrieval of candidates, cross-encoder reranking of query-document pairs, and context packing for relevance and citation.
Costs and trade-offs
  • Cross-encoder re-ranking often yields 15–30% better relevance for the final assembled context but increases compute cost.
  • Always set timeouts: if re-ranking exceeds its time budget, fall back to the initial candidates. Users generally prefer a fast, approximate answer to a slow, perfect one.
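A time budget with a fallback can be sketched with the standard library's `concurrent.futures`. The function name `rerank_with_budget` and the two toy re-rankers are hypothetical; a real cross-encoder call would replace them.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def rerank_with_budget(rerank_fn, query, candidates, budget_s, top_n=5):
    """Return re-ranked results, or the original order if the budget is exceeded."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(rerank_fn, query, candidates)
    try:
        return future.result(timeout=budget_s)[:top_n], "reranked"
    except FutureTimeout:
        return candidates[:top_n], "fallback"
    finally:
        # Do not block on a slow re-ranker; the worker finishes in the background.
        pool.shutdown(wait=False)


def fast_reranker(query, candidates):
    return sorted(candidates)


def slow_reranker(query, candidates):
    time.sleep(0.3)  # simulate an expensive cross-encoder
    return sorted(candidates)


docs = ["c", "a", "b"]
print(rerank_with_budget(fast_reranker, "q", docs, budget_s=1.0))
print(rerank_with_budget(slow_reranker, "q", docs, budget_s=0.05))
```

Note that the timed-out call keeps running in its worker thread; this sketch only stops waiting for it. Returning a status string alongside the results makes it easy to count fallbacks in your metrics.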
The image outlines the benefits of reranking and context packing, highlighting improved precision, 15-30% better relevance, and enhanced retrieval in RAG systems.
Graceful degradation
Implement fallback paths to cheaper or cached results when re-rankers or downstream services are slow. Maintain user experience by returning timely answers even during partial failures.
The image is a slide titled "Reranking and Context Packing," explaining why time budgets matter. It lists three points: setting timeouts, valuing speed over perfect answers, and maintaining performance during slowdowns.
Operational knobs and caching
Key configuration knobs to monitor and tune:
  • Hybrid weights (sparse vs dense).
  • Retrieve depth (K candidates).
  • ANN search parameters (efSearch / nprobe).
  • Re-rank depth (how many candidates to re-score).
Caching recommendations
  • Cache query embeddings for frequent queries.
  • Cache ANN candidate sets and assembled context packs for hot queries.
  • Use cache warming for predictable traffic (e.g., daily reports, scheduled queries).
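As one possible shape for the query-embedding cache, here is a small LRU cache with a time-to-live, so cached entries cannot outlive the freshness window you choose. `TTLCache`, `fake_embed`, and the normalization step are assumptions for this sketch; in practice `fake_embed` would be a call to your embedding model.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache whose entries expire after `ttl_s` seconds."""

    def __init__(self, max_items=1024, ttl_s=300.0, clock=time.monotonic):
        self.max_items, self.ttl_s, self.clock = max_items, ttl_s, clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._items.get(key)
        if hit is not None and hit[0] > now:
            self._items.move_to_end(key)  # mark as recently used
            return hit[1]
        value = compute(key)
        self._items[key] = (now + self.ttl_s, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict least recently used
        return value


calls = []


def fake_embed(query):
    """Stand-in for a real embedding call; records each invocation."""
    calls.append(query)
    return [float(len(query))]


def normalize(query):
    return " ".join(query.lower().split())


cache = TTLCache(max_items=2, ttl_s=60.0)
for q in ["Reset password", "reset  password", "Billing"]:
    cache.get_or_compute(normalize(q), fake_embed)
print(calls)  # ['reset password', 'billing']
```

Normalizing the query before using it as a key lets trivially different spellings share one entry. Cache warming amounts to calling `get_or_compute` for the expected queries ahead of the traffic.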
The image outlines a caching strategy involving embeddings, candidate sets, and context packs with suggestions for caching query embeddings, ANN results, and assembled contexts. It also recommends using cache warming for predictable traffic patterns.
Example SLOs (starting points)
Metric | Example target
Latency (end-to-end) | ~700 ms (tune to your application)
Freshness (index lag) | ~10 minutes (depends on data volatility)
Security | Zero critical ACL leaks
Availability | ≥ 99.9%
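Targets like these are easier to enforce when they live in code next to the metrics they govern. The sketch below encodes the example targets and reports which ones are missed. The metric names are hypothetical, treating the latency target as a P95 figure is an assumption (the lesson does not say which percentile it means), and the observed values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float
    higher_is_better: bool = False

    def met(self, observed):
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


SLOS = [
    SLO("latency_p95_ms", 700),
    SLO("index_lag_minutes", 10),
    SLO("critical_acl_leaks", 0),
    SLO("availability_pct", 99.9, higher_is_better=True),
]


def violations(observed):
    """Return the names of SLOs that are missed or have no measurement."""
    return [s.name for s in SLOS
            if s.name not in observed or not s.met(observed[s.name])]


observed = {"latency_p95_ms": 640, "index_lag_minutes": 14,
            "critical_acl_leaks": 0, "availability_pct": 99.95}
print(violations(observed))  # ['index_lag_minutes']
```

Treating a missing measurement as a violation keeps an unmonitored metric from silently passing.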
Start conservative, measure real traffic, and iterate.
Takeaways
  • Start with hybrid search as a safe default for production.
  • Invest in chunking and metadata: preserve structure, include titles/headers/source IDs, and choose chunk sizes and overlaps based on use case.
  • Instrument every stage (ingest, retrieval, re-ranking, context assembly) for latency, accuracy, and security metrics — you can’t optimize what you don’t measure.
  • Use re-ranking selectively where it produces measurable precision gains; don’t re-rank everything by default.
  • Implement timeouts and graceful degradation so the system remains responsive under load.
Begin with hybrid search, put effort into chunking and metadata, and instrument the pipeline. Optimize re-ranking and caching based on measured cost-benefit for your workload.
Further reading and references
The material that follows dives deeper into re-ranking and how to measure its cost-benefit trade-offs before rolling it out broadly.
