Guide to retrieval methods and system design for retrieval augmented generation covering hybrid, sparse and dense search, chunking, re-ranking, caching, security, observability, and operational trade-offs
This lesson explains retrieval methods and the critical design considerations for building retrieval-augmented generation (RAG) systems. Retrieval is the foundation of RAG quality: an LLM generates grounded answers only when the retrieved context is relevant and correct. Even a strong LLM will produce incorrect output if retrieval returns irrelevant or wrong documents.
Key system design dials
Latency — users expect fast responses; the retrieval path must be optimized for P95/P99 latency.
Cost — compute (re-rankers, LLM tokens) and storage (indexes, embeddings) grow with scale.
Trust — freshness, provenance, and cited answers increase user confidence.
Common failure modes
Wrong chunking: retrieving tangential fragments instead of the exact answer.
Stale index: returning outdated information because the index wasn’t refreshed.
ACL leaks: returning documents a user shouldn’t see.
Plan mitigations early (refresh schedules, fine-grained ACLs, chunking rules) and add observability so you can detect these failure modes quickly.
Retrieval pipeline (high level)
Ingest: split documents into chunks, create embeddings, and index them.
Query: convert the user query into keywords and an embedding vector, run filtered searches, and re-rank the candidates.
Generation: pack and send the assembled context + query to the LLM to produce a grounded answer.
Each stage is an opportunity to measure, optimize, and secure data flows.
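One way to make each stage measurable is to time it separately. The following is a minimal sketch using only the Python standard library; `stage_timer`, `STAGE_TIMINGS`, and the stage bodies are hypothetical placeholders, and a production system would more likely export these durations to a metrics backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory sink: stage name -> list of durations in milliseconds.
STAGE_TIMINGS = defaultdict(list)


@contextmanager
def stage_timer(stage):
    """Record the wall-clock duration of one pipeline stage, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        STAGE_TIMINGS[stage].append((time.perf_counter() - start) * 1000.0)


def answer(query):
    # Each body below is a placeholder for the real stage.
    with stage_timer("retrieve"):
        candidates = ["chunk-a", "chunk-b"]
    with stage_timer("rerank"):
        top = candidates[:1]
    with stage_timer("generate"):
        return f"answer to {query!r} using {top}"
```

Per-stage samples like these are what P95/P99 latency figures are computed from, and they show which stage to optimize first.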
Cross-cutting concerns
Security: enforce who can see what (ACLs, filtering, query-time authorization).
Caching: accelerate the fast path to reduce latency and cost.
Observability: instrument ingest, retrieval, re-ranking, and generation for latency, error rates, and data correctness.
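Query-time authorization can be sketched as a filter over retrieved chunks. The example below assumes each chunk carries a set of allowed groups in its metadata; the `Chunk` type, the `allowed_groups` field, and the group names are illustrative assumptions, not part of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset = frozenset()


def authorize(chunks, user_groups):
    """Keep only chunks that share at least one group with the user.

    A chunk with no allowed groups is treated as private (deny by default).
    """
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


candidates = [
    Chunk("c1", "Public handbook", frozenset({"everyone"})),
    Chunk("c2", "Salary bands", frozenset({"hr"})),
    Chunk("c3", "Unlabeled draft"),
]
visible = authorize(candidates, {"everyone", "engineering"})
print([c.chunk_id for c in visible])  # prints ['c1']
```

Where the search engine supports metadata filters, applying the same rule inside the search itself keeps restricted documents from using up candidate slots; a post-retrieval check like this one can then act as a second line of defense.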
Retrieval methods — quick overview
Sparse search (keyword-based): precise for exact tokens, code snippets, error strings, and legal text.
Dense retrieval (embeddings): finds semantic matches and paraphrases; good for natural-language questions and cross-lingual queries.
Hybrid search: fuses sparse + dense signals (often via score fusion). A safe default because it captures both exact matches and paraphrases.
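As an illustration of fusing sparse and dense results, here is a minimal sketch of reciprocal rank fusion (RRF). RRF is a rank-based alternative to direct score fusion: it needs no score normalization because it only uses each document's position in each list. The document IDs are made up, and `k=60` is a commonly used constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_hits = ["doc_err_42", "doc_faq", "doc_intro"]  # e.g. from keyword search
dense_hits = ["doc_faq", "doc_howto", "doc_err_42"]   # e.g. from embeddings
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Documents that appear in both lists rise to the top, while documents found by only one retriever are still kept, which is the behavior that makes hybrid search a safe default.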
Recommendation: start with hybrid search for broad coverage, monitor failure modes, and then optimize weights and tuning based on real traffic data.
Chunking basics
Chunking breaks documents into retrieval units. Chunk size and overlap strongly affect recall and precision:
Smaller chunks → higher precision, but can lose global context.
Larger chunks → preserve context, but may decrease specificity and increase token costs.
Common chunking approaches
Fixed-size chunks (tokens or characters) with a small overlap (10–20%) to preserve continuity.
Semantic chunks aligned to natural boundaries (sections, paragraphs).
Structure-aware chunking that preserves headings, titles, and document hierarchy.
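The fixed-size approach can be sketched as a sliding window. This example assumes the text is already tokenized; whitespace splitting stands in for a real tokenizer, and the default of 200 tokens with 30 tokens of overlap (15%) is only an illustrative starting point.

```python
def chunk_tokens(tokens, size=200, overlap=30):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, size=4, overlap=1)
# chunks[0] ends with "four" and chunks[1] starts with "four":
# the shared token is the overlap that preserves continuity.
```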
Preserve document structure and attach rich metadata (titles, section headers, source IDs). Metadata helps re-rankers and the LLM understand relationships between chunks.
Preserve structure and metadata
Keep titles and section headers with their content.
Maintain logical document hierarchy when chunking.
Add metadata fields (source, section ID, published timestamp) to each chunk.
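A minimal sketch of structure-aware chunking with metadata follows. It assumes a simple input convention in which blocks are separated by blank lines and headings start with `# `; the `Chunk` fields, the sample document, and the `render` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_id: str
    title: str
    section: str
    section_id: int
    published: str  # ISO 8601 timestamp supplied by the caller


def chunk_by_section(doc_text, source_id, title, published):
    """Emit one chunk per paragraph, each tagged with its enclosing section."""
    chunks, section, section_id = [], "", 0
    for block in doc_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("# "):  # assumed heading marker
            section, section_id = block[2:].strip(), section_id + 1
            continue
        chunks.append(Chunk(block, source_id, title, section, section_id, published))
    return chunks


def render(chunk):
    """Prefix the text with title and section so the chunk is self-describing."""
    return f"[{chunk.title} > {chunk.section}] {chunk.text}"


doc = "# Install\n\nRun the installer.\n\n# Configure\n\nEdit the config file.\n\nRestart the service."
chunks = chunk_by_section(doc, "kb-001", "Admin Guide", "2024-01-01T00:00:00Z")
```

Keeping the section name on every chunk means a paragraph such as "Restart the service." still carries its "Configure" context when it is retrieved on its own.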
Special cases
Tables, code blocks, images, and scanned documents often need dedicated extractors or OCR and syntax-aware chunkers. Source code should be treated differently from prose: preserve imports, function boundaries, and call graphs. Consider language-aware chunking for code and binary data.
Vector search
Two common ANN index families to consider:
| Index family | When to use | Notes |
| --- | --- | --- |
| HNSW (Hierarchical Navigable Small World) | Default for many production setups where recall and latency matter | Excellent recall and low-latency queries; higher memory footprint |
| IVF / IVF-PQ | Very large corpora where memory is constrained | More memory-efficient, can be faster at scale if tuned; higher tuning complexity |
Important tuning knobs
K (candidates): number of nearest neighbors returned by the ANN search. Typical starting points: K=50 for re-ranking pipelines, K=10 for direct results.
efSearch / nprobe: controls internal search breadth. Higher values increase recall at the cost of latency (efSearch for HNSW; nprobe for IVF/IVF-PQ).
Start with HNSW defaults, measure P95 latency and retrieval quality (for example Recall@K against exact search, or Precision@K on labeled queries) on real traffic, then tune K and efSearch/nprobe against your SLAs.
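To make the breadth-versus-recall trade-off concrete, the sketch below implements a toy inverted-file (IVF) index in pure Python and measures recall against exact search for several `nprobe` values. It is a teaching model, not a real ANN library: the centroids are sampled at random instead of trained with k-means, and every function name here is an assumption for illustration.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign every vector index to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]


def exact_search(query, vectors, k):
    return sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:k]


def recall_at_k(approx, exact):
    """Fraction of the true nearest neighbors that the approximate search found."""
    return len(set(approx) & set(exact)) / len(exact)


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids = rng.sample(vectors, 16)  # crude stand-in for trained centroids
lists = build_ivf(vectors, centroids)
query = [rng.random() for _ in range(8)]
exact = exact_search(query, vectors, k=10)
for nprobe in (1, 4, 16):
    approx = ivf_search(query, vectors, centroids, lists, k=10, nprobe=nprobe)
    print(nprobe, recall_at_k(approx, exact))
```

Probing all 16 lists is equivalent to exact search, so recall reaches 1.0 there; smaller `nprobe` values scan fewer vectors and may miss some true neighbors. `efSearch` plays the analogous role for HNSW.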
Re-ranking and context packing
Typical flow
Initial retrieval (hybrid) returns top N candidates (e.g., top 50).
Cross-encoder re-ranker scores query-document pairs with a more expensive model and selects a smaller, high-precision set (e.g., top 5).
Context packing: dedupe, order by relevance and document structure, and assemble the context with citations for transparency.
This “cheap broad retrieval → expensive targeted re-ranking” pattern converts good retrieval into great retrieval.
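The flow above can be sketched end to end. In this example `overlap_score` is a deliberately crude placeholder for a cross-encoder (it only counts shared terms), the candidate dictionaries and source paths are invented, and the packing step orders by relevance only, without the document-structure ordering a fuller implementation might add.

```python
def overlap_score(query, text):
    """Placeholder relevance score: fraction of query terms found in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, candidates, top_n, score_fn=overlap_score):
    """Re-score every candidate against the query and keep the best `top_n`."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return scored[:top_n]


def pack_context(chunks, max_chars=2000):
    """Drop duplicate texts, then build a numbered context block with sources."""
    seen, lines, used = set(), [], 0
    for chunk in chunks:
        key = " ".join(chunk["text"].lower().split())  # ignore case and spacing
        if key in seen:
            continue
        line = f"[{len(lines) + 1}] {chunk['text']} (source: {chunk['source']})"
        if used + len(line) > max_chars:
            break
        seen.add(key)
        lines.append(line)
        used += len(line)
    return "\n".join(lines)


candidates = [
    {"text": "Reset your password from the account settings page.", "source": "help/account.md"},
    {"text": "Billing runs on the first day of each month.", "source": "help/billing.md"},
    {"text": "reset your password from the account settings page.", "source": "faq.md"},
    {"text": "Password rules require twelve characters.", "source": "help/security.md"},
]
top = rerank("how do I reset my password", candidates, top_n=3)
context = pack_context(top)
```

The numbered entries with their sources give the LLM something to cite, and the character budget is a simple stand-in for a token budget.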
Costs and trade-offs
Cross-encoder re-ranking often yields roughly 15–30% better relevance for the final assembled context, though the gain varies with the corpus and the evaluation metric, and it increases compute cost and latency.
Always set timeouts: if re-ranking exceeds the budget, fall back to the initial candidates. Users generally prefer a fast, approximate answer to a slow, perfect one.
Graceful degradation
Implement fallback paths to cheaper or cached results when re-rankers or downstream services are slow. Maintain user experience by returning timely answers even during partial failures.
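A timeout with fallback might look like the following sketch, which runs the re-ranker in a worker thread and returns the initial retrieval order if the budget is exceeded. The function names, the 50 ms budget, and the simulated slow re-ranker are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool, so the request path never waits for a slow worker to finish.
_POOL = ThreadPoolExecutor(max_workers=4)


def rerank_with_budget(rerank_fn, query, candidates, top_n, budget_s):
    """Return (results, degraded); fall back to retrieval order on timeout or error."""
    future = _POOL.submit(rerank_fn, query, candidates)
    try:
        return future.result(timeout=budget_s)[:top_n], False
    except FutureTimeout:
        future.cancel()  # only stops a task that has not started yet
    except Exception:
        pass  # treat re-ranker failures like timeouts
    return candidates[:top_n], True


def slow_reranker(query, candidates):
    time.sleep(0.3)  # simulates an overloaded re-ranking service
    return sorted(candidates)


results, degraded = rerank_with_budget(
    slow_reranker, "q", ["b", "a", "c"], top_n=2, budget_s=0.05
)
# The budget is exceeded, so `results` keeps the retrieval order and
# `degraded` is True.
```

Returning the `degraded` flag lets the caller log how often the fallback path is taken, which is a useful signal for capacity planning.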
Operational knobs and caching
Key configuration knobs to monitor and tune:
Hybrid weights (sparse vs dense).
Retrieve depth (K candidates).
ANN search parameters (efSearch / nprobe).
Re-rank depth (how many candidates to re-score).
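These knobs can be gathered into one validated configuration object. The sketch below also shows how a hybrid weight might be applied through min-max normalized weighted score fusion; the field names, default values, and sample scores are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalKnobs:
    dense_weight: float = 0.5  # 0.0 = sparse only, 1.0 = dense only
    retrieve_k: int = 50       # candidates fetched from each retriever
    ef_search: int = 64        # ANN breadth; placeholder value
    rerank_depth: int = 20     # candidates passed on to the re-ranker

    def __post_init__(self):
        if not 0.0 <= self.dense_weight <= 1.0:
            raise ValueError("dense_weight must be in [0, 1]")
        if not 0 < self.rerank_depth <= self.retrieve_k:
            raise ValueError("rerank_depth must be in (0, retrieve_k]")


def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1] so both score scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 1.0 for d, s in scores.items()}


def weighted_fusion(sparse, dense, knobs):
    """Blend normalized sparse and dense scores; return IDs for the re-ranker."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    w = knobs.dense_weight
    fused = {
        d: (1 - w) * sparse_n.get(d, 0.0) + w * dense_n.get(d, 0.0)
        for d in set(sparse_n) | set(dense_n)
    }
    return sorted(fused, key=lambda d: (-fused[d], d))[: knobs.rerank_depth]


sparse = {"a": 12.0, "b": 7.0, "c": 2.0}    # keyword-style scores
dense = {"b": 0.91, "c": 0.89, "d": 0.40}   # similarity-style scores
balanced = weighted_fusion(sparse, dense, RetrievalKnobs())
dense_only = weighted_fusion(sparse, dense, RetrievalKnobs(dense_weight=1.0))
```

Keeping the knobs in one immutable object makes it easier to log the exact configuration alongside each measurement when comparing tuning runs.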
Caching recommendations
Cache query embeddings for frequent queries.
Cache ANN candidate sets and assembled context packs for hot queries.
Use cache warming for predictable traffic (e.g., daily reports, scheduled queries).
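As a sketch of the first recommendation, the example below caches query embeddings behind a small time-to-live (TTL) cache. `TTLCache`, `fake_embed`, and the 300-second TTL are hypothetical; the cache is unbounded, so a real deployment would also need a size limit or an eviction policy.

```python
import time


class TTLCache:
    """Tiny time-bounded cache; entries expire `ttl_s` seconds after being stored."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._store[key] = (now + self.ttl_s, value)
        return value


def normalize(query):
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


calls = []


def fake_embed(query):
    # Stand-in for a real embedding call; records each invocation.
    calls.append(query)
    return [float(len(query))]


embedding_cache = TTLCache(ttl_s=300)


def embed_cached(query):
    key = normalize(query)
    return embedding_cache.get_or_compute(key, lambda: fake_embed(key))
```

When the same idea is applied to candidate sets or assembled context packs, two extra precautions seem prudent: include the user's permission scope in the cache key so cached results cannot bypass ACLs, and keep the TTL no longer than the freshness target.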
Example SLOs (starting points)
| Metric | Example target |
| --- | --- |
| Latency (end-to-end) | ~700 ms (tune to your application) |
| Freshness (index lag) | ~10 minutes (depends on data volatility) |
| Security | Zero critical ACL leaks |
| Availability | ≥ 99.9% |
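Targets like these can be checked automatically over a measurement window. The sketch below copies the example values into a dictionary and reports which ones are violated; it assumes the latency target applies to P95 (the targets do not name a percentile), and the function and key names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Example targets; all of them are starting points to tune.
SLO = {"latency_ms": 700, "index_lag_s": 600, "acl_leaks": 0, "availability": 0.999}


def check_slos(latencies_ms, index_lag_s, acl_leaks, ok_requests, total_requests):
    """Return the names of the SLOs that the observed window violates."""
    observed = {
        "latency_ms": percentile(latencies_ms, 95),  # assumption: target is P95
        "index_lag_s": index_lag_s,
        "acl_leaks": acl_leaks,
    }
    violations = [name for name, value in observed.items() if value > SLO[name]]
    if total_requests and ok_requests / total_requests < SLO["availability"]:
        violations.append("availability")
    return violations


# 6% of requests took 900 ms, so P95 exceeds the 700 ms target.
window = [100] * 94 + [900] * 6
print(check_slos(window, index_lag_s=120, acl_leaks=0,
                 ok_requests=1000, total_requests=1000))  # prints ['latency_ms']
```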
Start conservative, measure real traffic, and iterate.
Takeaways
Start with hybrid search as a safe default for production.
Invest in chunking and metadata: preserve structure, include titles/headers/source IDs, and choose chunk sizes and overlaps based on use case.
Instrument every stage (ingest, retrieval, re-ranking, context assembly) for latency, accuracy, and security metrics — you can’t optimize what you don’t measure.
Use re-ranking selectively where it produces measurable precision gains; don’t re-rank everything by default.
Implement timeouts and graceful degradation so the system remains responsive under load.
Begin with hybrid search, put effort into chunking and metadata, and instrument the pipeline. Optimize re-ranking and caching based on measured cost-benefit for your workload.