
- Run queries against the hybrid retriever (BM25 + vector store) and inspect which document chunks are returned as sources.
- Tune chunking parameters (size / overlap), per-retriever candidate counts (`k_each`), and `final_k` (how many merged candidates are passed to the LLM).
- Check for hallucinations by asking questions that cannot be answered from your corpus.
- Iterate: change one parameter at a time and re-run a fixed set of test queries to measure the effect.
- `read_text_files`: read `.txt` documents into a dict mapping filenames to text.
- `chunk_text`: split large documents into overlapping chunks to maintain context during retrieval.
- `tokenize`: simple tokenizer useful for building BM25 indices.
- `rrf_merge`: Reciprocal Rank Fusion to merge ranked results from multiple retrievers.
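
To make the chunking and tokenizing helpers concrete, here is a minimal sketch. It assumes character-based windows and a lowercase regex tokenizer; the signatures, defaults, and validation rules are illustrative assumptions, not the project's actual code.

```python
import re


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters and digits (assumed behaviour)."""
    return re.findall(r"[a-z0-9]+", text.lower())
```

With `chunk_size=4` and `overlap=1`, the string `"abcdefghij"` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one, which is what preserves context across boundaries.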
- `k_each`: how many candidates to fetch from each retriever (BM25, vector).
- `final_k`: how many merged, deduplicated candidates to include in the LLM prompt.
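
The interplay between `k_each` and `final_k` is easiest to see in the merge step. The following is a sketch of Reciprocal Rank Fusion under stated assumptions: each retriever returns a ranked list of chunk ids, ranks are 1-based, and the smoothing constant `rrf_k=60` is a commonly used default rather than a value taken from this project.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], final_k: int = 5, rrf_k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each hit contributes 1 / (rrf_k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    merged = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return merged[:final_k]


# Hypothetical chunk ids, for illustration only.
k_each = 3
bm25_hits = ["c3", "c1", "c7", "c8"]
vector_hits = ["c1", "c9", "c3", "c2"]
top = rrf_merge([bm25_hits[:k_each], vector_hits[:k_each]], final_k=3)
print(top)  # prints: ['c1', 'c3', 'c9']
```

Chunks that appear in both lists (`c1`, `c3`) accumulate score from each retriever and rise to the top, and because scores are keyed by chunk id the merge also deduplicates.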
- Example corpus for these tests: two books — Adventures of Sherlock Holmes and Frankenstein.
- Test question: “Who is the narrator of this document?”
- If retrieval returns chunks from multiple books, that is expected for a mixed corpus. To isolate testing to a single book, reset the index and re-ingest only that book.

Be careful when removing your vector store directory (e.g., `rm -rf .chroma`): this will delete the indexed embeddings and cannot be undone unless you have a backup.

| Parameter | Purpose | Example / Suggested Change |
|---|---|---|
| `chunk_size` | Approximate characters per chunk. Larger preserves more context. | Increase from 800 → 1024 for broader context. |
| `overlap` | Characters that overlap between adjacent chunks. Helps with boundary-context questions. | Increase from 150 → 200. |
| `k_each` | Number of candidates fetched per retriever (BM25, vector). Higher → more recall. | `--k-each=10` |
| `final_k` | Number of merged candidates passed to the LLM after dedupe/rerank. Constrained by LLM token budget. | `--final-k=10` |
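
Because `final_k` is constrained by the LLM token budget, a quick estimate helps before raising it. The sketch below assumes roughly 4 characters per token, which is only a rough heuristic for English text; the function names and the budget figure are hypothetical. It counts retrieved chunks only, not the question or instructions.

```python
def estimate_context_tokens(final_k: int, chunk_size: int,
                            chars_per_token: float = 4.0) -> int:
    """Rough upper bound on the tokens that retrieved chunks add to the prompt."""
    return int(final_k * chunk_size / chars_per_token)


def fits_budget(final_k: int, chunk_size: int, budget_tokens: int,
                chars_per_token: float = 4.0) -> bool:
    """True if the estimated chunk tokens stay within the given budget."""
    return estimate_context_tokens(final_k, chunk_size, chars_per_token) <= budget_tokens


print(estimate_context_tokens(final_k=10, chunk_size=1024))  # prints: 2560
print(estimate_context_tokens(final_k=10, chunk_size=800))   # prints: 2000
```

Under this assumption, moving from 800 to 1024 characters per chunk at `final_k=10` adds about 560 tokens of context per query.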
- Chunk size and overlap
  - Larger chunk size and more overlap preserve more contiguous context, improving answers for questions that require extended context (addresses, sequences, long descriptions).
  - Example change: set defaults to `chunk_size=1024` and `overlap=200`, then re-ingest.
- `k_each` and `final_k` (retrieval budget)
  - `k_each`: how many candidates to pull from each retriever.
  - `final_k`: number of merged candidates passed to the LLM after dedupe and rerank.
  - Example usage: `--k-each=10 --final-k=10`
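
One way a query script could accept the `--k-each` and `--final-k` flags is sketched below with `argparse`. The parser layout, the positional `question` argument, and the default values are assumptions for illustration; the actual script may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical query script."""
    parser = argparse.ArgumentParser(
        description="Query a hybrid retriever (illustrative sketch)."
    )
    parser.add_argument("question", help="question to ask against the corpus")
    # The defaults below are placeholders, not the project's real defaults.
    parser.add_argument("--k-each", type=int, default=5,
                        help="candidates to fetch from each retriever")
    parser.add_argument("--final-k", type=int, default=5,
                        help="merged candidates to pass to the LLM")
    return parser


# Passing argv explicitly keeps the example independent of the real command line.
args = build_parser().parse_args(
    ["Who lives on Baker Street?", "--k-each=10", "--final-k=10"]
)
print(args.k_each, args.final_k)  # prints: 10 10
```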
- Query phrasing matters. Some phrasings produce concise, accurate answers; others need more context from retrieval.
- Example: “Who lives on Baker Street?” often returns: “Sherlock Holmes lives on Baker Street, at 221B.”
- However, “What is Holmes’ address?” may produce “I don’t know.” if the retrieved chunks lack the exact address context.
- Keyword searches can be very effective: searching for “221B” or “Irene Adler” often returns precise chunks.
- Intentionally ask questions that are not covered by the corpus to verify the system returns “I don’t know” (or a safe decline) instead of inventing facts.
- Example queries:
- “What is Holmes’ mother’s maiden name?”
- “Which smartphone did Holmes prefer?”
- Ensure your prompt explicitly instructs the LLM: “Do not make assumptions. If the answer is not supported by the provided sources, respond: ‘I don’t know.’”
- When the system hallucinates, revisit:
- Retrieval quality (chunking and overlap)
- Score thresholds and reranking logic
- Prompt engineering (explicit refusal instructions)
- Increasing `final_k` (balance against token budgets)
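
The refusal instruction mentioned above can be placed at the top of the prompt so it applies to everything that follows. This is a minimal sketch; the `build_prompt` name and the source-labelling layout are assumptions, and only the instruction wording comes from this guide.

```python
REFUSAL_INSTRUCTION = (
    "Do not make assumptions. If the answer is not supported by the provided "
    "sources, respond: 'I don't know.'"
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt: refusal instruction, numbered sources, then the question."""
    sources = "\n\n".join(
        f"[Source {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"{REFUSAL_INSTRUCTION}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sources also makes it easier to inspect which retrieved chunk an answer relied on.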
- Sanity checks: ask questions with known answers from the corpus.
- Hallucination checks: ask questions that are outside the corpus.
- Rephrase checks: ask the same question with different phrasing to test semantic search robustness.
- Isolate tests: re-ingest a single document for deterministic behavior.
- Systematic tuning: change one parameter at a time (chunk size, then `k_each`, then `final_k`) and record results.
- Track compute: increasing `k_each` and `final_k` raises recall but also uses more compute and tokens.
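
Changing one parameter at a time can be made mechanical by generating configurations that each differ from a baseline in exactly one value. In this sketch the baseline `chunk_size` and `overlap` follow the values mentioned in this guide, while the baseline `k_each` and `final_k` of 5 are assumed placeholders.

```python
def one_at_a_time(baseline: dict, candidates: dict) -> list[dict]:
    """Return configs that each differ from the baseline in exactly one parameter."""
    configs = []
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline, nothing to measure
            configs.append({**baseline, name: value})
    return configs


baseline = {"chunk_size": 800, "overlap": 150, "k_each": 5, "final_k": 5}
candidates = {"chunk_size": [1024], "k_each": [10], "final_k": [10]}

for config in one_at_a_time(baseline, candidates):
    print(config)
```

Each printed configuration can then be run against the same fixed query set, so any change in results is attributable to a single parameter. Note that a new `chunk_size` or `overlap` requires re-ingesting the corpus, whereas `k_each` and `final_k` only affect query time.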

When tuning, change one parameter at a time (e.g., chunk size, then `k_each`, then `final_k`) and re-run a fixed set of test queries so you can measure the effect of each change.

- Testing a RAG system is iterative: tune chunking, retrieval budgets, and reranking while validating results with curated test queries.
- Use deterministic checks (keyword-based queries) plus open-ended checks (possible hallucination prompts).
- Add clear refusal instructions in your prompt so the LLM avoids unsupported inferences.
- Maintain a fixed suite of positive and negative test queries to measure improvement over time.
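
A fixed suite of positive and negative queries can be as small as the sketch below. It assumes the pipeline is exposed as a function that takes a question and returns the answer text; `run_suite`, the substring check, and the case lists are illustrative choices, with the questions taken from the examples in this guide.

```python
from typing import Callable

# Positive cases pair a question with a substring the answer should contain.
POSITIVE_CASES = [("Who lives on Baker Street?", "221b")]
# Negative cases are outside the corpus, so the expected answer is a refusal.
NEGATIVE_CASES = [
    "What is Holmes' mother's maiden name?",
    "Which smartphone did Holmes prefer?",
]
REFUSAL = "i don't know"


def normalize(text: str) -> str:
    """Lowercase and replace curly apostrophes so comparisons are stable."""
    return text.lower().replace("\u2019", "'")


def run_suite(answer: Callable[[str], str]) -> dict:
    """Run every case through `answer` and collect the questions that fail."""
    failures = []
    for question, expected in POSITIVE_CASES:
        if expected not in normalize(answer(question)):
            failures.append(question)
    for question in NEGATIVE_CASES:
        if REFUSAL not in normalize(answer(question)):
            failures.append(question)
    total = len(POSITIVE_CASES) + len(NEGATIVE_CASES)
    return {"total": total, "passed": total - len(failures), "failures": failures}
```

Recording the returned summary after every parameter change gives a simple pass count to compare across runs. Substring matching is deliberately crude; it catches obvious regressions but will not judge answer quality.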