Demo Testing your RAG System

Now that the hybrid RAG pipeline (BM25 + vector store + LLM) is running, this guide shows how to test it, inspect retrieved sources, and tune the main knobs (chunking, retrieval budget, reranking) to improve accuracy and reduce hallucination.

Overview

Run queries against the hybrid retriever (BM25 + vector store) and inspect which document chunks are returned as sources.
Tune chunking parameters (size / overlap), per-retriever candidate counts (k_each), and final_k (how many merged candidates are passed to the LLM).
Check for hallucinations by asking questions that cannot be answered from your corpus.
Iterate: change one parameter at a time and re-run a fixed set of test queries to measure the effect.

Key helper functions (core snippets)

The following core functions are typical building blocks for a small hybrid RAG prototype:

read_text_files: read .txt documents into a dict mapping filenames to text.
chunk_text: split large documents into overlapping chunks to maintain context during retrieval.
tokenize: simple tokenizer useful for building BM25 indices.
rrf_merge: Reciprocal Rank Fusion to merge ranked results from multiple retrievers.

Example implementation:

# python
from pathlib import Path
from typing import List, Dict
from collections import defaultdict
import re

def read_text_files(root: Path) -> Dict[str, str]:
    """Return dict[file_name] = text"""
    files = {}
    for p in root.glob("*.txt"):
        files[p.name] = p.read_text(encoding="utf-8")
    return files

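def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> List[str]:
    """Split text into overlapping chunks of approx chunk_size characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        # A non-positive step would make the chunking loop run forever.
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    step = chunk_size - overlap  # distance between consecutive chunk starts
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
    return chunks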

def tokenize(s: str) -> List[str]:
    """Very small tokenizer for BM25 corpus building."""
    # lowercase, split on non-word characters
    return [t for t in re.split(r"\W+", s.lower()) if t]

def rrf_merge(list_a: List[str], list_b: List[str], k: int = 60, topn: int = 5) -> List[str]:
    """Reciprocal Rank Fusion for two ranked lists of IDs."""
    scores = defaultdict(float)
    for lst in (list_a, list_b):
        for rank, _id in enumerate(lst):
            scores[_id] += 1.0 / (k + rank + 1)
    return [x for x, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)][:topn]
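
To make the fusion step concrete, here is a small worked example of Reciprocal Rank Fusion using the same scoring rule as rrf_merge above (1 / (k + rank + 1) with 0-based ranks). The chunk IDs and the helper name rrf_scores are hypothetical and chosen only for illustration; the two input rankings are made up, not output from a real index.

```python
from collections import defaultdict
from typing import Dict, List

def rrf_scores(ranked_lists: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Return the fused RRF score for every ID seen in any ranked list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):  # rank is 0-based
            scores[doc_id] += 1.0 / (k + rank + 1)
    return dict(scores)

# Hypothetical rankings from the two retrievers (best match first).
bm25_ids = ["c12", "c7", "c3"]
vector_ids = ["c7", "c40", "c12"]

scores = rrf_scores([bm25_ids, vector_ids])
fused = sorted(scores, key=scores.get, reverse=True)

for doc_id in fused:
    print(f"{doc_id}: {scores[doc_id]:.5f}")
```

Chunks that appear in both lists (c7 and c12 here) accumulate two reciprocal-rank terms, so they outrank chunks that only one retriever found. That is why a hybrid setup tends to favor results both retrievers agree on.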

CLI arguments (parser snippet)

k_each: how many candidates to fetch from each retriever (BM25, vector).
final_k: how many merged, deduplicated candidates to include in the LLM prompt.

# python
import argparse

sp = argparse.ArgumentParser()
sub = sp.add_subparsers(dest="cmd", required=True)

p_ing = sub.add_parser("ingest")
p_ing.add_argument("--dir", required=True, help="Folder or .txt file")
p_ing.add_argument("--embed-model", default="nomic-embed-text")

p_ask = sub.add_parser("ask")
p_ask.add_argument("--query", required=True)
p_ask.add_argument("--llm", default="llama3:latest")
p_ask.add_argument("--embed-model", default="nomic-embed-text")
p_ask.add_argument("--k-each", type=int, default=6)
p_ask.add_argument("--final-k", type=int, default=5)

args = sp.parse_args()
if args.cmd == "ingest":
    ingest(args.dir, embedding_model=args.embed_model)
else:
    ask(args.query, llm_model=args.llm, embedding_model=args.embed_model,
        k_each=args.k_each, final_k=args.final_k)
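
The flags are written with hyphens on the command line (--k-each, --final-k), but argparse exposes them as attributes with underscores (args.k_each, args.final_k). The sketch below rebuilds only the ask subcommand from the parser above and parses two sample command lines in-process. The build_parser wrapper is an assumption added for illustration and is not part of the original script.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Rebuild just the `ask` subcommand so the parsing can be checked in isolation."""
    parser = argparse.ArgumentParser(prog="hybrid_rag.py")
    sub = parser.add_subparsers(dest="cmd", required=True)
    p_ask = sub.add_parser("ask")
    p_ask.add_argument("--query", required=True)
    p_ask.add_argument("--k-each", type=int, default=6)
    p_ask.add_argument("--final-k", type=int, default=5)
    return parser

parser = build_parser()

# No tuning flags: the defaults from the parser apply.
defaults = parser.parse_args(["ask", "--query", "Who lives on Baker Street?"])

# Explicit retrieval budget, using the --flag=value form.
tuned = parser.parse_args(
    ["ask", "--query", "What is Holmes' address?", "--k-each=10", "--final-k=10"]
)

print(defaults.k_each, defaults.final_k)  # 6 5
print(tuned.k_each, tuned.final_k)        # 10 10
```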

First queries and expected behavior

Example corpus for these tests: two books — Adventures of Sherlock Holmes and Frankenstein.
Test question: “Who is the narrator of this document?”

Run:

# bash
python hybrid_rag.py ask --query "Who is the narrator of this document?"

Typical returned answer (example):

=== Answer ===
The narrator of "The Adventures of Sherlock Holmes" appears to be someone who is friends with Sherlock Holmes, likely Dr. John Watson, as indicated by the text in chunk 324 where it says "I remember" and "you on one occasion, in the early days of our friendship". However, the name "Dr. John Watson" may not explicitly appear in every returned chunk.

In contrast, the narrator of "Frankenstein" is not clearly identified in the given chunk.

--- Sources ---
adventuresofsherlockholmes.txt (chunk 324)
frankenstein.txt (chunk 48)
...

Notes:

If retrieval returns chunks from multiple books, that is expected for a mixed corpus. To isolate testing to a single book, reset the index and re-ingest only that book.

Resetting and re-ingesting

If you need to reset the vector index (for example, ChromaDB), remove or reset the vector store directory and re-run ingestion to build a deterministic test set.

Be careful when removing your vector store directory (e.g., rm -rf .chroma) — this will delete the indexed embeddings and cannot be undone unless you have a backup.

Example commands:

# bash: reset index (implementation-specific; this example removes the vector store directory)
rm -rf .chroma
python hybrid_rag.py ingest --dir data/  # re-ingest files in data/
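
If deleting the index outright feels too risky, one option is to rename the directory to a timestamped backup before re-ingesting. The following is a minimal sketch under the assumption that the vector store lives entirely in one local directory (such as .chroma above). The function name backup_and_reset is hypothetical and not part of hybrid_rag.py.

```python
import shutil
import time
from pathlib import Path
from typing import Optional

def backup_and_reset(store_dir: Path) -> Optional[Path]:
    """Move store_dir aside to a timestamped backup.

    Returns the backup path, or None if store_dir does not exist.
    """
    if not store_dir.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = store_dir.with_name(f"{store_dir.name}.bak-{stamp}")
    shutil.move(str(store_dir), str(backup))
    return backup

# Example (not executed here): backup_and_reset(Path(".chroma"))
```

After the move, running the ingest command rebuilds the index from scratch, and the old embeddings remain available in the backup directory until you delete it.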

After ingesting only Sherlock Holmes, re-run the same query; the system should identify Dr. Watson as the narrator more consistently.

Tuning knobs and examples

Use the following parameters to tune retrieval accuracy and LLM input quality.

| Parameter | Purpose | Example / Suggested Change |
| --- | --- | --- |
| `chunk_size` | Approximate characters per chunk. Larger preserves more context. | Increase from `800` → `1024` for broader context. |
| `overlap` | Characters that overlap between adjacent chunks. Helps with boundary-context questions. | Increase from `150` → `200`. |
| `k_each` | Number of candidates fetched per retriever (BM25, vector). Higher → more recall. | `--k-each=10` |
| `final_k` | Number of merged candidates passed to the LLM after dedupe/rerank. Constrained by LLM token budget. | `--final-k=10` |

Chunk size and overlap
- Larger chunk size and more overlap preserve more contiguous context, improving answers for questions that require extended context (addresses, sequences, long descriptions).
- Example change: set defaults to chunk_size=1024 and overlap=200, then re-ingest.

# python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 200) -> List[str]:
    ...
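
Changing chunk_size and overlap also changes how many chunks get indexed, which affects ingestion time and index size. The sliding window in chunk_text advances by chunk_size - overlap characters per step, so the chunk count can be estimated directly. The helper name estimate_chunk_count and the 600,000-character document length are assumptions for illustration, not measurements of the actual books.

```python
import math

def estimate_chunk_count(n_chars: int, chunk_size: int, overlap: int) -> int:
    """Chunks produced by a sliding window that starts a new chunk every
    (chunk_size - overlap) characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return math.ceil(n_chars / step) if n_chars > 0 else 0

n_chars = 600_000  # hypothetical document length in characters

before = estimate_chunk_count(n_chars, chunk_size=800, overlap=150)
after = estimate_chunk_count(n_chars, chunk_size=1024, overlap=200)

print(before)  # 924 chunks with a step of 650 characters
print(after)   # 729 chunks with a step of 824 characters
```

Larger chunks mean fewer of them, but each one takes more of the prompt, so a given final_k consumes more of the LLM token budget after this change.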

k_each and final_k (retrieval budget)
- k_each: how many candidates to pull from each retriever (per-retriever).
- final_k: number of merged candidates passed to the LLM after dedupe and rerank.
- Example usage:

# bash
python hybrid_rag.py ask --query "What is Holmes' address?" --k-each=10 --final-k=10
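
The two settings interact: final_k can never return more chunks than the de-duplicated pool that k_each produces. The sketch below shows only that bookkeeping, using made-up chunk IDs. The candidate_pool helper is hypothetical, and the ordering of the pool here is first-seen order rather than the RRF ranking the pipeline would apply.

```python
from typing import List

def candidate_pool(bm25_ranked: List[str], vector_ranked: List[str], k_each: int) -> List[str]:
    """Take the top k_each IDs from each retriever and drop duplicates,
    keeping first-seen order."""
    pool = bm25_ranked[:k_each] + vector_ranked[:k_each]
    return list(dict.fromkeys(pool))

# Hypothetical rankings (best match first).
bm25_ranked = ["c1", "c2", "c3", "c4"]
vector_ranked = ["c2", "c5", "c6", "c1"]

small_pool = candidate_pool(bm25_ranked, vector_ranked, k_each=2)
large_pool = candidate_pool(bm25_ranked, vector_ranked, k_each=4)

final_k = 5
print(small_pool, min(final_k, len(small_pool)))  # 3 unique candidates, so only 3 reach the LLM
print(large_pool, min(final_k, len(large_pool)))  # 6 unique candidates, so final_k=5 is the limit
```

When raising final_k appears to have no effect, check whether k_each is the real bottleneck.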

Observations and examples

Query phrasing matters. Some phrasings produce concise, accurate answers; others need more context from retrieval.
- Example: “Who lives on Baker Street?” often returns: “Sherlock Holmes lives on Baker Street, at 221B.”
- However, “What is Holmes’ address?” may produce “I don’t know.” if the retrieved chunks lack the exact address context.
Keyword searches can be very effective: searching for “221B” or “Irene Adler” often returns precise chunks.

Example:

# bash
python hybrid_rag.py ask --query "Who does Irene Adler marry?" --k-each=12 --final-k=10

=== Answer ===
Irene Adler marries Godfrey Norton, an English lawyer.

=== Sources ===
adventuresofsherlockholmes.txt (chunk 28)
adventuresofsherlockholmes.txt (chunk 36)
...

Testing for hallucination

Intentionally ask questions that are not covered by the corpus to verify the system returns “I don’t know” (or a safe decline) instead of inventing facts.
Example queries:
- “What is Holmes’ mother’s maiden name?”
- “Which smartphone did Holmes prefer?”

Best practices:

Ensure your prompt explicitly instructs the LLM: “Do not make assumptions. If the answer is not supported by the provided sources, respond: ‘I don’t know.’”
When the system hallucinates, revisit:
- Retrieval quality (chunking and overlap)
- Score thresholds and reranking logic
- Prompt engineering (explicit refusal instructions)
- The final_k setting (a higher value supplies more context but must be balanced against token budgets)

Example:

# bash
python hybrid_rag.py ask --query "What is Holmes' mother's maiden name?"
# => I don't know.
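
One way to wire the refusal instruction into the prompt, and to detect a refusal in automated checks, is sketched below. The prompt layout, the numbering of sources, and the names build_prompt and is_refusal are assumptions for illustration. The prompt used by hybrid_rag.py is not shown in this guide and may differ.

```python
from typing import List

REFUSAL = "I don't know."

def build_prompt(question: str, chunks: List[str]) -> str:
    """Assemble a grounded prompt with an explicit refusal instruction."""
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below. Do not make assumptions. "
        f"If the answer is not supported by the provided sources, respond: {REFUSAL}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

def is_refusal(answer: str) -> bool:
    """True if the answer starts with the refusal phrase (ignoring case
    and curly vs. straight apostrophes)."""
    normalized = answer.strip().lower().replace("\u2019", "'")
    return normalized.startswith("i don't know")

prompt = build_prompt(
    "What is Holmes' mother's maiden name?",
    ["...retrieved chunk text...", "...another retrieved chunk..."],
)
print(prompt)
```

Matching on a fixed refusal phrase is brittle if the model paraphrases its decline, so treat is_refusal as a first-pass filter and spot-check the answers it flags as non-refusals.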

Testing strategies and checklist

Sanity checks: ask questions with known answers from the corpus.
Hallucination checks: ask questions that are outside the corpus.
Rephrase checks: ask the same question with different phrasing to test semantic search robustness.
Isolate tests: re-ingest a single document for deterministic behavior.
Systematic tuning: change one parameter at a time (chunk size, then k_each, then final_k) and record results.
Track compute: increasing k_each and final_k raises recall but also uses more compute and tokens.

Practical tip

When tuning, change one parameter at a time (e.g., chunk size, then k_each, then final_k) and re-run a fixed set of test queries so you can measure the effect of each change.
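
A fixed test set can be as simple as a list of queries with an expected substring, or None when the system should decline. The harness below is a minimal sketch: ask_fn stands in for whatever function returns the answer text in your pipeline, and fake_ask is a stub with canned answers taken from the examples in this guide so the harness can be run on its own. QueryCase, passes, and run_suite are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class QueryCase:
    query: str
    expect: Optional[str]  # substring the answer must contain; None = must refuse

def passes(case: QueryCase, answer: str) -> bool:
    """Check one answer against its expectation, ignoring case and apostrophe style."""
    text = answer.lower().replace("\u2019", "'")
    if case.expect is None:
        return "i don't know" in text
    return case.expect.lower() in text

def run_suite(ask_fn: Callable[[str], str], cases: List[QueryCase]) -> float:
    """Run every case, print PASS/FAIL per query, and return the pass rate."""
    results = [passes(case, ask_fn(case.query)) for case in cases]
    for case, ok in zip(cases, results):
        print(f"{'PASS' if ok else 'FAIL'}  {case.query}")
    return sum(results) / len(results) if results else 0.0

CASES = [
    QueryCase("Who does Irene Adler marry?", "Godfrey Norton"),  # sanity check
    QueryCase("Who lives on Baker Street?", "221B"),             # keyword check
    QueryCase("What is Holmes' mother's maiden name?", None),    # hallucination check
]

def fake_ask(query: str) -> str:
    """Stub standing in for the real pipeline; replace with a call into your RAG code."""
    canned = {
        "Who does Irene Adler marry?": "Irene Adler marries Godfrey Norton, an English lawyer.",
        "Who lives on Baker Street?": "Sherlock Holmes lives on Baker Street, at 221B.",
    }
    return canned.get(query, "I don't know.")

pass_rate = run_suite(fake_ask, CASES)
print(f"pass rate: {pass_rate:.0%}")
```

Recording the pass rate alongside the parameter values used for each run makes it straightforward to see which change helped and which did not.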

Summary

Testing a RAG system is iterative: tune chunking, retrieval budgets, and reranking while validating results with curated test queries.
Use deterministic checks (keyword-based queries) plus open-ended checks (possible hallucination prompts).
Add clear refusal instructions in your prompt so the LLM avoids unsupported inferences.
Maintain a fixed suite of positive and negative test queries to measure improvement over time.
