Welcome back. In this lesson we build a practical hybrid RAG (Retrieval-Augmented Generation) pipeline that combines BM25 keyword search with semantic vector search (ChromaDB), fuses results using Reciprocal Rank Fusion (RRF), and uses an Ollama LLM to produce grounded answers.
The pipeline ingests a folder of .txt documents (for example, Project Gutenberg books), chunks them, indexes chunks into ChromaDB, persists tokenized chunks for BM25, and executes hybrid queries that blend both retrieval signals to produce a final, source-attributed answer. We use a local Ollama instance for both embeddings and the LLM so everything can run locally.
Highlights
  • Chunk and index plain-text books for retrieval.
  • Build a BM25 index for exact keyword matches.
  • Build a vector index (ChromaDB) for semantic matches using Ollama embeddings.
  • Fuse BM25 and vector rankings via RRF to improve robustness.
  • Ask queries that return concise, source-grounded answers from an Ollama chat model.
Overview of the approach
  • Read .txt files from a folder.
  • Chunk the documents and store per-chunk metadata.
  • Create a BM25 corpus (tokenized chunks, persisted as pickle files).
  • Embed chunks with Ollama and upsert into a persistent Chroma collection.
  • On query: run BM25 to get top keyword hits and run Chroma vector search for semantic hits. Merge lists with RRF, fetch fused chunks, construct context, and ask an Ollama chat model to produce a concise answer that cites sources.
We demonstrate this with a public-domain text: Frankenstein by Mary Shelley. I added frankenstein.txt (the Project Gutenberg eBook) to the data folder.
[Image: Visual Studio Code showing the opening of the Project Gutenberg eBook of "Frankenstein; Or, The Modern Prometheus" by Mary Wollstonecraft Shelley, including its usage rights, author, release date, and language.]
Quick setup and warning
Before running the pipeline, install the required Python packages and make sure a local Ollama instance is running with the models you plan to use.
Warning: Ollama models must be installed locally and served by the Ollama server. This demo assumes Ollama is reachable at its default local address (http://localhost:11434). If Ollama is not running or a model is missing, embedding and LLM calls will fail.
Prerequisites
  • Python 3.8+.
  • Local Ollama installed and running with:
    • an embedding model (e.g., nomic-embed-text)
    • an LLM (e.g., llama3.3:latest)
  • Chroma will store a local persistent collection in .chroma/.
Install Python dependencies:
pip install chromadb ollama rank-bm25 tqdm
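Because embedding and LLM calls fail when Ollama is down or a model is missing, a small preflight check can save a failed ingest run. The sketch below is not part of hybrid_rag.py. It assumes Ollama's default HTTP address (http://localhost:11434) and its /api/tags endpoint, which lists locally installed models; the helper names installed_models, missing_models, and preflight are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, List, Optional

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434"


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> Optional[List[str]]:
    """Return locally installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers bad JSON.
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


def missing_models(installed: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return required models that are not installed, ignoring an implicit ':latest' tag."""
    def base(name: str) -> str:
        return name[: -len(":latest")] if name.endswith(":latest") else name

    have = {base(n) for n in installed}
    return [r for r in required if base(r) not in have]


def preflight(required: Iterable[str]) -> bool:
    """Print a short diagnostic and return True when everything looks ready."""
    names = installed_models()
    if names is None:
        print("Ollama is not reachable; is the server running?")
        return False
    missing = missing_models(names, required)
    for name in missing:
        print(f"Model not installed: {name} (try: ollama pull {name})")
    return not missing
```

Calling preflight(["nomic-embed-text"]), plus whichever chat model you configured, before ingest gives an early, readable failure instead of an exception partway through embedding.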
Key files
  • hybrid_rag.py — single script demonstrating the full pipeline.
  • data/frankenstein.txt — example corpus (Project Gutenberg).
Architecture and component mapping
Component | Purpose | Example / CLI
Keyword retriever | Fast exact-term matching | rank-bm25
Semantic retriever | Conceptual/semantic similarity | ChromaDB + Ollama embeddings
Fusion | Merge rankings from both retrievers | Reciprocal Rank Fusion (RRF)
LLM | Generate final grounded answer | Ollama chat model
Below is the hybrid_rag.py script presented in logical chunks: imports & constants, embedding helpers, utilities, RRF fusion, ingest, BM25 loader, ask function, and CLI. Note: keep each code block intact when assembling the full file.
Imports and constants
#!/usr/bin/env python3
import argparse
import hashlib
import json
import os
import pickle
import re
import shutil
import sys
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

from tqdm import tqdm
import chromadb
from rank_bm25 import BM25Okapi
import ollama

CHROMA_DIR = Path(".chroma")
COLLECTION_NAME = "books"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.3:latest"
TOP_K = 5

INDEX_DIR = Path("index")
INDEX_DIR.mkdir(exist_ok=True)
BM25_CORPUS_PKL = INDEX_DIR / "bm25_corpus_tokens.pkl"
BM25_IDS_PKL = INDEX_DIR / "bm25_ids.pkl"
Embedding helper wrappers
def _embed(text: str, model: str = EMBED_MODEL) -> List[float]:
    """Embed text, trying the prompt= call first and falling back to the input= API."""
    try:
        return ollama.embeddings(model=model, prompt=text)["embedding"]
    except (TypeError, AttributeError):
        # Assumption: newer clients expose ollama.embed(input=...), which returns a batch.
        return ollama.embed(model=model, input=text)["embeddings"][0]

def _generate(prompt: str, model: str = LLM_MODEL, temperature: float = 0.2) -> str:
    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": temperature},
    )
    return resp["message"]["content"].strip()
File reading, chunking, and tokenization utilities
def read_text_files(root: Path) -> Dict[str, str]:
    """Read all .txt files under root (or a single .txt file)."""
    files = []
    if root.is_file() and root.suffix.lower() == ".txt":
        files = [root]
    else:
        files = list(root.rglob("*.txt"))
    out: Dict[str, str] = {}
    for f in files:
        try:
            out[str(f)] = f.read_text(encoding="utf-8", errors="ignore")
        except Exception:
            out[str(f)] = f.read_text(encoding="latin-1", errors="ignore")
    return out

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> List[str]:
    """Simple sliding-window chunker. Tune chunk_size and overlap for your use case."""
    if overlap >= chunk_size:
        # A non-positive step would keep the loop below from ever advancing.
        raise ValueError("overlap must be smaller than chunk_size")
    text = re.sub(r"\s+", " ", text).strip()
    chunks: List[str] = []
    i = 0
    while i < len(text):
        chunk = text[i : i + chunk_size]
        chunks.append(chunk)
        i += chunk_size - overlap
    return chunks

def tokenize(s: str) -> List[str]:
    """Very simple tokenizer for BM25 demo: lowercase, then keep runs of ASCII letters and digits."""
    return re.findall(r"[a-zA-Z0-9]+", s.lower())
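To see what the chunker and tokenizer produce, here is a tiny worked example with deliberately small sizes. It restates the same sliding-window logic in compact form (a sketch for illustration, not a replacement for the functions above) so it can run on its own.

```python
import re
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Same sliding-window idea as above, written with range() for brevity."""
    text = re.sub(r"\s+", " ", text).strip()
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


def tokenize(s: str) -> List[str]:
    """Lowercase, then keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", s.lower())


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']

tokens = tokenize("It's alive! Chapter 5.")
print(tokens)  # ['it', 's', 'alive', 'chapter', '5']
```

Two details are visible here. Each chunk starts with the last character of the previous one (the overlap), and the final window can be a short tail that is already contained in the chunk before it. The tokenizer also splits on apostrophes, so "It's" becomes two tokens; that is acceptable for a demo but worth knowing when a keyword query misses.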
Reciprocal Rank Fusion (RRF) merge
def rrf_merge(list_a: List[str], list_b: List[str], k: int = 60, topn: int = 5) -> List[str]:
    """
    Reciprocal Rank Fusion for two ranked lists of IDs.
    Each list should be ordered from most relevant to least.
    """
    scores = defaultdict(float)
    for lst in [list_a, list_b]:
        for rank, id_ in enumerate(lst):
            scores[id_] += 1.0 / (k + rank + 1)
    return [x for x, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)][:topn]
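A small worked example makes the fusion concrete. Each ID receives 1 / (k + rank) from every list it appears in, with rank counted from 1, and the contributions are summed. The sketch below mirrors that arithmetic in a standalone helper (rrf_scores is a hypothetical name) and exposes the raw scores so they can be inspected.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_scores(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) over every ranked list an ID appears in (rank starts at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


bm25_ids = ["a", "b", "c"]    # keyword ranking, best first
vector_ids = ["c", "a", "d"]  # semantic ranking, best first

scores = rrf_scores([bm25_ids, vector_ids])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # ['a', 'c', 'b', 'd']
```

Here "a" (ranks 1 and 2) narrowly beats "c" (ranks 3 and 1), and both outrank "b" and "d", which appear in only one list. Because only ranks are used, the BM25 and vector scores never need to be put on a common scale.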
Ingest function — chunk, embed, upsert to Chroma, and save BM25 data
def ingest(dir_path: str, embedding_model: str = EMBED_MODEL):
    src = Path(dir_path)
    docs = read_text_files(src)
    if not docs:
        raise SystemExit(f"No .txt files found under: {src}")

    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collection = client.get_or_create_collection(COLLECTION_NAME)

    all_ids, all_metadatas, all_docs = [], [], []
    bm25_tokens, bm25_ids = [], []

    print(f"[ingest] Reading and chunking {len(docs)} file(s)...")
    for file_path, text in docs.items():
        chunks = chunk_text(text)
        for idx, ch in enumerate(chunks):
            uid = f"{file_path}::chunk-{idx}"
            all_ids.append(uid)
            all_docs.append(ch)
            all_metadatas.append({"source": file_path, "chunk": idx})
            bm25_tokens.append(tokenize(ch))
            bm25_ids.append(uid)

    print(f"[ingest] Embedding {len(all_docs)} chunks with Ollama ({embedding_model})...")
    embeddings = []
    for ch in tqdm(all_docs):
        e = _embed(ch, model=embedding_model)
        embeddings.append(e)

    print("[ingest] Upserting into Chroma...")
    # Batched upsert to avoid payload limits; existing IDs are updated when ingest is re-run.
    BATCH = 256
    for i in range(0, len(all_ids), BATCH):
        collection.upsert(
            ids=all_ids[i : i + BATCH],
            embeddings=embeddings[i : i + BATCH],
            documents=all_docs[i : i + BATCH],
            metadatas=all_metadatas[i : i + BATCH],
        )

    print("[ingest] Writing BM25 corpus tokens...")
    with open(BM25_CORPUS_PKL, "wb") as f:
        pickle.dump(bm25_tokens, f)
    with open(BM25_IDS_PKL, "wb") as f:
        pickle.dump(bm25_ids, f)

    print("[ingest] Done.")
Load BM25 index (tokens and ids)
def _load_bm25() -> Tuple[BM25Okapi, List[str]]:
    if not BM25_CORPUS_PKL.exists() or not BM25_IDS_PKL.exists():
        raise SystemExit("BM25 index files not found. Run ingest first.")
    with open(BM25_CORPUS_PKL, "rb") as f:
        tokens = pickle.load(f)
    with open(BM25_IDS_PKL, "rb") as f:
        ids = pickle.load(f)
    bm25 = BM25Okapi(tokens)
    return bm25, ids
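For intuition about what the BM25 index computes from those token lists, the following is a simplified, pure-Python sketch of Okapi-style BM25 scoring. It is an illustration under stated assumptions, not rank_bm25's implementation: it uses a non-negative IDF variant and illustrative values for k1 and b, so absolute scores will not match BM25Okapi. The function name bm25_scores is hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in a non-empty corpus against the query tokens."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        # Longer-than-average documents are penalised through this term.
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    "robert walton writes letters to his sister".split(),
    "victor studies chemistry at ingolstadt".split(),
    "the creature learns to speak by listening".split(),
]
scores = bm25_scores("walton sister".split(), corpus)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 0
```

Only the first document contains either query term, so it is the only one with a non-zero score. This is the exact-match behaviour that the keyword side contributes to the hybrid pipeline.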
ask function — run BM25 and vector searches, fuse, fetch, and query LLM
def ask(
    query: str,
    llm_model: str = LLM_MODEL,
    embedding_model: str = EMBED_MODEL,
    k_each: int = 6,
    final_k: int = 5,
):
    # BM25 side
    bm25, bm25_ids = _load_bm25()
    q_tokens = tokenize(query)
    scores = bm25.get_scores(q_tokens)
    # Get top indices and map to IDs
    bm25_top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k_each]
    bm25_top_ids = [bm25_ids[i] for i in bm25_top_idx]

    # Vector side (Chroma)
    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collection = client.get_or_create_collection(COLLECTION_NAME)
    q_emb = _embed(query, model=embedding_model)
    vec = collection.query(query_embeddings=[q_emb], n_results=k_each)
    vec_ids = list(vec["ids"][0])

    # Merge via RRF
    fused_ids = rrf_merge(bm25_top_ids, vec_ids, topn=final_k)

    # Fetch fused docs for context
    got = collection.get(ids=fused_ids)
    id_to_doc = dict(zip(got["ids"], got["documents"]))
    id_to_meta = dict(zip(got["ids"], got["metadatas"]))

    # Build context with simple headers
    sections = []
    for _id in fused_ids:
        meta = id_to_meta[_id]
        src = Path(meta["source"]).name
        sections.append(f"Source: {src} [chunk {meta['chunk']}]\n{id_to_doc[_id]}")
    context = "\n\n---\n\n".join(sections)

    # System/user prompt: force the model to use only the provided context.
    system = (
        "You are a concise assistant for a retrieval-augmented CLI.\n"
        "Answer ONLY using the provided context. If the answer is not present, say you don't know."
    )
    user = f"Context:\n\n{context}\n\nQuestion: {query}"

    resp = ollama.chat(
        model=llm_model,
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
        options={"temperature": 0.2},
    )
    answer = resp["message"]["content"].strip()

    # Print the answer and the sources
    print("\n=== Answer ===\n")
    print(answer)
    print("\n=== Sources ===")
    for _id in fused_ids:
        m = id_to_meta[_id]
        print(f"{Path(m['source']).name} (chunk {m['chunk']})")
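One detail of the BM25 side is worth noting: taking the k_each highest scores always yields k_each IDs, even when some of those chunks share no terms with the query and score 0. A possible refinement, sketched here as an assumption rather than as part of the lesson's script, drops zero-score hits before fusion so they earn no RRF credit. The helper name top_k_nonzero is hypothetical.

```python
import heapq
from typing import List, Sequence


def top_k_nonzero(scores: Sequence[float], ids: Sequence[str], k: int) -> List[str]:
    """Return up to k IDs with the highest positive scores, best first."""
    candidates = [(s, i) for i, s in enumerate(scores) if s > 0]
    # heapq.nlargest avoids sorting the whole corpus when k is small.
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [ids[i] for _, i in best]


scores = [0.0, 2.1, 0.0, 0.7, 3.4]
ids = ["c0", "c1", "c2", "c3", "c4"]
top = top_k_nonzero(scores, ids, k=4)
print(top)  # ['c4', 'c1', 'c3']
```

With this variant the keyword list can be shorter than k_each, or empty for a query with no matching terms, in which case the fused ranking is driven by the vector results alone.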
CLI entry point — argparse for ingest and ask
def main():
    p = argparse.ArgumentParser(description="Hybrid RAG CLI: BM25 + Chroma + Ollama")
    sp = p.add_subparsers(dest="cmd", required=True)

    p_ing = sp.add_parser("ingest", help="Ingest .txt files into Chroma and build BM25 index")
    p_ing.add_argument("--dir", required=True, help="Folder or .txt file")
    p_ing.add_argument("--embed-model", default=EMBED_MODEL, help="Embedding model for Ollama")

    p_ask = sp.add_parser("ask", help="Ask a question against the ingested corpus")
    p_ask.add_argument("--query", required=True, help="Query string")
    p_ask.add_argument("--llm", default=LLM_MODEL, help="LLM model for Ollama")
    p_ask.add_argument("--embed-model", default=EMBED_MODEL, help="Embedding model for Ollama")
    p_ask.add_argument("--k-each", type=int, default=6, help="Top-K to fetch from each retriever")
    p_ask.add_argument("--final-k", type=int, default=5, help="Final top-K after RRF fusion")

    args = p.parse_args()
    if args.cmd == "ingest":
        ingest(args.dir, embedding_model=args.embed_model)
    else:
        ask(
            args.query,
            llm_model=args.llm,
            embedding_model=args.embed_model,
            k_each=args.k_each,
            final_k=args.final_k,
        )

if __name__ == "__main__":
    main()
Usage examples
  • Ingest a folder (e.g., the data folder with frankenstein.txt):
python hybrid_rag.py ingest --dir data
Expected ingest output (example):
[ingest] Reading and chunking 1 file(s)...
[ingest] Embedding 673 chunks with Ollama (nomic-embed-text)... 100%
[ingest] Upserting into Chroma...
[ingest] Writing BM25 corpus tokens...
[ingest] Done.
  • Ask a question:
python hybrid_rag.py ask --query "Who is Robert Walton writing to?"
Example output:
=== Answer ===

Robert Walton is writing to his sister, Mrs. Saville, in England.

=== Sources ===
frankenstein.txt (chunk 23)
frankenstein.txt (chunk 607)
frankenstein.txt (chunk 36)
frankenstein.txt (chunk 608)
frankenstein.txt (chunk 335)
Notes about behavior and best practices
  • Why hybrid retrieval? BM25 excels at precise keyword matching (high precision for exact terms). Embedding-based retrieval (semantic search) excels at recall for conceptually related content. Blending both often yields more robust retrieval in realistic applications.
  • RRF is a simple, effective fusion strategy to combine two ranked lists into a single ordered result set.
  • The prompt instructs the LLM to “Answer ONLY using the provided context” to reduce hallucinations. In practice, tune the prompt wording, the retrieval sizes (k_each and final_k), and the chunk size to balance precision and recall.
  • Chunk size and overlap are tunable knobs. For a local demo, a chunk size of ~800 characters with a 150-character overlap worked well; adjust based on your documents and the model context window.
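When tuning chunk size and overlap, it helps to know how many chunks (and therefore how many embedding calls) a setting implies. For the sliding-window chunker in this lesson, the window advances by chunk_size minus overlap characters, so the count is the text length divided by that step, rounded up. The sketch below uses a hypothetical helper, estimated_chunks, and a hypothetical 437,000-character text chosen only for illustration.

```python
import math


def estimated_chunks(n_chars: int, chunk_size: int = 800, overlap: int = 150) -> int:
    """Number of windows a sliding-window chunker emits for n_chars of normalized text."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return math.ceil(n_chars / step)


tiny = estimated_chunks(10, chunk_size=4, overlap=1)
default = estimated_chunks(437_000)
larger = estimated_chunks(437_000, chunk_size=1200, overlap=200)
print(tiny, default, larger)  # 4 673 437
```

Larger chunks reduce the number of embeddings and give the LLM more surrounding text per hit, at the cost of coarser retrieval and more context-window usage per retrieved chunk. The length passed in should be the character count after whitespace has been collapsed, since that is what the chunker slices.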
Testing the system
  • Try open-ended queries, e.g., “When does Victor animate the creature?” — the system will either find a grounded answer in the retrieved chunks or reply “I don’t know” when the context lacks a definitive answer.
  • Add more books (e.g., Sherlock Holmes) to the data/ directory and re-run ingest to build a multi-document corpus. Hybrid retrieval will return relevant chunks across documents.
Summary
This pipeline demonstrates a practical hybrid RAG setup using:
  • Local embeddings via Ollama,
  • Persistent semantic store via ChromaDB,
  • Keyword retrieval via BM25,
  • Fusion via RRF,
  • Final response generation via an Ollama chat model instructed to use only the retrieved context.
Links and References
The repository for this lesson includes the full hybrid_rag.py script and example data/ files so you can run it locally and extend it for your own document corpora.
