Demo Hello RAG

This guide walks you through building a minimal Retrieval-Augmented Generation (RAG) pipeline end-to-end using Ollama for embedding and generation and ChromaDB for vector storage and retrieval. The goal is a compact, runnable demo that indexes a single in-memory document, retrieves it by semantic similarity, and produces a grounded answer from a local LLM.

Goals

Use Ollama for embeddings and LLM generation.
Use ChromaDB (Chroma) for persistence and similarity search.
Run a minimal RAG round trip with one in-memory document.

Key components

Component	Purpose	Example / Notes
Ollama	Local embeddings + LLM generation	Ollama
ChromaDB	Persistent vector store + retrieval	ChromaDB
Demo script	Index one doc, query, retrieve, and prompt LLM	`python app_v1.py demo`

Prerequisites / context

A data/ folder with Project Gutenberg text files (optional for this tiny demo). Example sources: Project Gutenberg.
Ollama running locally. If it is not already running, start it with ollama serve.
Two Ollama models available locally: an embedding model and an LLM model (examples below).
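
To confirm the last two prerequisites before going further, a small check along the following lines may help. It is only a sketch: it assumes Ollama's default address (http://localhost:11434) and its /api/tags endpoint, which lists locally pulled models. The helper name list_local_models is made up for this guide.

```python
import json
import urllib.request
from http.client import HTTPException


def list_local_models(base_url="http://localhost:11434", timeout=2.0):
    """Return names of locally pulled models, or None if Ollama is unreachable.

    Assumes the default Ollama address and the /api/tags endpoint.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, HTTPException):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None
    return [m.get("name", "") for m in payload.get("models", [])]
```

Calling list_local_models() returns None when nothing answers on that port, and otherwise a list of names that should include the embedding model and the LLM you intend to use.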

Install dependencies

Create a Python virtual environment and install the required packages:

python3 -m venv .venv
source .venv/bin/activate
pip install chromadb ollama

Pull the Ollama models you plan to use locally (examples shown):

# Embeddings model
ollama pull nomic-embed-text

# LLM model (Llama 3.3 in this example)
ollama pull llama3.3:latest

Note: The model names are examples; use the models available in your Ollama environment.

The Ollama Python client exposes embeddings through more than one call shape depending on its version: ollama.embeddings(...) takes prompt=, while the newer ollama.embed(...) takes input=. The demo code below includes a fallback so that it stays compatible across versions.

Warning about re-running the demo

When you re-run the demo, Chroma’s add() may raise an exception, or skip the record, if the same id already exists in the collection (the behavior depends on the Chroma version). The demo catches any error and continues, which is safe for quick iterations.
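
To make re-runs idempotent regardless of how add() treats duplicates, one alternative is Chroma's upsert(), which inserts a record or overwrites the one with the same id. The sketch below is an assumption about how the demo could be adapted, not part of app_v1.py. The helper name index_document is hypothetical, and it accepts any object with a Chroma-style upsert() method.

```python
def index_document(collection, doc_id, text, embedding, source):
    """Insert or overwrite one document so that re-runs stay idempotent.

    `collection` is expected to be a Chroma collection (or anything exposing
    a compatible upsert method); this helper is illustrative only.
    """
    collection.upsert(
        ids=[doc_id],
        documents=[text],
        embeddings=[embedding],
        metadatas=[{"source": source}],
    )
```

In the demo, this would stand in for the try/except around col.add(...), called as index_document(col, doc_id, doc_text, doc_emb, "in-memory-demo").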

Create the application: app_v1.py

Below is a compact version of the demo application. It includes:

Helper functions for embeddings and generation that handle Ollama client variations.
A Chroma collection getter using a persistent path and cosine similarity.
Two subcommands: init (quick environment checks) and demo (index one tiny document, retrieve, and answer a question).

# app_v1.py
import argparse
from pathlib import Path
import sys

import chromadb
import ollama

CHROMA_PATH = Path("./.chroma")
COLLECTION_NAME = "hello_rag"
LLM_MODEL = "llama3.3:latest"
EMBED_MODEL = "nomic-embed-text"


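def _embed(text: str) -> list[float]:
    """
    Use Ollama embeddings. ollama.embeddings() takes 'prompt=' and returns one
    vector; the newer ollama.embed() takes 'input=' and returns a list of vectors.
    Try the first and fall back to the second for compatibility.
    """
    try:
        return ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"]
    except (AttributeError, TypeError):
        # embeddings() is missing or rejects 'prompt=': use embed() instead
        return ollama.embed(model=EMBED_MODEL, input=text)["embeddings"][0]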


def _generate(prompt: str) -> str:
    out = ollama.generate(model=LLM_MODEL, prompt=prompt, stream=False)
    return out.get("response", "")


def _get_collection():
    """
    Create or get a persistent Chroma collection. Use cosine space for similarity.
    """
    client = chromadb.PersistentClient(path=str(CHROMA_PATH))
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"},
    )


def cmd_init():
    print("=== Init: environment check ===")

    # 1) Quick embedding check
    emb = _embed("hello world")
    print(f"Embedding length: {len(emb)} (OK)")

    # 2) Quick LLM generation check
    resp = _generate("Reply with: RAG ready.")
    print(f"LLM said: {resp.strip()}")

    # 3) Quick Chroma check
    col = _get_collection()
    print(f"Chroma collection: {col.name} (OK)")

    print("Init complete ✅")


def cmd_demo():
    print("=== Demo: the tiniest RAG you can run ===")

    # 0) One tiny 'document' in memory (no files yet)
    doc_text = (
        "Runbook: Payments Service\n"
        "- SLO: p95 latency 200ms; error rate <0.1%\n"
        "- Rollback: run scripts/rollback.sh\n"
        "- Escalation: page #oncall and notify SRE.\n"
    )
    doc_id = "doc-1"
    print("Indexing 1 tiny in-memory doc...")

    # 1) Store in Chroma with our own embedding
    col = _get_collection()
    doc_emb = _embed(doc_text)
    try:
        col.add(
            ids=[doc_id],
            documents=[doc_text],
            embeddings=[doc_emb],
            metadatas=[{"source": "in-memory-demo"}],
        )
    except Exception as e:
        # If you rerun, the id might already exist - safe to ignore for this demo
        print(f"(note) add() raised {e!r}; continuing")

    # 2) Ask a question
    question = "What is the p95 latency target?"
    print(f"\nQ: {question}")

    # 3) Retrieve by semantic similarity (top 1)
    q_emb = _embed(question)
    result = col.query(
        query_embeddings=[q_emb],
        n_results=1,
        include=["documents", "metadatas", "distances"],
    )

    # Extract the returned context
    ctx = result["documents"][0][0]
    dist = result["distances"][0][0]
    src = result["metadatas"][0][0].get("source", "unknown")
    print(f"\nRetrieved context (dist={dist:.3f}, source={src}):\n\n{ctx}\n---\n")

    # 4) Ground the LLM on that context (simple prompt enforcing context use)
    prompt = (
        "You are a helpful assistant. Answer the QUESTION using ONLY the CONTEXT. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"CONTEXT:\n{ctx}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )
    answer = _generate(prompt)
    # f-string avoids the stray space that print() would insert after "\n"
    print(f"Answer:\n{answer.strip()}")
    print("\nDemo complete ✅")


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Hello RAG (lesson): simple Ollama + Chroma demo."
    )
    sub = parser.add_subparsers(dest="cmd", required=True)
    sub.add_parser("init", help="Check Ollama (LLM+embeddings) and Chroma reachability")
    sub.add_parser("demo", help="Index one doc and answer one question")

    args = parser.parse_args(argv)

    if args.cmd == "init":
        cmd_init()
    elif args.cmd == "demo":
        cmd_demo()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nInterrupted.")
        sys.exit(1)

Run the demo

Run the init check:

python app_v1.py init

Expected output (example):

=== Init: environment check ===
Embedding length: 768 (OK)
LLM said: RAG ready.
Chroma collection: hello_rag (OK)
Init complete ✅

Run the tiny RAG demo:

python app_v1.py demo

Expected output (example):

=== Demo: the tiniest RAG you can run ===
Indexing 1 tiny in-memory doc...

Q: What is the p95 latency target?

Retrieved context (dist=0.369, source=in-memory-demo):

Runbook: Payments Service
- SLO: p95 latency 200ms; error rate <0.1%
- Rollback: run scripts/rollback.sh
- Escalation: page #oncall and notify SRE.
---
Answer:
The p95 latency target is 200ms.

Demo complete ✅
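
The dist value in the retrieval line is a cosine distance, because the collection was created with "hnsw:space": "cosine". Chroma reports this as 1 minus the cosine similarity, so 0 means the two vectors point the same way and larger values mean the texts are less similar. A minimal sketch of that computation in plain Python, with made-up three-dimensional vectors standing in for real embeddings:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors, not real embeddings.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 0
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # close to 1
print(f"{abs(same_direction):.3f} {orthogonal:.3f}")
```

The exact distance you see will differ with the embedding model and the wording of the question.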

What this demonstrates

A local Ollama instance can produce embeddings and generate answers grounded in retrieved context.
ChromaDB persists embeddings and returns semantically similar documents.
Minimal RAG flow: query → embed → retrieve → LLM (grounded on retrieved context) → answer.
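
The flow in the last item can be written as one small function. This is a sketch rather than the code from app_v1.py: rag_answer and its three callable parameters are hypothetical names, chosen so that the embedding, retrieval, and generation steps can be swapped or stubbed out independently.

```python
from typing import Callable, Sequence


def rag_answer(
    question: str,
    embed: Callable[[str], Sequence[float]],
    retrieve: Callable[[Sequence[float]], str],
    generate: Callable[[str], str],
) -> str:
    """query -> embed -> retrieve -> grounded prompt -> answer."""
    query_vector = embed(question)        # 1) embed the question
    context = retrieve(query_vector)      # 2) fetch the closest document
    prompt = (                            # 3) ground the LLM on that context
        "Answer the QUESTION using ONLY the CONTEXT. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )
    return generate(prompt).strip()       # 4) generate the answer
```

With the helpers from app_v1.py, embed would be _embed, generate would be _generate, and retrieve could wrap col.query(...) and return result["documents"][0][0].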

Next steps

Chunk and ingest larger documents from the data/ folder with overlap-aware chunking.
Improve prompt engineering and retrieval strategies (e.g., hybrid search, reranking).
Evaluate retrieval accuracy and build hallucination mitigation strategies.
Consider model selection and latency trade-offs for production deployments.
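
As a starting point for the first item, overlap-aware chunking can be sketched as a sliding window over characters. The function name chunk_text and the size and overlap defaults below are illustrative assumptions; real pipelines often split on sentences or tokens instead.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then get its own id and be embedded and added to the collection the same way the demo handles doc-1.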
