Guide to building a minimal RAG pipeline using Ollama for embeddings and generation and ChromaDB for vector storage, indexing one in-memory document and producing grounded answers with a local LLM.
This guide walks you through building a minimal Retrieval-Augmented Generation (RAG) pipeline end-to-end using Ollama for embedding and generation and ChromaDB for vector storage and retrieval. The goal is a compact, runnable demo that indexes a single in-memory document, retrieves it by semantic similarity, and produces a grounded answer from a local LLM.
Goals
Use Ollama for embeddings and LLM generation.
Use ChromaDB (Chroma) for persistence and similarity search.
Run a minimal RAG round trip with one in-memory document.
Pull the Ollama models you plan to use locally (examples shown):
# Embeddings model
ollama pull nomic-embed-text
# LLM model (Llama 3.3 in this example)
ollama pull llama3.3:latest
Note: The model names are examples; use the models available in your Ollama environment.
Different versions of the Ollama Python client or model endpoints may expect either prompt= or input= when calling ollama.embeddings(...). The demo code below tries both to remain compatible across versions.
Warning about re-running the demo
When you re-run the demo, Chroma’s add() may raise an exception if the same id already exists in the collection. The demo handles this by catching the error, printing a note, and continuing. That is acceptable for quick iterations because the document stored on the first run is still in the collection.
Create the application: app_v1.py
Below is a compact and corrected version of the demo application. It includes:
Helper functions for embeddings and generation that handle Ollama client variations.
A Chroma collection getter using a persistent path and cosine similarity.
Two subcommands: init (quick environment checks) and demo (index one tiny document, retrieve, and answer a question).
# app_v1.py
import argparse
from pathlib import Path
import sys

import chromadb
import ollama

CHROMA_PATH = Path("./.chroma")
COLLECTION_NAME = "hello_rag"
LLM_MODEL = "llama3.3:latest"
EMBED_MODEL = "nomic-embed-text"


def _embed(text: str) -> list[float]:
    """
    Use Ollama embeddings. Some versions expect 'prompt=' and others 'input='.
    Try both for compatibility.
    """
    try:
        return ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"]
    except TypeError:
        return ollama.embeddings(model=EMBED_MODEL, input=text)["embedding"]


def _generate(prompt: str) -> str:
    out = ollama.generate(model=LLM_MODEL, prompt=prompt, stream=False)
    return out.get("response", "")


def _get_collection():
    """
    Create or get a persistent Chroma collection.
    Use cosine space for similarity.
    """
    client = chromadb.PersistentClient(path=str(CHROMA_PATH))
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"},
    )


def cmd_init():
    print("=== Init: environment check ===")

    # 1) Quick embedding check
    emb = _embed("hello world")
    print(f"Embedding length: {len(emb)} (OK)")

    # 2) Quick LLM generation check
    resp = _generate("Reply with: RAG ready.")
    print(f"LLM said: {resp.strip()}")

    # 3) Quick Chroma check
    col = _get_collection()
    print(f"Chroma collection: {col.name} (OK)")

    print("Init complete ✅")


def cmd_demo():
    print("=== Demo: the tiniest RAG you can run ===")

    # 0) One tiny 'document' in memory (no files yet)
    doc_text = (
        "Runbook: Payments Service\n"
        "- SLO: p95 latency 200ms; error rate <0.1%\n"
        "- Rollback: run scripts/rollback.sh\n"
        "- Escalation: page #oncall and notify SRE.\n"
    )
    doc_id = "doc-1"
    print("Indexing 1 tiny in-memory doc...")

    # 1) Store in Chroma with our own embedding
    col = _get_collection()
    doc_emb = _embed(doc_text)
    try:
        col.add(
            ids=[doc_id],
            documents=[doc_text],
            embeddings=[doc_emb],
            metadatas=[{"source": "in-memory-demo"}],
        )
    except Exception as e:
        # If you rerun, the id might already exist - safe to ignore for this demo
        print(f"(note) add() raised {e!r}; continuing")

    # 2) Ask a question
    question = "What is the p95 latency target?"
    print(f"\nQ: {question}")

    # 3) Retrieve by semantic similarity (top 1)
    q_emb = _embed(question)
    result = col.query(
        query_embeddings=[q_emb],
        n_results=1,
        include=["documents", "metadatas", "distances"],
    )

    # Extract the returned context
    ctx = result["documents"][0][0]
    dist = result["distances"][0][0]
    src = result["metadatas"][0][0].get("source", "unknown")
    print(f"\nRetrieved context (dist={dist:.3f}, source={src}):\n\n{ctx}\n---\n")

    # 4) Ground the LLM on that context (simple prompt enforcing context use)
    prompt = (
        "You are a helpful assistant. Answer the QUESTION using ONLY the CONTEXT. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"CONTEXT:\n{ctx}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )
    answer = _generate(prompt)
    print("Answer:\n", answer.strip())
    print("\nDemo complete ✅")


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Hello RAG (lesson): simple Ollama + Chroma demo."
    )
    sub = parser.add_subparsers(dest="cmd", required=True)
    sub.add_parser("init", help="Check Ollama (LLM+embeddings) and Chroma reachability")
    sub.add_parser("demo", help="Index one doc and answer one question")
    args = parser.parse_args(argv)

    if args.cmd == "init":
        cmd_init()
    elif args.cmd == "demo":
        cmd_demo()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nInterrupted.")
        sys.exit(1)
=== Demo: the tiniest RAG you can run ===
Indexing 1 tiny in-memory doc...

Q: What is the p95 latency target?

Retrieved context (dist=0.369, source=in-memory-demo):

Runbook: Payments Service
- SLO: p95 latency 200ms; error rate <0.1%
- Rollback: run scripts/rollback.sh
- Escalation: page #oncall and notify SRE.

---

Answer:
The p95 latency target is 200ms.

Demo complete ✅
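The dist value in the output above is a cosine distance, because the collection is created with hnsw:space set to cosine. Chroma reports cosine distance as 1 minus cosine similarity, so smaller values mean a closer match. Here is a minimal pure-Python sketch of that calculation. The function name cosine_distance and the toy vectors are illustrative and not part of app_v1.py.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(f"same direction: {same_direction:.3f}")
print(f"orthogonal:     {orthogonal:.3f}")
```

Vectors pointing the same way give a distance near 0, and orthogonal vectors give a distance of 1. By the same formula, the 0.369 in the sample output corresponds to a cosine similarity of about 0.631 between the question and the document.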
What this demonstrates
Ollama (local) can produce embeddings and generate text for grounding answers.
ChromaDB persists embeddings and returns semantically similar documents.
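The grounding step can be factored into a small helper so the same prompt template still works if more than one chunk is retrieved. This is a sketch rather than part of app_v1.py: the name build_grounded_prompt and the numbered [1], [2] labels are assumptions. The input is shaped like result["documents"][0] from a Chroma query, here with the demo runbook split into two chunks for illustration.

```python
def build_grounded_prompt(contexts: list[str], question: str) -> str:
    """Build a context-only prompt from one or more retrieved chunks."""
    if not contexts:
        raise ValueError("at least one context chunk is required")
    # Number each chunk so the individual pieces stay distinguishable.
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(contexts, start=1)
    )
    return (
        "You are a helpful assistant. Answer the QUESTION using ONLY the CONTEXT. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"CONTEXT:\n{numbered}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )


# Example input shaped like result["documents"][0] for a query with n_results=2.
retrieved = [
    "Runbook: Payments Service\n- SLO: p95 latency 200ms; error rate <0.1%\n",
    "- Rollback: run scripts/rollback.sh\n- Escalation: page #oncall and notify SRE.\n",
]
prompt = build_grounded_prompt(retrieved, "What is the p95 latency target?")
print(prompt)
```

The wording of the instruction is the same as in the demo. Only the CONTEXT section changes, so the helper reduces to the demo's behavior, plus a [1] label, when a single chunk is passed.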