Minimal demo ingesting local .md and .txt files into Chroma with Ollama embeddings and LLM, performing chunking, semantic search, prompt building, citation, and simple CLI commands.
This guide demonstrates a minimal, practical retrieval-augmented generation (RAG) pipeline using Ollama for embeddings and generation and Chroma for vector storage and search. We move from an in-memory demo to ingesting files from disk, showing a simple end-to-end flow:
Read .md/.txt files from a data/ folder.
Chunk documents into paragraph-style pieces.
Embed chunks with Ollama and persist embeddings in Chroma.
At query time embed the user question, retrieve top-k similar chunks from Chroma, build a prompt that forces the model to answer only from the returned context, and include citations.
This example is intentionally simple (no batching, no BM25 or fancy optimizations) to keep it easy to extend.
Below we walk through the important pieces of app_v2.py. The full script is included in the sections below.
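The first step, reading .md/.txt files from a folder, relies on a helper called _iter_files that is used later but not shown in the excerpts. The following is only a sketch of what such a helper could look like. The name iter_text_files, the recursive search, the case-insensitive suffix match, and the sorted order are all assumptions, not the script's actual behavior.

```python
from pathlib import Path
from typing import Iterator

# Assumption: suffixes are matched case-insensitively.
ALLOWED_SUFFIXES = {".md", ".txt"}


def iter_text_files(root: Path) -> Iterator[Path]:
    """Yield .md/.txt files under root, recursively, in sorted order.

    Hypothetical stand-in for the script's _iter_files helper.
    """
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in ALLOWED_SUFFIXES:
            yield path
```

Sorting is optional here; it just makes the ingestion order reproducible between runs.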
These helpers handle embeddings and text generation. The embedding helper supports both prompt= and input= parameter styles used by different Ollama client versions.
def _embed(text: str) -> List[float]:
    """Return an embedding vector for the given text. Support either prompt= or input=."""
    try:
        return ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"]
    except TypeError:
        return ollama.embeddings(model=EMBED_MODEL, input=text)["embedding"]


def _generate(prompt: str) -> str:
    """Generate text for a prompt using the chosen LLM model."""
    out = ollama.generate(model=LLM_MODEL, prompt=prompt, stream=False)
    return out.get("response", "")
The _embed helper attempts both parameter styles to maintain compatibility across Ollama client versions. If you control the client, pick one style and simplify the helper.
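The fallback pattern can be tried without a running Ollama server by passing in stand-in functions. In this sketch, embed_with_fallback, prompt_style, input_style, and the model name "demo-model" are hypothetical. The two stand-ins only imitate clients that accept different keyword names and are not real Ollama functions.

```python
from typing import Callable, List


def embed_with_fallback(embed_fn: Callable[..., dict], text: str) -> List[float]:
    """Call embed_fn with prompt=, retrying with input= if that raises TypeError."""
    try:
        return embed_fn(model="demo-model", prompt=text)["embedding"]
    except TypeError:
        return embed_fn(model="demo-model", input=text)["embedding"]


# Stand-ins for two client versions (hypothetical, not real Ollama calls).
def prompt_style(model: str, prompt: str) -> dict:
    return {"embedding": [float(len(prompt))]}


def input_style(model: str, input: str) -> dict:
    return {"embedding": [float(len(input))]}


print(embed_with_fallback(prompt_style, "hello"))  # [5.0]
print(embed_with_fallback(input_style, "hello"))   # [5.0]
```

One caveat of this pattern: any TypeError raised inside the first call, not only an unexpected-keyword error, triggers the retry, which can hide unrelated bugs.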
The chunking strategy below splits documents into paragraphs and packs them greedily into chunks with optional overlap. Paragraph-based chunks are small, document-like, and work well for many short-doc corpora.
def _split_paragraphs(text: str) -> List[str]:
    parts = [p.strip() for p in text.replace("\r\n", "\n").split("\n\n")]
    return [p for p in parts if p]


def make_chunks(text: str, max_chars: int = 800, overlap: int = 150) -> List[str]:
    """Greedy paragraph packer with overlap between chunks."""
    paras = _split_paragraphs(text)
    chunks, buf, total = [], [], 0
    for p in paras:
        # +2 approximates the newline chars we will add when joining
        if buf and total + len(p) + 2 > max_chars:
            chunk = "\n\n".join(buf)
            chunks.append(chunk)
            tail = chunk[-overlap:] if overlap > 0 else ""
            buf = [tail] if tail else []
            total = len(tail)
        buf.append(p)
        total += len(p) + 2
    if buf:
        chunks.append("\n\n".join(buf))
    return chunks
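Tune max_chars and overlap for your documents. Paragraph packing keeps chunks coherent and readable by the model. Two details of this packer are worth knowing: a paragraph longer than max_chars is never split, so it ends up whole in a chunk that exceeds the limit, and the overlap tail is cut by character count, so it can begin mid-word.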
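To get a feel for how max_chars and overlap interact, it can help to run the packer on a tiny synthetic document. The sketch below repeats the packing logic so that it runs on its own. The three 40-character paragraphs and the parameter values are made up for illustration.

```python
from typing import List


def make_chunks(text: str, max_chars: int = 800, overlap: int = 150) -> List[str]:
    """Same greedy packer as in the walkthrough, repeated so this runs alone."""
    paras = [p.strip() for p in text.replace("\r\n", "\n").split("\n\n")]
    paras = [p for p in paras if p]
    chunks, buf, total = [], [], 0
    for p in paras:
        if buf and total + len(p) + 2 > max_chars:
            chunk = "\n\n".join(buf)
            chunks.append(chunk)
            tail = chunk[-overlap:] if overlap > 0 else ""
            buf = [tail] if tail else []
            total = len(tail)
        buf.append(p)
        total += len(p) + 2
    if buf:
        chunks.append("\n\n".join(buf))
    return chunks


# Three 40-character paragraphs separated by blank lines.
doc = "\n\n".join(letter * 40 for letter in "abc")

for max_chars, overlap in [(800, 150), (90, 0), (90, 10)]:
    chunks = make_chunks(doc, max_chars=max_chars, overlap=overlap)
    print(max_chars, overlap, [len(c) for c in chunks])
# 800 150 [124]
# 90 0 [82, 40]
# 90 10 [82, 52]
```

With the defaults everything fits into one chunk. With max_chars=90 the third paragraph spills into a second chunk, and with overlap=10 that second chunk starts with the last 10 characters of the first, joined as if they were a separate paragraph.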
Format the retrieved chunks into a context block and construct a deterministic prompt that instructs the LLM to answer ONLY from that context and to cite the sources.
def _build_prompt(question: str, hits: list[dict]) -> Tuple[str, list[str]]:
    blocks = []
    citations = []
    for i, h in enumerate(hits, 1):
        blocks.append(f"Source {i}:\n{h['text']}\n")
        src = f"[{i}] {h['meta'].get('source', 'unknown')}#chunk-{h['meta'].get('chunk', 0)}"
        citations.append(src)
    ctx = "\n\n".join(blocks)
    prompt = (
        "You are a helpful assistant for DevOps teams. "
        "Answer the QUESTION using ONLY the CONTEXT. "
        "If the answer is not in the context, say you don't know. "
        "Cite sources in the form [1], [2], etc.\n\n"
        f"CONTEXT:\n{ctx}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )
    return prompt, citations
Returning the citation strings separately makes CLI printing and logging easier.
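To make the shape of the hits and the resulting citation strings concrete, here is a small standalone sketch. The function name format_citations and the sample hits are illustrative. The formatting expression is the one used inside _build_prompt.

```python
from typing import Dict, List


def format_citations(hits: List[Dict]) -> List[str]:
    """Build '[n] source#chunk-i' strings, mirroring the loop in _build_prompt."""
    return [
        f"[{i}] {h['meta'].get('source', 'unknown')}#chunk-{h['meta'].get('chunk', 0)}"
        for i, h in enumerate(hits, 1)
    ]


# Illustrative hits; the second one has no metadata, to show the fallbacks.
hits = [
    {"text": "Run scripts/rollback.sh to roll back.",
     "meta": {"source": "data/oncall.md", "chunk": 0}},
    {"text": "A chunk stored without metadata.", "meta": {}},
]

for line in format_citations(hits):
    print(line)
# [1] data/oncall.md#chunk-0
# [2] unknown#chunk-0
```

Because the numbering comes from the position in hits, the [1], [2] markers in the model's answer line up with this list only if the same hits, in the same order, were used to build the prompt.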
This command ingests all .md/.txt files under a directory:
Read files from disk
Chunk documents
Create deterministic chunk IDs based on file path, chunk index, and chunk content
Embed chunks and add to Chroma, skipping duplicates
def cmd_ingest(dir_path: Path):
    col = _get_collection()
    to_add_ids = []
    to_add_docs = []
    to_add_metas = []
    to_add_embs = []
    total_chunks = 0
    for p in _iter_files(dir_path):
        text = p.read_text(encoding="utf-8")
        chunks = make_chunks(text)
        for i, chunk in enumerate(chunks):
            # count every chunk seen, including ones skipped as duplicates
            total_chunks += 1
            # deterministic id based on file path + chunk index + chunk text
            digest = hashlib.sha256(f"{p}:{i}:{chunk}".encode("utf-8")).hexdigest()
            chunk_id = f"{p}#chunk-{i}-{digest[:8]}"
            meta = {"source": str(p), "chunk": i}
            # Skip if this id already exists
            try:
                existing = col.get(ids=[chunk_id])
                if existing and existing.get("ids"):
                    # id exists, skip
                    continue
            except Exception:
                # Some Chroma clients may raise if not found; ignore and proceed
                pass
            emb = _embed(chunk)
            to_add_ids.append(chunk_id)
            to_add_docs.append(chunk)
            to_add_metas.append(meta)
            to_add_embs.append(emb)
    if to_add_docs:
        col.add(
            ids=to_add_ids,
            documents=to_add_docs,
            metadatas=to_add_metas,
            embeddings=to_add_embs,
        )
    # newly stored chunks out of all chunks seen in this run
    print(f"Ingestion complete. {len(to_add_ids)}/{total_chunks} chunks stored.")
Deterministic chunk IDs allow safe re-ingestion: the script skips chunks already present in Chroma. If your Chroma client supports efficient upserts, you can swap the existence check for an upsert workflow.
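The ID scheme can be checked in isolation with nothing but hashlib. In this sketch the function name chunk_id and the sample strings are illustrative. The hashing and formatting follow the expressions used in cmd_ingest.

```python
import hashlib


def chunk_id(path: str, index: int, chunk: str) -> str:
    """Path, chunk index, and the first 8 hex chars of a SHA-256 digest."""
    digest = hashlib.sha256(f"{path}:{index}:{chunk}".encode("utf-8")).hexdigest()
    return f"{path}#chunk-{index}-{digest[:8]}"


first = chunk_id("data/oncall.md", 0, "Run scripts/rollback.sh")
again = chunk_id("data/oncall.md", 0, "Run scripts/rollback.sh")
edited = chunk_id("data/oncall.md", 0, "Run scripts/rollback.sh --force")

print(first == again)   # same inputs give the same ID, so re-ingest skips it
print(first == edited)  # edited text is expected to give a different ID
```

One consequence worth keeping in mind: when a file is edited, its chunks get new IDs and are added, but the chunks stored under the old IDs are not removed by this scheme. Resetting the index before re-ingesting is the simplest way to clear them.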
At query time we run semantic search, build the prompt with the returned chunks, call the LLM, and print the model’s answer along with a list of cited sources.
def cmd_ask(question: str, k: int):
    hits = _semantic_search(question, k=k)
    if not hits:
        print("No results found. Did you run ingest?")
        return
    prompt, citations = _build_prompt(question, hits)
    answer = _generate(prompt)
    print("\n=== Answer ===")
    print(answer.strip())
    print("\n=== Sources ===")
    for s in citations:
        print(s)
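The _semantic_search helper is not shown in these excerpts. One plausible shape, offered only as a sketch, is to embed the question, call the collection's query method, and flatten the result into the hit dictionaries that _build_prompt expects. The function below covers the flattening step. The name hits_from_query_result and the sample values are hypothetical, and the input imitates the nested-list layout of a Chroma query result for a single query.

```python
from typing import Any, Dict, List


def hits_from_query_result(res: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a single-query, Chroma-style result into a list of hit dicts."""
    docs = (res.get("documents") or [[]])[0]
    metas = (res.get("metadatas") or [[]])[0]
    dists = (res.get("distances") or [[]])[0]
    hits = []
    for i, doc in enumerate(docs):
        hits.append({
            "text": doc,
            # metadata entries can be missing or None; fall back to {}
            "meta": (metas[i] if i < len(metas) else None) or {},
            "distance": dists[i] if i < len(dists) else None,
        })
    return hits


# In the real script the result would presumably come from something like
# col.query(query_embeddings=[_embed(question)], n_results=k).
# The values below are made up for illustration.
sample = {
    "ids": [["data/oncall.md#chunk-0-deadbeef"]],
    "documents": [["To roll back, run scripts/rollback.sh."]],
    "metadatas": [[{"source": "data/oncall.md", "chunk": 0}]],
    "distances": [[0.12]],
}

for hit in hits_from_query_result(sample):
    print(hit["meta"]["source"], hit["distance"])
# data/oncall.md 0.12
```

An empty collection yields empty inner lists, so the function returns an empty list and cmd_ask prints its "No results found" message.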
Utilities to inspect and reset the local Chroma index:
def cmd_stats():
    col = _get_collection()
    try:
        count = col.count()
    except Exception:
        # Some Chroma clients may not support count(); fall back to unknown.
        count = "unknown"
    print(f"Chunks in collection: {count}")


def cmd_reset():
    if CHROMA_PATH.exists():
        shutil.rmtree(CHROMA_PATH)
        print(f"Removed {CHROMA_PATH} (index reset).")
    else:
        print("Nothing to reset.")

(.venv) jeremy@MACSTUDIO BookSearch % python app_v2.py stats
Chunks in collection: 2
Ask a question using top-k retrieval:
(.venv) jeremy@MACSTUDIO BookSearch % python app_v2.py ask "How do I roll back the service?"

=== Answer ===
To roll back the service, you should run scripts/rollback.sh.

=== Sources ===
[1] data/oncall.md#chunk-0
Reset the index, then ask (shows the behavior when there is no index):
(.venv) jeremy@MACSTUDIO BookSearch % python app_v2.py reset
Removed ./.chroma (index reset).
(.venv) jeremy@MACSTUDIO BookSearch % python app_v2.py ask "How do I roll back the service?"
No results found. Did you run ingest?
Re-run ingest to rebuild the index and queries will work again.
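The command-line wiring behind these commands is not shown in the excerpts. A minimal argparse sketch is given below. The subcommand names match the transcripts above, but the optional dir argument, its "data" default, and the -k flag with a default of 3 are assumptions rather than the script's actual interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with ingest, ask, stats, and reset subcommands."""
    parser = argparse.ArgumentParser(prog="app_v2.py")
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="Ingest .md/.txt files")
    # Assumption: the folder is optional and defaults to data/
    ingest.add_argument("dir", nargs="?", default="data")

    ask = sub.add_parser("ask", help="Ask a question")
    ask.add_argument("question")
    # Assumption: top-k is exposed as -k with a default of 3
    ask.add_argument("-k", type=int, default=3)

    sub.add_parser("stats", help="Show the chunk count")
    sub.add_parser("reset", help="Delete the local index")
    return parser


args = build_parser().parse_args(["ask", "How do I roll back the service?"])
print(args.command, args.k, args.question)
# ask 3 How do I roll back the service?
```

A main function would then dispatch on args.command, for example calling cmd_ask(args.question, args.k) for ask and cmd_ingest(Path(args.dir)) for ingest.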
Replaced the tiny in-memory demo with a simple file-based ingestion pipeline.
Implemented deterministic chunk IDs so re-ingests skip duplicates.
Used top-k retrieval from Chroma to build a deterministic context for the LLM.
Instructed the LLM to answer only from the provided context and to include citations.
Kept the code intentionally minimal so you can extend it for batching, semantic chunking, richer metadata, or different embeddings/LLM providers.
This demo is for local testing and small datasets. For production, consider secure deployment of Ollama/Chroma, robust error handling, rate limits, batching, and privacy/PII considerations for ingested content.