Guide to building a production-ready semantic search engine using embeddings, document chunking, ChromaDB and LangChain, with examples for embedding creation, indexing, and similarity search.
We’re going to build a semantic search engine step-by-step. The story begins with TechDocs, Inc., where users search through documentation 10,000 times a day. More than half of those searches fail because traditional keyword search can’t connect related phrases like “reset password” and “password recovery.” Our mission is to fix that by building a search system that understands meaning, not just words.
Embeddings are the backbone of semantic search. Rather than working with individual keywords, embeddings convert text into dense numerical vectors where semantically similar texts are close in vector space. That enables the search engine to connect queries and documents that use different words but share meaning.
This concise example demonstrates loading a sentence-transformers model, encoding a query and several documents, computing cosine similarity, and printing results. Normalizing embeddings (normalize_embeddings=True) can improve cosine-similarity stability.
# task_1_understanding_embeddings.py
from sentence_transformers import SentenceTransformer, util
import os

def main():
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query = "forgot my password"
    docs = [
        "Password recovery: Use the 'Reset Password' link on login page",
        "Vacation policy: Request time off 2 weeks in advance",
        "Account security: Enable two-factor authentication",
        "Login help: Contact IT if you cannot access your account"
    ]

    # Encode query and documents
    query_emb = model.encode(query, convert_to_tensor=True)
    doc_embs = model.encode(docs, convert_to_tensor=True)

    # Compute cosine similarity scores
    scores = util.cos_sim(query_emb, doc_embs)[0]

    print(f"Query: '{query}'\n")
    print("Results (score > 0.3 = relevant):")
    for doc, score in zip(docs, scores):
        marker = "✅" if score.item() > 0.3 else "  "
        print(f"{marker} [{score.item():.2f}] {doc}")

    print("\n🔎 Notice: Found 'Password recovery' and 'Login help'")
    print("   Even though the query didn't contain those exact words!")

    os.makedirs("/root/markers", exist_ok=True)
    open("/root/markers/task1_embeddings_complete.txt", "w").write("DONE")

if __name__ == "__main__":
    main()
Example output (abridged):
Query: 'forgot my password'

Results (score > 0.3 = relevant):
✅ [0.56] Password recovery: Use the 'Reset Password' link on login page
   [0.07] Vacation policy: Request time off 2 weeks in advance
✅ [0.31] Account security: Enable two-factor authentication
✅ [0.60] Login help: Contact IT if you cannot access your account

🔎 Notice: Found 'Password recovery' and 'Login help'
   Even though the query didn't contain those exact words!
Large documents should be split into smaller chunks for embedding for two reasons:
Embedding models have context limits; extremely long texts can be truncated or produce noisy embeddings.
Smaller, focused chunks preserve local context and improve retrieval accuracy.
However, naive splitting may cut sentences and lose meaning. Use overlapping chunks to preserve sentence continuity at boundaries. A common starting point is ~500 characters per chunk with ~100-character overlap; tune this for your documents and model.
Example using LangChain’s RecursiveCharacterTextSplitter:
# task_2_chunking.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(text, chunk_size=500, chunk_overlap=100):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""]
    )
    return splitter.split_text(text)

if __name__ == "__main__":
    long_text = "..."  # replace with actual document text
    chunks = chunk_document(long_text)
    for i, chunk in enumerate(chunks, start=1):
        print(f"📄 Chunk {i} ({len(chunk)} chars):\n{chunk[:200]}...\n")
    print("✅ Task 2 completed! Document chunking mastered.")
Why this splitter works well:
• Preserves sentence boundaries
• Maintains context with overlap
• Optimizes chunks for embedding models
• Can improve retrieval accuracy significantly
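The overlap guarantee can be sanity-checked with a minimal sliding-window chunker (a simplified stand-in for illustration only; RecursiveCharacterTextSplitter additionally respects separator boundaries):

```python
# Minimal sliding-window chunker: each chunk starts `chunk_size - overlap`
# characters after the previous one, so consecutive chunks share `overlap` chars.
def sliding_chunks(text, chunk_size=500, overlap=100):
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(chr(65 + i % 26) for i in range(1200))  # synthetic document
chunks = sliding_chunks(text)

# Every pair of consecutive chunks shares exactly `overlap` boundary characters,
# so a sentence cut at one chunk's end reappears intact at the next chunk's start.
for left, right in zip(chunks, chunks[1:]):
    assert left[-100:] == right[:100]
print(f"{len(chunks)} chunks, boundaries overlap by 100 chars")
```

This is why overlap costs a little extra storage (each boundary region is embedded twice) in exchange for never losing a sentence to a chunk boundary.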
Embeddings are vectors; we need a vector store to index and search them efficiently. ChromaDB is a production-ready vector database that supports fast similarity search and metadata filtering. LangChain integrates with ChromaDB to simplify storing and querying embeddings.

How vector search works (high-level):
• Embed each document chunk and store the vectors in an index.
• At query time, embed the query with the same model.
• Find the stored vectors nearest to the query vector (e.g., by cosine similarity) and return their documents, ranked by score.
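Before reaching for a database, the mechanics can be sketched in plain Python with toy 3-dimensional vectors (the embedding values below are made up for illustration, not real model output):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Toy "index": document -> embedding (hypothetical values)
index = {
    "Password recovery guide": [0.9, 0.1, 0.0],
    "Vacation policy": [0.0, 0.2, 0.9],
}
query_emb = [0.8, 0.2, 0.1]  # pretend this came from encoding "forgot my password"

# Rank documents by similarity to the query vector: this linear scan is exactly
# what a vector store does, just with smarter index structures at scale.
ranked = sorted(index.items(), key=lambda kv: cosine(query_emb, kv[1]), reverse=True)
print(ranked[0][0])  # → Password recovery guide
```

A vector database replaces the linear scan with approximate nearest-neighbor indexes so the same ranking stays fast over millions of chunks.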
Create a Chroma vector store and index documents (Task 3)
This example shows how to initialize HuggingFace embeddings via LangChain, create Document objects, and build a Chroma vector store in a persistent temporary directory.
# task_3_build_vectorstore.py
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document
import tempfile

def build_vectorstore(doc_texts):
    # Initialize embeddings (HuggingFace wrapper)
    embeddings = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True}
    )

    # Create Document objects (optional metadata can be added)
    documents = [Document(page_content=text) for text in doc_texts]

    # Persist vector store to a directory (use mkdtemp so the directory remains
    # available after this function returns; TemporaryDirectory would be removed on exit)
    temp_dir = tempfile.mkdtemp()
    vectorstore = Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        persist_directory=temp_dir
    )

    # Ensure data is persisted to disk if the vectorstore supports it
    try:
        vectorstore.persist()
    except Exception:
        # Some vectorstore wrappers persist automatically; ignore if not applicable
        pass

    print(f"📚 Loaded {len(documents)} documents into vector store at {temp_dir}...")
    return vectorstore

if __name__ == "__main__":
    sample_docs = [
        "Remote work policy allows employees to work from home up to 3 days per week with manager approval.",
        "Work hours are flexible but core hours 10 AM to 3 PM are required.",
        "Health insurance covers employee and dependents with company paying 80% of premiums.",
    ]
    vs = build_vectorstore(sample_docs)
    print("✅ Vector store ready!")
Now implement the search pipeline: convert the user query to an embedding, query the ChromaDB vector store for the top-K similar chunks, optionally filter by a score threshold, and return the best results.
This example assumes you have a built vectorstore (as in Task 3). It shows how to run a similarity search, obtain scores, apply a threshold, and print filtered results.
# task_4_semantic_search.py
import tempfile
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document

def build_search_engine(knowledge_base):
    # Initialize embeddings and documents
    embeddings = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True}
    )
    documents = [Document(page_content=text) for text in knowledge_base]

    # Create vector store in a temporary directory (searching is performed
    # while the directory exists)
    with tempfile.TemporaryDirectory() as temp_dir:
        vectorstore = Chroma.from_documents(
            documents=documents,
            embedding=embeddings,
            persist_directory=temp_dir
        )
        print("✅ Vector store ready!\n")

        # Search configuration
        search_query = "work from home policy"
        k = 3
        score_threshold = 0.5

        print(f"● Searching for: '{search_query}'")
        print(f"Returning top {k} results")
        print("-" * 28)

        # Basic similarity search (returns Documents)
        results = vectorstore.similarity_search(search_query, k=k)
        print("\n📚 Search Results:\n")
        for i, doc in enumerate(results, 1):
            print(f"{i}. {doc.page_content}\n")

        # Search with scores. Note: Chroma's similarity_search_with_score returns
        # a distance (lower = more similar), so thresholding it as a similarity is
        # a bug. similarity_search_with_relevance_scores returns a 0-1 relevance
        # score (higher = more similar), which is what the threshold below assumes.
        results_with_scores = vectorstore.similarity_search_with_relevance_scores(
            search_query, k=5
        )
        relevant_results = [
            (doc, score) for doc, score in results_with_scores
            if score >= score_threshold
        ]

        print(f"\n🔎 Filtered Search (threshold > {score_threshold}):")
        print("-" * 40)
        if relevant_results:
            for i, (doc, score) in enumerate(relevant_results, 1):
                print(f"{i}. [{score:.2f}] {doc.page_content}\n")
        else:
            print("No results above the score threshold.")

if __name__ == "__main__":
    knowledge_base = [
        "Remote work policy allows employees to work from home up to 3 days per week with manager approval.",
        "Work hours are flexible but core hours 10 AM to 3 PM are required.",
        "Health insurance covers employee and dependents with company paying 80% of premiums.",
        # ... additional docs ...
    ]
    build_search_engine(knowledge_base)
Simulated example run summary:
Task 4: Semantic Search Implementation
=====================================
✅ Loading 12 documents into vector store...
✔ Vector store ready!

● Searching for: 'work from home policy'
Returning top 3 results
----------------------------

📚 Search Results:

1. Remote work policy allows employees to work from home up to 3 days per week with manager approval.
2. Work hours are flexible but core hours 10 AM to 3 PM are required.
3. Health insurance covers employee and dependents with company paying 80% of premiums.

🔎 Filtered Search (threshold > 0.5):
Set up an environment for embeddings and vector search.
Learned how embeddings capture semantic similarity beyond keywords.
Implemented smart, overlapping document chunking.
Built a ChromaDB-backed vector store and indexed document chunks.
Implemented a semantic search pipeline that converts queries to embeddings, performs similarity search, and returns ranked, filtered results.
Next experiments to improve relevance and production-readiness:
Try different embedding models (speed vs. accuracy tradeoffs).
Tune chunk sizes and overlap parameters based on document structure.
Persist vector stores to a stable location and design a scalable deployment.
Add metadata filtering (document type, last-updated) and combine with a ranker or reranker for hybrid retrieval.
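The metadata-filtering idea can be prototyped independently of any vector store. Here is a minimal sketch (field names, texts, and scores are invented for illustration) of restricting scored results by metadata before presenting them:

```python
# Hypothetical scored results: (metadata, chunk text, relevance score)
results = [
    ({"doc_type": "policy", "updated": "2024-01-10"}, "Remote work policy ...", 0.82),
    ({"doc_type": "faq", "updated": "2023-06-01"}, "Login help ...", 0.77),
    ({"doc_type": "policy", "updated": "2022-03-15"}, "Old travel policy ...", 0.70),
]

def filter_by_metadata(results, **criteria):
    """Keep only results whose metadata matches every given key/value pair."""
    return [r for r in results
            if all(r[0].get(k) == v for k, v in criteria.items())]

policies = filter_by_metadata(results, doc_type="policy")
print(len(policies))  # → 2
```

In practice, ChromaDB can apply such filters inside the store itself (LangChain's `similarity_search` accepts a `filter` argument on Chroma), which avoids scoring chunks you would discard anyway.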