Performing Semantic Search
In this tutorial, you’ll learn how to generate embeddings using OpenAI, store them in a Chroma vector database, and execute simple semantic searches. By the end, you’ll understand how to retrieve conceptually related documents—even when they share no exact keywords.
Table of Contents
- Prerequisites
- Step 1: Install and Import Dependencies
- Step 2: Prepare Your Documents
- Step 3: Create a Chroma Vector Store
- Step 4: Execute Semantic Queries
- How It Works
- Further Reading
Prerequisites
- Python 3.7+
- An OpenAI API key
- Install the required libraries:

```bash
pip install langchain chromadb openai
```
Note
Ensure your `OPENAI_API_KEY` environment variable is set:

```bash
export OPENAI_API_KEY="your_api_key_here"
```
Step 1: Install and Import Dependencies
Begin by importing LangChain’s embedding model and the Chroma vector store:
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize the embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
```
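As a quick sanity check, you can embed a sample string with `embed_query` and inspect the result (a minimal sketch; the sample text is arbitrary):

```python
# Embed a sample string and inspect the vector's dimensionality
sample_vector = embeddings.embed_query("semantic search test")
print(len(sample_vector))  # text-embedding-ada-002 produces 1536-dimensional vectors
```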
Step 2: Prepare Your Documents
Here, we create a small collection of sports headlines. In real applications, you might load text from files, PDFs, or a database.
```python
docs = [
    "Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship",
    "Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns",
    "Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup",
    "From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars",
]
```
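If your text lives on disk instead, you could read it in with plain Python before indexing (a sketch; the `headlines/` directory and one-headline-per-file layout are hypothetical):

```python
from pathlib import Path

# Hypothetical layout: one headline per .txt file in a headlines/ directory
docs = [path.read_text(encoding="utf-8") for path in Path("headlines").glob("*.txt")]
```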
Step 3: Create a Chroma Vector Store
Chroma will automatically generate embeddings for each document and store them in a local vector database:
```python
vectorstore = Chroma.from_texts(texts=docs, embedding=embeddings)
```
| Component | Description | Example |
|---|---|---|
| Embeddings | Converts text into high-dimensional vectors | `OpenAIEmbeddings(model="text-embedding-ada-002")` |
| Vector Database | Stores and indexes embedding vectors for similarity operations | `Chroma.from_texts(texts=docs, embedding=embeddings)` |
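By default this store lives in memory for the current session. If you want the index to survive restarts, Chroma's LangChain wrapper accepts a `persist_directory` argument (a minimal sketch; the `./chroma_db` path is an arbitrary choice):

```python
# Persist the index to disk so it can be reloaded in later sessions
vectorstore = Chroma.from_texts(
    texts=docs,
    embedding=embeddings,
    persist_directory="./chroma_db",  # arbitrary local path
)
```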
Step 4: Execute Semantic Queries
With your documents indexed, you can now query the vector store. Semantic search will return contextually related headlines, even without shared keywords.
Cricket Query
```python
results_cricket = vectorstore.similarity_search("Rohit Sharma", k=2)
for doc in results_cricket:
    print(doc.page_content)
```
Output:

```
Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup
Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship
```
Football Query
```python
results_football = vectorstore.similarity_search("Lionel Messi", k=2)
for doc in results_football:
    print(doc.page_content)
```
Output:

```
From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars
Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns
```
Tip
Adjust `k` to change the number of returned results. For instance, `k=1` returns only the single most similar document.
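For example, re-running the cricket query with `k=1`:

```python
# k=1 returns only the single closest headline
top_hit = vectorstore.similarity_search("Rohit Sharma", k=1)
print(top_hit[0].page_content)
```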
How It Works
- Embedding Generation: both documents and queries are transformed into vector embeddings by the same model.
- Vector Similarity: Chroma computes distances between the query vector and the stored document vectors, retrieving the top-`k` closest matches (see the sketch after this list).
- Semantic Matching: unlike keyword-based search, semantic search finds conceptually related content, even if the exact terms differ.
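To inspect those distances directly, the Chroma wrapper also exposes `similarity_search_with_score`, which returns each document alongside its raw score (a minimal sketch; under Chroma's default distance metric, lower scores mean closer matches):

```python
# Retrieve documents together with their raw distance scores
scored = vectorstore.similarity_search_with_score("Rohit Sharma", k=2)
for doc, score in scored:
    print(f"{score:.4f}  {doc.page_content}")
```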
Further Reading
Feel free to integrate this pattern into your QA systems, recommendation engines, or any application requiring intelligent text retrieval.