Performing Semantic Search

In this tutorial, you’ll learn how to generate embeddings using OpenAI, store them in a Chroma vector database, and execute simple semantic searches. By the end, you’ll understand how to retrieve conceptually related documents—even when they share no exact keywords.

Table of Contents

  1. Prerequisites
  2. Step 1: Install and Import Dependencies
  3. Step 2: Prepare Your Documents
  4. Step 3: Create a Chroma Vector Store
  5. Step 4: Execute Semantic Queries
  6. How It Works
  7. Further Reading

Prerequisites

pip install langchain chromadb openai

Note

Ensure your OPENAI_API_KEY environment variable is set:

export OPENAI_API_KEY="your_api_key_here"
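Before making any API calls, it can help to fail fast if the key is missing. A minimal check (the helper name is our own, not part of LangChain or the OpenAI SDK):

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```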

Step 1: Install and Import Dependencies

Begin by importing LangChain’s embedding model and the Chroma vector store:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize the embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

Step 2: Prepare Your Documents

Here, we create a small collection of sports headlines. In real applications, you might load text from files, PDFs, or a database.

docs = [
    "Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship",
    "Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns",
    "Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup",
    "From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars"
]
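As noted above, production documents usually come from files rather than inline literals. One possible loader sketch using only the standard library (the .txt-per-document layout is an assumption for illustration):

```python
from pathlib import Path

def load_texts(directory: str) -> list[str]:
    """Read every .txt file in a directory into a list of document strings."""
    return [
        path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    ]
```

The resulting list of strings can be passed to Chroma.from_texts exactly like the inline docs list.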

Step 3: Create a Chroma Vector Store

Chroma calls the supplied embedding model to embed each document, then stores and indexes the resulting vectors in a local vector database:

vectorstore = Chroma.from_texts(texts=docs, embedding=embeddings)
| Component | Description | Example |
| --- | --- | --- |
| Embeddings | Converts text into high-dimensional vectors | OpenAIEmbeddings(model="text-embedding-ada-002") |
| Vector Database | Stores and indexes embedding vectors for similarity operations | Chroma.from_texts(texts=docs, embedding=embeddings) |

Step 4: Execute Semantic Queries

With your documents indexed, you can now query the vector store. Semantic search will return contextually related headlines, even without shared keywords.

Cricket Query

results_cricket = vectorstore.similarity_search("Rohit Sharma", k=2)
for doc in results_cricket:
    print(doc.page_content)

Output:

Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup
Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship

Football Query

results_football = vectorstore.similarity_search("Lionel Messi", k=2)
for doc in results_football:
    print(doc.page_content)

Output:

From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars
Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns

Tip

Adjust k to change the number of returned results. For instance, k=1 returns only the single most similar document.
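Under the hood, k simply caps how many of the nearest vectors are returned. A toy illustration of top-k selection over precomputed distances (standard library only; the document ids and distance values are made up):

```python
import heapq

def top_k(distances: dict[str, float], k: int) -> list[str]:
    """Return the k document ids with the smallest distance to the query."""
    return heapq.nsmallest(k, distances, key=distances.get)

# Hypothetical query-to-document distances (smaller = more similar)
scores = {"doc_a": 0.12, "doc_b": 0.55, "doc_c": 0.31}
print(top_k(scores, 2))  # the two closest documents
```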


How It Works

  1. Embedding Generation
    Both documents and queries are transformed into vector embeddings by the same model.
  2. Vector Similarity
    Chroma computes distances between the query vector and stored document vectors, retrieving the top-k closest matches.
  3. Semantic Matching
    Unlike keyword-based search, semantic search finds conceptually related content—even if the exact terms differ.
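The similarity step above can be sketched with plain cosine similarity over toy vectors. Real embeddings from text-embedding-ada-002 have 1,536 dimensions; these three-dimensional vectors are illustrative only:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]          # e.g. an embedded "cricket" query
cricket_doc = [0.8, 0.2, 0.1]    # conceptually close document
football_doc = [0.1, 0.9, 0.2]   # conceptually distant document

# The cricket document scores higher, even with no shared keywords
print(cosine_similarity(query, cricket_doc) > cosine_similarity(query, football_doc))
```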

Further Reading

Feel free to integrate this pattern into your QA systems, recommendation engines, or any application requiring intelligent text retrieval.
