
Generating Embeddings

Embeddings convert text into high-dimensional vectors that preserve semantic meaning, enabling powerful use cases such as semantic search, document clustering, and recommendation systems. In this guide, we’ll walk through the process of generating embeddings using OpenAI’s API via LangChain.

Prerequisites

pip install openai langchain

Warning

Make sure your OPENAI_API_KEY environment variable is set before running the examples:

export OPENAI_API_KEY="your_api_key_here"
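
If you'd like to confirm the variable is visible to Python before making any API calls, a quick optional check looks like this:

import os

# Fail fast if the key was not exported in the current shell
if not os.environ.get("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set")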

1. Import and Initialize the Embedding Model

Begin by importing the OpenAIEmbeddings class from LangChain and creating an instance with your chosen model. Replace "text-embedding-3-large" with any supported model from the OpenAI Embeddings guide.

from langchain.embeddings import OpenAIEmbeddings

# Initialize the embeddings client
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

Model Selection

Different models produce vectors of varying dimensions and performance characteristics. Refer to the OpenAI documentation for details on each embedding model.
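
For example, you can initialize the same class with a smaller model; the snippet below is a sketch assuming text-embedding-3-small is available to your account (it returns 1536-dimensional vectors rather than 3072):

# A smaller, cheaper alternative that produces 1536-dimensional vectors
embeddings_small = OpenAIEmbeddings(model="text-embedding-3-small")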


2. Prepare the Input Documents

In a real application, you'd split your source documents (e.g., PDFs or web pages) into text chunks. For demonstration, we'll use four sports headlines:

| Document Index | Headline | Category |
| --- | --- | --- |
| 1 | Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship | Cricket |
| 2 | Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns | Football |
| 3 | Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup | Cricket |
| 4 | From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars | Football |

docs = [
    "Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship",
    "Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns",
    "Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup",
    "From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars"
]
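
If you were starting from a longer source rather than hand-picked headlines, you could produce comparable chunks with one of LangChain's text splitters. A minimal sketch (the chunk sizes are illustrative, and long_article_text is a placeholder for your own document string):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split a long document into overlapping chunks suitable for embedding
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_article_text)  # these chunks would play the role of docs above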

3. Generate Embeddings

Use the embed_documents method to convert each string into its corresponding vector. Depending on the number of documents, this may take a few seconds.

embed_docs = embeddings.embed_documents(docs)
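
The companion method for a single string (for example, a user's search query later on) is embed_query, which returns one vector instead of a list; the query text here is just an example:

# Embed a single query string; returns one vector rather than a list of vectors
query_vector = embeddings.embed_query("Who is favored to win the Cricket World Cup?")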

4. Inspect and Validate the Results

4.1 Confirm Count

Ensure that the number of embeddings matches the number of input documents:

len(embed_docs)
# Output: 4

4.2 View a Sample Embedding

Each embedding is a list of floats. For instance, the first document’s embedding might look like this:

embed_docs[0]
# [
#  -0.0257435211110519,
#   0.03683468933443764,
#  -0.05297664824695986,
#   0.02100751509421706,
#   ...
# ]
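
Printing a full 3072-element vector is unwieldy; slicing keeps the output readable:

# Show only the first five components of the first embedding
print(embed_docs[0][:5])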

4.3 Check Embedding Dimensions

Determine the dimensionality of the vectors your model produces:

len(embed_docs[0])
# Output: 3072

Vector Dimensions

Vector size varies by model. For example, text-embedding-3-large outputs 3072-dimensional embeddings. Always verify dimensions before storing in a vector database.
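
One way to run that verification across the whole batch is to collect the distinct vector lengths, which should yield a single value:

# Every embedding in the batch should share the same dimensionality
dims = {len(vec) for vec in embed_docs}
print(dims)  # expected: {3072} for text-embedding-3-large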


5. Next Steps

With embeddings generated, the typical workflow involves:

  1. Storing vectors in a vector database (e.g., Pinecone, Weaviate, or Chroma).
  2. Performing similarity searches to retrieve semantically related documents (a minimal preview is sketched after this list).
  3. Building applications like semantic search engines or Q&A chatbots.
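
As a small preview of the similarity-search step, here is an in-memory sketch using NumPy cosine similarity; a vector database would normally handle this for you, and the query text is just an example:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the four headlines against an example query
query_vector = embeddings.embed_query("cricket tournament highlights")
scores = [cosine_similarity(query_vector, vec) for vec in embed_docs]

for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")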

In the next chapter, we’ll cover setting up a vector database, indexing embeddings, and executing real-time semantic queries.

