Generating Embeddings
Embeddings convert text into high-dimensional vectors that preserve semantic meaning, enabling powerful use cases such as semantic search, document clustering, and recommendation systems. In this guide, we’ll walk through the process of generating embeddings using OpenAI’s API via LangChain.
Prerequisites
- Python 3.7+
- The openai and langchain Python packages
- An OpenAI API key

Install both packages with:

pip install openai langchain
Warning
Make sure your OPENAI_API_KEY environment variable is set before running the examples:
export OPENAI_API_KEY="your_api_key_here"
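If you want to fail fast, you can also verify the key from Python before making any API calls. A minimal sketch (the error message is illustrative):

import os

# Abort early with a clear message if the API key is missing
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the examples.")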
1. Import and Initialize the Embedding Model
Begin by importing the OpenAIEmbeddings class from LangChain and creating an instance with your chosen model. Replace "text-embedding-3-large" with any supported model from the OpenAI Embeddings guide.
from langchain.embeddings import OpenAIEmbeddings
# Note: newer LangChain releases move this class to the langchain_openai package:
# from langchain_openai import OpenAIEmbeddings

# Initialize the embeddings client
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
Model Selection
Different models produce vectors of varying dimensions and performance characteristics. Refer to the OpenAI documentation for details on each embedding model.
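For instance, the text-embedding-3 models can return shortened vectors. A minimal sketch using the dimensions parameter from the newer langchain_openai package (the value 1024 is arbitrary; confirm support in your installed version):

from langchain_openai import OpenAIEmbeddings

# Request reduced-size vectors to save storage; 1024 is illustrative
compact_embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)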
2. Prepare the Input Documents
In a real application, you’d split your source (e.g., PDF, web pages) into text chunks. For demonstration, we’ll use four sports headlines:
| Document Index | Headline | Category |
| --- | --- | --- |
| 1 | Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship | Cricket |
| 2 | Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns | Football |
| 3 | Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup | Cricket |
| 4 | From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars | Football |
docs = [
"Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship",
"Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns",
"Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup",
"From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars"
]
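In a real pipeline, chunks like the docs list above would typically come from a text splitter. A minimal sketch using LangChain's RecursiveCharacterTextSplitter (the chunk sizes are illustrative, and long_text stands in for your extracted source):

from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = "..."  # placeholder for text extracted from a PDF or web page

# Split the source into overlapping chunks suitable for embedding
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_text)  # returns a list of strings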
3. Generate Embeddings
Use the embed_documents method to convert each string into its corresponding vector. Depending on the number of documents, this call may take a few seconds.
embed_docs = embeddings.embed_documents(docs)
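For search queries you'll later compare against these vectors, LangChain provides the companion embed_query method, which embeds a single string (the query text here is just an example):

# Embed one query string; returns a single vector rather than a list of vectors
query_vector = embeddings.embed_query("Who is leading the Cricket World Cup?")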
4. Inspect and Validate the Results
4.1 Confirm Count
Ensure that the number of embeddings matches the number of input documents:
len(embed_docs)
# Output: 4
4.2 View a Sample Embedding
Each embedding is a list of floats. For instance, the first document’s embedding might look like this:
embed_docs[0]
# [
# -0.0257435211110519,
# 0.03683468933443764,
# -0.05297664824695986,
# 0.02100751509421706,
# ...
# ]
4.3 Check Embedding Dimensions
Determine the dimensionality of the vectors your model produces:
len(embed_docs[0])
# Output: 3072
Vector Dimensions
Vector size varies by model. For example, text-embedding-3-large outputs 3072-dimensional embeddings. Always verify dimensions before storing them in a vector database.
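One way to enforce that check, as a small sketch over the vectors generated above:

# Confirm every vector shares one dimensionality before indexing
dims = {len(vec) for vec in embed_docs}
assert len(dims) == 1, f"Inconsistent embedding dimensions: {dims}"
print(dims.pop())  # 3072 for text-embedding-3-large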
5. Next Steps
With embeddings generated, the typical workflow involves:
- Storing vectors in a vector database (e.g., Pinecone, Weaviate, or Chroma).
- Performing similarity searches to retrieve semantically related documents (see the sketch after this list).
- Building applications like semantic search engines or Q&A chatbots.
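To preview the similarity-search step before introducing a vector database, here is a minimal cosine-similarity sketch using NumPy (the query string is illustrative):

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product over the product of the vector norms
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query and print the closest match
query_vec = embeddings.embed_query("news about the cricket tournament")
scores = [cosine_similarity(query_vec, d) for d in embed_docs]
print(docs[int(np.argmax(scores))])  # expected: one of the cricket headlines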
In the next chapter, we’ll cover setting up a vector database, indexing embeddings, and executing real-time semantic queries.