Mastering Generative AI with OpenAI
Using Word Embeddings For Dynamic Context
Demo: Performing Similarity Search
In this guide, you’ll learn how to convert text into numerical vectors (embeddings) using OpenAI’s text-embedding-ada-002
model and perform similarity searches with NumPy. This technique is essential for building semantic search, recommendation engines, and context-aware chatbots.
1. Setup
1.1 Install Dependencies
Make sure you have the OpenAI SDK and NumPy installed. The code in this guide uses the legacy openai.Embedding interface, which was removed in version 1.0 of the SDK, so pin the package below 1.0:

```bash
pip install "openai<1" numpy
```
1.2 Import Libraries and Define Helper
```python
import openai
import numpy as np

# The legacy SDK (openai<1.0) reads the API key from the OPENAI_API_KEY
# environment variable, or you can set openai.api_key explicitly.

def text_embedding(text: str) -> list[float]:
    # Request an embedding for a single piece of text and return the vector.
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response["data"][0]["embedding"]
```
Note
Each embedding from text-embedding-ada-002
has a fixed dimension of 1536, regardless of the input length.
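As a quick sanity check, you can embed inputs of very different lengths and confirm that both vectors come back with 1536 dimensions. The sketch below simply reuses the text_embedding helper defined above; the sample strings are arbitrary:

```python
# Embedding length does not depend on input length.
short_vec = text_embedding("Hi")
long_vec = text_embedding("A much longer sentence about cookies, banks, rivers, and savings accounts.")

print(len(short_vec), len(long_vec))  # 1536 1536
```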
2. Sample Phrases
We’ll use four phrases that share keywords but differ in meaning:
| Phrase | Context |
|---|---|
| "Most of the websites provide the users with the choice of accepting or denying cookies" | Web cookies |
| "Olivia went to the bank to open a savings account" | Financial bank |
| "Sam sat under a tree that was on the bank of a river" | River bank |
| "John's cookies were only half-baked but he still carries them for Mary" | Edible cookies |
3. Generating Embeddings
Convert each phrase to its embedding vector:
```python
phrases = [
    "Most of the websites provide the users with the choice of accepting or denying cookies",
    "Olivia went to the bank to open a savings account",
    "Sam sat under a tree that was on the bank of a river",
    "John's cookies were only half-baked but he still carries them for Mary"
]

embeddings = [text_embedding(p) for p in phrases]
print(f"Embedding dimension: {len(embeddings[0])}")  # Expect 1536
```
4. Defining Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors in the semantic space; vectors that point in the same direction (identical meaning) yield a score of 1.0.
```python
def vector_similarity(vec1: list[float], vec2: list[float]) -> float:
    # Cosine similarity: dot product of the vectors divided by the product of their norms.
    a, b = np.array(vec1), np.array(vec2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```
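As a quick usage example, you can score every pair of sample phrases against each other. The sketch below assumes the phrases and embeddings lists from the previous step are still in scope:

```python
# Print the cosine similarity for every distinct pair of sample phrases.
for i in range(len(phrases)):
    for j in range(i + 1, len(phrases)):
        score = vector_similarity(embeddings[i], embeddings[j])
        print(f"{score:.2f}  {phrases[i][:35]}... vs {phrases[j][:35]}...")
```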
5. Running Similarity Searches
Define a function to find the most similar phrase from our list:
```python
def find_most_similar(query: str) -> str:
    # Embed the query, score it against every stored phrase embedding,
    # and return the phrase with the highest cosine similarity.
    q_emb = text_embedding(query)
    scores = [vector_similarity(q_emb, emb) for emb in embeddings]
    ranked = sorted(zip(scores, phrases), reverse=True, key=lambda x: x[0])
    best_score, best_phrase = ranked[0]
    print(f"Query: {query!r}\nBest match ({best_score:.2f}): {best_phrase}\n")
    return best_phrase
```
5.1 Example Queries
find_most_similar("Sam sat under a tree that was on the bank of a river")
find_most_similar("Mary got the biscuits from John that were not fully baked")
find_most_similar("It's recommended to put your savings in a financial institution")
find_most_similar("You get refreshed when you spend time with nature")
find_most_similar("Cookies are covered by GDPR if they collect information about users that could be used to identify them")
Expected outputs:
- Exact riverbank match → similarity ≈ 1.00
- Biscuits (edible cookies) → ≈ 0.92
- Financial advice → ≈ 0.84
- Nature reference → ≈ 0.82
- GDPR cookies → ≈ 0.83
6. Discussion
- Embeddings capture semantic context, not just surface-level keywords.
- All vectors share the same dimensionality (1536), so they live in a common embedding space.
- Cosine similarity retrieves items by meaning, not by exact word overlap.
Note
This approach powers many AI-driven features such as semantic search, recommendation engines, and dynamic context for chatbots.
Experiment by adding new phrases, querying different sentences, and watching how similarity scores adapt to meaning.
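For example, a minimal way to extend the demo is to append a new phrase, embed it, and run a fresh query. The sketch below assumes the phrases and embeddings lists and the find_most_similar function defined earlier; the new phrase and query are just illustrative:

```python
# Add a new phrase and its embedding to the existing lists, then search again.
new_phrase = "The baker pulled a fresh tray of chocolate chip cookies out of the oven"
phrases.append(new_phrase)
embeddings.append(text_embedding(new_phrase))

find_most_similar("Freshly baked treats smell wonderful")
```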