Mastering Generative AI with OpenAI
Using Word Embeddings For Dynamic Context
Demo: Performing Similarity Search
In this guide, you’ll learn how to convert text into numerical vectors (embeddings) using OpenAI’s text-embedding-ada-002
model and perform similarity searches with NumPy. This technique is essential for building semantic search, recommendation engines, and context-aware chatbots.
1. Setup
1.1 Install Dependencies
Make sure you have the OpenAI SDK and NumPy installed. The code in this guide uses the legacy openai.Embedding interface, which was removed in version 1.0 of the SDK, so pin the package below 1.0:

```bash
pip install "openai<1" numpy
```
1.2 Import Libraries and Define Helper
```python
import openai
import numpy as np

# The legacy SDK (openai<1.0) reads the API key from the OPENAI_API_KEY
# environment variable, or you can set openai.api_key explicitly.

def text_embedding(text: str) -> list[float]:
    # Request an embedding for a single piece of text and return the vector.
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response["data"][0]["embedding"]
```
Note
Each embedding from text-embedding-ada-002
has a fixed dimension of 1536, regardless of the input length.
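As a quick sanity check, you can embed inputs of very different lengths and confirm that both vectors come back with 1536 dimensions. The sketch below simply reuses the text_embedding helper defined above; the sample strings are arbitrary:

```python
# Embedding length does not depend on input length.
short_vec = text_embedding("Hi")
long_vec = text_embedding("A much longer sentence about cookies, banks, rivers, and savings accounts.")

print(len(short_vec), len(long_vec))  # 1536 1536
```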
2. Sample Phrases
We’ll use four phrases that share keywords but differ in meaning:
| Phrase | Context |
|---|---|
| "Most of the websites provide the users with the choice of accepting or denying cookies" | Web cookies |
| "Olivia went to the bank to open a savings account" | Financial bank |
| "Sam sat under a tree that was on the bank of a river" | River bank |
| "John's cookies were only half-baked but he still carries them for Mary" | Edible cookies |
3. Generating Embeddings
Convert each phrase to its embedding vector:
```python
phrases = [
    "Most of the websites provide the users with the choice of accepting or denying cookies",
    "Olivia went to the bank to open a savings account",
    "Sam sat under a tree that was on the bank of a river",
    "John's cookies were only half-baked but he still carries them for Mary"
]

embeddings = [text_embedding(p) for p in phrases]
print(f"Embedding dimension: {len(embeddings[0])}")  # Expect 1536
```
4. Defining Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors in the semantic space; vectors that point in the same direction (identical meaning) yield a score of 1.0.
```python
def vector_similarity(vec1: list[float], vec2: list[float]) -> float:
    # Cosine similarity: dot product of the vectors divided by the product of their norms.
    a, b = np.array(vec1), np.array(vec2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```
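As a quick usage example, you can score every pair of sample phrases against each other. The sketch below assumes the phrases and embeddings lists from the previous step are still in scope:

```python
# Print the cosine similarity for every distinct pair of sample phrases.
for i in range(len(phrases)):
    for j in range(i + 1, len(phrases)):
        score = vector_similarity(embeddings[i], embeddings[j])
        print(f"{score:.2f}  {phrases[i][:35]}... vs {phrases[j][:35]}...")
```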
5. Running Similarity Searches
Define a function to find the most similar phrase from our list:
```python
def find_most_similar(query: str) -> str:
    # Embed the query, score it against every stored phrase embedding,
    # and return the phrase with the highest cosine similarity.
    q_emb = text_embedding(query)
    scores = [vector_similarity(q_emb, emb) for emb in embeddings]
    ranked = sorted(zip(scores, phrases), reverse=True, key=lambda x: x[0])
    best_score, best_phrase = ranked[0]
    print(f"Query: {query!r}\nBest match ({best_score:.2f}): {best_phrase}\n")
    return best_phrase
```
5.1 Example Queries
find_most_similar("Sam sat under a tree that was on the bank of a river")
find_most_similar("Mary got the biscuits from John that were not fully baked")
find_most_similar("It's recommended to put your savings in a financial institution")
find_most_similar("You get refreshed when you spend time with nature")
find_most_similar("Cookies are covered by GDPR if they collect information about users that could be used to identify them")
Expected outputs:
- Exact riverbank match → similarity ≈ 1.00
- Biscuits (edible cookies) → ≈ 0.92
- Financial advice → ≈ 0.84
- Nature reference → ≈ 0.82
- GDPR cookies → ≈ 0.83
6. Discussion
- Embeddings capture semantic context, not just surface-level keywords.
- All vectors share the same dimensionality (1536), so they live in a common embedding space.
- Cosine similarity retrieves items by meaning, not by exact word overlap.
Note
This approach powers many AI-driven features such as semantic search, recommendation engines, and dynamic context for chatbots.
Experiment by adding new phrases, querying different sentences, and watching how similarity scores adapt to meaning.
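For example, a minimal way to extend the demo is to append a new phrase, embed it, and run a fresh query. The sketch below assumes the phrases and embeddings lists and the find_most_similar function defined earlier; the new phrase and query are just illustrative:

```python
# Add a new phrase and its embedding to the existing lists, then search again.
new_phrase = "The baker pulled a fresh tray of chocolate chip cookies out of the oven"
phrases.append(new_phrase)
embeddings.append(text_embedding(new_phrase))

find_most_similar("Freshly baked treats smell wonderful")
```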