Similarity Calculations

In this lesson we explain how similarity calculations let us find the most semantically relevant passages (the “needles” in a large digital haystack) without scanning every document. This is essential for semantic search and Retrieval-Augmented Generation (RAG), where the quality of the retrieved context directly affects how reliable the model's output is.

A traditional keyword search matches literal characters and phrases. For example, if you run a keyword search for how plants communicate, it will only surface passages that contain your exact words, and it may miss related concepts expressed with different vocabulary.

The image shows a bookshelf filled with books and a stack of books in front, alongside text comparing traditional keyword search terms: "Plants" and "Communicate."

Because keyword matching is literal, it can miss related concepts such as root signaling, nutrient-sharing networks, or fungal connections beneath the soil — passages that are semantically relevant but do not include your exact search terms.
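To make the limitation concrete, here is a minimal sketch of a literal keyword matcher. The `keyword_match` helper and the two sample passages are hypothetical and written only for illustration; real keyword engines add stemming, ranking, and more.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_match(query: str, passage: str) -> bool:
    """Return True only if every query word appears literally in the passage."""
    return tokenize(query) <= tokenize(passage)

query = "plants communicate"
passages = [
    "Plants communicate through chemical signals.",                     # shares the exact words
    "Trees share nutrients through fungal networks beneath the soil.",  # related, different words
]

matches = [keyword_match(query, p) for p in passages]
print(matches)  # [True, False]
```

The second passage is about the same topic, but the literal matcher rejects it because neither query word appears in it.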

The image shows an illustration of a bookshelf with stacked books and a list titled "What Gets Missed," highlighting points about tree communication, forest networks, and fungal connections.

Similarity-based methods avoid this problem by measuring how close two pieces of text are in meaning rather than comparing characters. This is done by mapping words, sentences, or documents into high-dimensional vectors called embeddings. Conceptually, embeddings are like GPS coordinates in a much higher-dimensional space. A single embedding might look like:

# Example embedding vector (truncated)
[0.2, 0.8, -0.3, 0.5, ...]

Each dimension can capture latent semantic features (for example “animalness”, “size”, “maturity”, “domestication”). Related terms cluster near each other in this vector space. A simple analogy: imagine two darts thrown at a board. If they land in the same place and point the same way, the throwers were thinking similarly. In vector terms, similarity examines the direction (angle) between vectors rather than raw magnitude.

The image explains how to calculate similarity, using an example of "automobile" and "car" with nearly identical meanings, depicted with a similarity score of 1.0 and an angle of approximately 0 degrees.

The standard numerical measure for directional similarity is cosine similarity:

cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)

Here, A · B is the dot product, and ||A|| and ||B|| are the vector norms (magnitudes). Values close to 1 indicate vectors pointing in the same direction (high semantic similarity), values near 0 indicate orthogonality (little relation), and values near -1 indicate opposite directions. In this idealized picture, -1 corresponds to opposite meanings; scores from real embedding models are rarely that negative, so treat -1 as the theoretical extreme. You can compute cosine similarity easily in Python:

import numpy as np

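def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b (1.0 = same direction)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Undefined for zero vectors; returning 0.0 is a convention, not a rule
        return 0.0
    return float(np.dot(a, b) / denom)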

# Example (small synthetic vectors)
vec_cat = np.array([0.9, 0.1, 0.0])
vec_kitten = np.array([0.88, 0.12, -0.01])
vec_dog = np.array([0.7, 0.2, 0.1])

print(cosine_similarity(vec_cat, vec_kitten))  # close to 1.0
print(cosine_similarity(vec_cat, vec_dog))     # lower than cat/kitten

The image explains how to calculate similarity using vectors, highlighting that opposite meanings have a 180-degree angle with a similarity of approximately -1.0, using "hot" and "cold" as examples.
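The three regimes described above (same direction, orthogonal, opposite) can be checked with hand-picked two-dimensional vectors. This is a small sketch with synthetic values, not output from a real embedding model; it also shows that scaling a vector does not change the score, because only direction matters.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])

same = cosine_similarity(a, np.array([3.0, 6.0]))         # same direction, three times longer
orthogonal = cosine_similarity(a, np.array([-2.0, 1.0]))  # perpendicular to a
opposite = cosine_similarity(a, np.array([-1.0, -2.0]))   # a reversed

# Rounded to hide floating-point noise in the norms
print(round(same, 6), round(orthogonal, 6), round(opposite, 6))  # 1.0 0.0 -1.0
```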

Why this matters: similarity calculations are the core of retrieval. If retrieved documents are irrelevant or only weakly related, the language model receives poor context and may generate incorrect or confidently wrong answers (hallucinations). Accurate similarity scoring improves the relevance of retrieved context, which in turn strengthens factual grounding for RAG systems.

The image displays a search query "How do I train my puppy?" with three document results showing their similarity scores. Document 2, "Puppy obedience and behavior basics," has the highest similarity score of 0.94.
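Ranking documents against a query, as in the example above, can be sketched as follows. The vectors are small synthetic stand-ins rather than real embeddings, the first and third titles are hypothetical, and `rank_by_cosine` is an illustrative helper, so the scores will not match the figure.

```python
import numpy as np

def rank_by_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, top_k: int = 2):
    """Return (row index, score) pairs for the top_k rows most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity of every document to the query
    order = np.argsort(-scores)[:top_k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

titles = [
    "Cat grooming tips",                    # hypothetical
    "Puppy obedience and behavior basics",
    "Car maintenance checklist",            # hypothetical
]
doc_vectors = np.array([
    [0.2, 0.9, 0.1],
    [0.9, 0.3, 0.1],
    [0.0, 0.1, 0.9],
])
query_vector = np.array([0.8, 0.4, 0.1])    # stand-in for "How do I train my puppy?"

ranked = rank_by_cosine(query_vector, doc_vectors, top_k=2)
for idx, score in ranked:
    print(f"{score:.2f}  {titles[idx]}")
```

Normalizing both sides once and taking a matrix product gives the same scores as calling a pairwise cosine function in a loop, which is why this pattern is common when there are many documents.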

High-quality similarity transforms retrieval into reliable context for generation. Poor similarity produces weak or misleading context and degrades model outputs.

The image contrasts the effects of poor versus good similarity in document retrieval, highlighting improved context and trustworthy results with good similarity.

Quick reference for interpreting cosine similarity scores (rules of thumb; useful thresholds vary by embedding model):

Cosine similarity range	Interpretation
0.8 to 1.0	Very high semantic similarity (near-synonyms, same concept)
0.5 to below 0.8	Moderate similarity (related topics, overlapping concepts)
0.0 to below 0.5	Weak or tangential relation
-1.0 to below 0.0	Unrelated to opposite meanings
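If you want to apply these bands in code, one possible sketch is shown below. The cut-offs are taken from the quick-reference table and treated as half-open intervals, which is an assumption; `interpret_score` is a hypothetical helper, and the thresholds should be tuned for the embedding model you actually use.

```python
def interpret_score(score: float) -> str:
    """Map a cosine similarity score to the rough bands from the quick reference."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    if score >= 0.8:
        return "very high"
    if score >= 0.5:
        return "moderate"
    if score >= 0.0:
        return "weak"
    return "opposite or unrelated"

labels = [interpret_score(s) for s in (0.94, 0.6, 0.1, -0.3)]
print(labels)  # ['very high', 'moderate', 'weak', 'opposite or unrelated']
```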

Key takeaways:

Similarity calculations measure semantic closeness, not string equality.
Embeddings map text into vectors that capture meaning; similar meanings occupy similar directions in vector space.
Cosine similarity compares vector directions and ignores vector magnitude, which makes it less sensitive to differences in document length.
Retrieval quality determines RAG quality: good similarity → better context → more accurate model answers.

Focus on semantic meaning rather than exact keywords. Proper embeddings plus cosine-based retrieval are fundamental to building reliable, well-grounded RAG systems.