Understanding Embedding Models

In this lesson you’ll learn what embedding models are, how embeddings are generated, and why they are essential to retrieval-augmented generation (RAG) and semantic search systems.

Keyword search vs. semantic search

Traditional keyword search works like Ctrl+F: it finds literal text matches but ignores meaning. A search for “car” misses documents that say “automobile” or “vehicle”, and a search for “physician” misses content that uses “doctor”. Because keyword search ignores intent and context, users often have to guess the exact phrasing to get relevant results.

The image is a presentation slide titled "Embedding Models" with two points: "Context and intent are ignored" and "Users must guess exact terms."
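
To make the limitation concrete, here is a minimal sketch of literal matching. The three documents and the `keyword_search` helper are made up for illustration; they are not part of any library.

```python
# Toy corpus; the documents are invented for this illustration.
documents = [
    "Affordable automobile insurance for new drivers",
    "How to choose a family vehicle",
    "Used car prices fell last quarter",
]


def keyword_search(query: str, docs: list[str]) -> list[str]:
    """Return documents that contain the query as a literal, case-insensitive substring."""
    needle = query.lower()
    return [doc for doc in docs if needle in doc.lower()]


# Only the document that literally contains "car" is returned; the
# "automobile" and "vehicle" documents are missed despite being relevant.
matches = keyword_search("car", documents)
print(matches)
```

All three documents are about cars, but literal matching finds only one of them.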

Semantic search solves this by matching meaning instead of exact tokens, so a search system can return relevant results even when the query and the document use different words or phrasing.

The image is a graphic about "Embedding Models," focusing on "Semantic Search" that emphasizes understanding meaning beyond literal text. It includes an icon of a magnifying glass over a list.

What is an embedding?

An embedding is a numerical vector representation of text: an array of floating-point numbers (often hundreds or thousands of dimensions) that encodes semantic meaning. In this high-dimensional vector space, similar concepts cluster together: for example, “dog” and “cat” are close, while “dog” and “car” are farther apart. Because distance in the space tracks difference in meaning, comparing two pieces of text reduces to comparing their vectors.

The image explains embeddings, showing how text is converted into numerical vectors and describes semantic distance, indicating that words with similar meanings are closer in the vector space.

Embeddings use vector mathematics to quantify similarity. Common measures include:

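Cosine similarity: the dot product of two vectors after each is scaled to unit length. It equals the cosine of the angle between them, ranges from -1 to 1, and ignores vector magnitude.
Dot product: the raw inner product, which also reflects vector magnitude. For unit-length vectors it is identical to cosine similarity, which is why some models and indexes use it directly.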

Higher similarity scores indicate closer semantic meaning (e.g., a higher score between “dog” and “cat” than between “dog” and “car”).
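
Both measures can be sketched in a few lines of plain Python. The three-dimensional vectors below are hand-picked toy values, not the output of a real embedding model; they are chosen only so that “dog” and “cat” point in similar directions.

```python
import math


def dot(a, b):
    """Raw inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hand-picked toy vectors (assumption: real embeddings have far more dimensions).
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

dog_cat = cosine_similarity(dog, cat)
dog_car = cosine_similarity(dog, car)
print(f"dog vs cat: {dog_cat:.3f}")
print(f"dog vs car: {dog_car:.3f}")

# Scaling a vector changes the dot product but not the cosine similarity.
dog_scaled = [2 * x for x in dog]
print(dot(dog_scaled, cat), cosine_similarity(dog_scaled, cat))
```

The last two lines show the practical difference: doubling a vector doubles its dot product with another vector, while the cosine similarity stays the same.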

How are embeddings created?

Typical production flow for generating embeddings:

Tokenize raw text into subword units.
Feed tokens into a neural network trained on large corpora.
Produce a dense vector that captures semantic relationships across tokens and context.

The image is a flowchart describing how embeddings are created, with four steps: Text Input, Tokenization, Neural Network, and Vector Output.
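
The flow above can be sketched with a deliberately simplified stand-in. Everything here is an assumption made for illustration: the tiny `VOCAB`, the pseudo-random `EMBEDDING_TABLE` that replaces trained weights, and mean pooling in place of a real neural network. A production model uses learned weights and context-aware layers, so treat this only as a picture of the data flow from text to tokens to a single vector.

```python
import random

# Hypothetical toy vocabulary of subword units; real tokenizers learn
# tens of thousands of units from data.
VOCAB = ["jump", "s", "ing", "ed", "fox", "dog", "lazy", "quick", "the", "over"]
DIM = 4

# Stand-in for learned weights: one fixed pseudo-random vector per subword.
rng = random.Random(0)
EMBEDDING_TABLE = {tok: [rng.uniform(-1, 1) for _ in range(DIM)] for tok in VOCAB}


def tokenize(text):
    """Greedy longest-match split of each word into known subword units."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            for end in range(len(word), start, -1):
                if word[start:end] in EMBEDDING_TABLE:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                start += 1  # skip a character that no subword covers
    return tokens


def embed(text):
    """Mean-pool the token vectors into one fixed-size vector."""
    tokens = tokenize(text)
    if not tokens:
        return [0.0] * DIM
    vectors = [EMBEDDING_TABLE[t] for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


tokens = tokenize("The quick fox jumps over the lazy dog")
vector = embed("The quick fox jumps over the lazy dog")
print(tokens)       # "jumps" is split into the subwords "jump" and "s"
print(len(vector))  # one vector of DIM values, regardless of input length
```

Note that the output has the same size for any input length, which is what makes vectors from different texts directly comparable.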

Practical example: generating an embedding with the OpenAI Python client.

from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()
response = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="text-embedding-3-large",
)

# The response holds one embedding per input string:
vector = response.data[0].embedding  # e.g. [0.032, -0.112, 0.540, 0.891, -0.234, 0.678, ...]
# (vector truncated; embeddings may have hundreds or thousands of dimensions)
print(len(vector))  # 3072 by default for text-embedding-3-large

Each element of the vector contributes to positioning the text in semantic space. Individual elements are generally not interpretable on their own; the meaning lies in the vector as a whole and in its position relative to other vectors.

Storage and retrieval: vector databases

Embeddings are most useful when stored in vector databases designed for similarity search at scale. These systems use specialized indexes (including approximate nearest neighbor algorithms) to find the closest vectors quickly, even across millions or billions of items. Common steps when preparing documents for a RAG system:

Convert each document (or document chunk) into an embedding using a chosen model.
Store embeddings and metadata (document IDs, text, source, timestamps) in a vector database.
At query time, embed the user query with the same model and search the vector DB for nearest neighbors.
Use the top retrieved documents as grounding context for an LLM to generate factual, relevant responses.

Always embed both your documents and user queries with the same embedding model. Vectors from different models live in different vector spaces (often with different dimensions), so similarity scores computed across them are not meaningful.
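
The four steps can be sketched with a small in-memory store. This is a brute-force illustration, not how any particular vector database is implemented: `InMemoryVectorStore`, `toy_embed`, and the hand-picked `TOY_WORD_VECTORS` are all hypothetical names and values, and a real system would call an embedding model and use an approximate nearest neighbor index instead of scanning every record.

```python
import math
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for an embedding model: hand-picked 2-d word vectors.
TOY_WORD_VECTORS = {
    "car": [1.0, 0.0], "automobile": [0.9, 0.1], "vehicle": [0.8, 0.2],
    "doctor": [0.0, 1.0], "physician": [0.1, 0.9],
}


def toy_embed(text):
    """Average the vectors of known words; unknown words are ignored."""
    vectors = [TOY_WORD_VECTORS[w] for w in text.lower().split() if w in TOY_WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class InMemoryVectorStore:
    """Brute-force store; real vector databases add ANN indexes on top."""
    embed: Callable[[str], list[float]]  # one function for documents and queries
    records: list = field(default_factory=list)

    def add(self, doc_id, text, metadata=None):
        # Steps 1 and 2: embed the document, store the vector with its metadata.
        self.records.append(
            {"id": doc_id, "text": text, "metadata": metadata or {}, "vector": self.embed(text)}
        )

    def search(self, query, k=2):
        # Step 3: embed the query with the same function, then rank by similarity.
        query_vector = self.embed(query)
        scored = [(cosine_similarity(query_vector, r["vector"]), r) for r in self.records]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore(embed=toy_embed)
store.add("doc-1", "automobile repair guide", {"source": "manuals"})
store.add("doc-2", "find a physician nearby", {"source": "directory"})
store.add("doc-3", "vehicle registration steps", {"source": "gov"})

results = store.search("car", k=2)
for score, record in results:
    print(f"{score:.3f}  {record['id']}  {record['text']}")
```

The query “car” shares no words with any stored document, yet the two car-related documents rank first. Step 4, passing the retrieved text to an LLM, happens outside the store.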

Vector database examples

Vector DB	Use case
`Pinecone` (pinecone.io)	Managed vector DB with production-ready indexing and scaling
`FAISS` (github.com/facebookresearch/faiss)	Open-source library (not a database server) for fast in-process similarity search, exact and ANN
`Chroma` (trychroma.com)	Lightweight open-source vector store for prototyping
`Weaviate` (weaviate.io)	Schema-driven vector DB with semantic search features

Why embeddings power RAG systems

Embeddings enable RAG systems to:

Perform semantic matching by placing queries and documents in the same vector space.
Reduce hallucinations by grounding LLM outputs in retrieved, relevant documents.
Support dynamic knowledge: add or update embeddings as information changes without retraining the LLM.

The image explains why RAG (Retrieval-Augmented Generation) needs embeddings, highlighting semantic matching, reduced hallucinations, and dynamic knowledge.
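
Grounding usually comes down to placing the retrieved passages in the prompt ahead of the question. The sketch below shows one possible way to do that; the `build_grounded_prompt` helper, the prompt wording, and the two sample passages are assumptions for illustration, not a prescribed format.

```python
def build_grounded_prompt(question, retrieved_docs):
    """Assemble retrieved passages and the user question into one prompt string."""
    context = "\n".join(
        f"[{i}] ({doc['source']}) {doc['text']}"
        for i, doc in enumerate(retrieved_docs, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical passages, as they might come back from a vector search.
retrieved = [
    {"source": "policy.md", "text": "Refunds are accepted within 30 days of purchase."},
    {"source": "faq.md", "text": "Refunds are issued to the original payment method."},
]
prompt = build_grounded_prompt("How long do I have to request a refund?", retrieved)
print(prompt)
```

Because the knowledge lives in the retrieved passages rather than in the model weights, updating the stored documents changes the answers without retraining the LLM.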

Choosing an embedding model: trade-offs

Model selection is a balance of accuracy, speed, and cost:

Consideration	Trade-off
Dimensionality & size	Larger models often yield higher-quality vectors but increase storage and compute costs.
Latency	Smaller models return embeddings faster and at lower cost, which suits high-throughput systems and prototypes.
Accuracy	Higher-fidelity embeddings improve retrieval relevance and downstream LLM performance.
Total cost	Evaluate total cost of ownership: embedding count, dimensionality, query volume, and infrastructure.

For enterprise-grade systems where accuracy is critical, invest in higher-quality embeddings and thorough evaluation. For prototypes or cost-sensitive deployments, smaller models can provide acceptable performance with lower cost and latency.
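
A quick back-of-the-envelope calculation shows how dimensionality drives storage. The sketch assumes each value is stored as a 4-byte float32 and ignores index and metadata overhead; the dimension sizes and the one-million-vector corpus are example figures, not recommendations.

```python
def embedding_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage, assuming float32 values and no index or metadata overhead."""
    return num_vectors * dimensions * bytes_per_value


GIB = 1024 ** 3
for dims in (384, 1536, 3072):
    size = embedding_storage_bytes(1_000_000, dims)
    print(f"{dims:>5} dims x 1M vectors: {size / GIB:.2f} GiB")
```

Storage grows linearly with both the number of vectors and the number of dimensions, so doubling the dimensionality doubles the raw footprint.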

Key takeaways

Meaning as math: embeddings convert words and sentences into numerical vectors that capture semantic relationships.
Semantic search: embeddings enable retrieval by meaning rather than exact text matches.
RAG foundation: embeddings allow LLMs to draw on relevant, retrieved context for grounded responses.
Quality matters: better embeddings yield more accurate and reliable retrieval, which in turn improves the answers of the systems built on it.

Embeddings are the quiet engine behind semantic search and RAG: they bridge natural language and vector mathematics so systems can find and retrieve the information that best matches user intent.
