Large language models (LLMs) rely on a context window to process input and generate useful responses. When an organization like TechCorp needs to search 500 GB of internal documents, traditional keyword search struggles with semantic variation (e.g., “vacation” vs “time off”). Embeddings and vector databases enable semantic search at scale and are a common foundation for retrieval-augmented generation (RAG) applications built with frameworks such as LangChain.
[Figure: hand-drawn chalkboard-style diagram labeled "Tech Corp's AI Application", with arrows linking it to a Large Language Model, a LangChain pipeline, RAG, prompt engineering, and a vector database icon.]
Problem scenario
  • Suppose the 500 GB dataset contains an employee handbook covering policies like time off, dress code, and equipment use.
  • Users may phrase the same intent using different words: “vacation policy”, “time off guidelines”, or “Can I request time off on a holiday?” Keyword-based search often misses relevant content unless the exact words are present.
Traditional SQL / keyword approach
  • SQL and text-index approaches use exact or pattern matching (LIKE, full-text search). Users must guess correct words or rely on manual synonym expansion.
-- Traditional SQL approach: keyword search
SELECT * FROM documents
WHERE content LIKE '%Vacation%' OR content LIKE '%vacation policy%';

-- Example user query:
-- "Can I request time off on a holiday?"
  • Drawbacks: brittle to phrasing, requires query engineering, and often produces noisy or incomplete results.
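The brittleness can be shown without SQL at all. Below is a minimal Python sketch of keyword matching; the documents and queries are made up for illustration:

```python
# Minimal sketch: keyword matching misses paraphrases.
docs = [
    "Employees accrue vacation days monthly.",
    "The dress code allows business casual attire.",
]

def keyword_search(query, documents):
    """Return documents containing any query word (case-insensitive)."""
    words = query.lower().split()
    return [d for d in documents if any(w in d.lower() for w in words)]

# A paraphrased query finds nothing, even though the first doc is relevant.
print(keyword_search("time off policy", docs))  # -> []
print(keyword_search("vacation", docs))         # -> matches the first doc
```

Unless the user happens to type "vacation", the relevant passage is never returned — exactly the failure mode semantic search addresses.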
Why embeddings + vector databases?
  • Embeddings convert text into fixed-length numeric vectors that capture semantic meaning. Related texts (e.g., “vacation” and “holiday”) map to nearby vectors in the embedding space.
  • When a user asks a natural language question, the system compares the embedding of the query with document embeddings and returns semantically similar content even if the wording differs.
  • This approach is ideal for RAG systems and LLM-driven assistants because it surfaces relevant context without retraining the LLM.
Popular vector databases and when to use them
| Vector Database | Best Use Case | Notes |
| --- | --- | --- |
| Pinecone | Production-grade semantic search at scale | Managed service, easy integration, strong performance |
| ChromaDB | Local prototyping and small-to-medium deployments | Open-source, developer-friendly embedding store |
(Also commonly used: FAISS for offline/embedded ANN indexing, Milvus for large-scale open-source vector search.)
Embeddings: turning text into meaning
  • Embedding models map text chunks (sentences, paragraphs) to numeric vectors. Similar meanings yield nearby vectors.
  • Example workflow:
    1. Split documents into chunks (paragraphs or sections).
    2. Call an embedding model to convert each chunk into a vector.
    3. Store vectors in a vector DB with metadata (source, offsets, timestamps).
    4. For a user query, embed the query and retrieve the nearest vectors (top-K or above a threshold).
  • Benefit: the LLM receives relevant passages for context regardless of exact phrasing.
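The four-step workflow above can be sketched end to end. The `embed` function here is a deliberately crude stand-in (a bag-of-words count over a tiny assumed vocabulary); a real system would call an embedding model API instead, and the store would be a vector database rather than a Python list:

```python
import math
from collections import Counter

VOCAB = ["vacation", "time", "off", "holiday", "dress", "code", "laptop"]

def embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector.
    Production systems call a model API here instead."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Steps 1-3: chunk, embed, and store vectors with metadata.
chunks = [
    ("handbook.md", "vacation and time off policy"),
    ("handbook.md", "dress code for the office"),
]
store = [{"source": src, "text": t, "vector": embed(t)} for src, t in chunks]

# Step 4: embed the query and retrieve the nearest chunk.
query_vec = embed("can I take time off on a holiday")
best = max(store, key=lambda row: cosine(row["vector"], query_vec))
print(best["text"])  # -> "vacation and time off policy"
```

Note that the query and the best-matching chunk share no requirement of identical phrasing — similarity in the vector space does the matching.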
Dimensionality trade-offs
| Dimension size | Pros | Cons |
| --- | --- | --- |
| ~256 | Lower storage & compute | May miss fine-grained semantic distinctions |
| ~768–1536 | Good semantic expressiveness (common for many models) | Higher storage & retrieval cost |
| >1536 | Captures more nuance | Increased cost; diminishing returns beyond a point |
  • Choose dimensionality based on the embedding model you select and the latency/storage budget.
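A quick back-of-envelope calculation makes the storage side of the trade-off concrete. This assumes float32 vectors (4 bytes per dimension) and a hypothetical workload of one million chunks; index overhead is excluded:

```python
def vector_storage_bytes(num_vectors, dims, bytes_per_dim=4):
    """Raw storage for float32 vectors; index overhead excluded."""
    return num_vectors * dims * bytes_per_dim

# Assumed workload: 1 million chunks.
for dims in (256, 768, 1536):
    gb = vector_storage_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dims: {gb:.2f} GB")
```

Going from 256 to 1536 dimensions multiplies raw vector storage sixfold, which is why dimensionality is chosen against a latency/storage budget rather than maximized.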
Retrieval: scoring and chunking
Two core decisions determine retrieval quality: similarity scoring and how you chunk documents.
Scoring / similarity metrics
  • Common similarity measures:
    • Cosine similarity: robust when vectors are length-normalized.
    • Dot product: equivalent to cosine similarity when vectors are length-normalized; otherwise vector magnitude also influences the score.
  • Index types:
    • Exact: brute-force nearest neighbors (slow for large corpora).
    • ANN (approximate): e.g., HNSW — trades a small amount of recall for large gains in speed and memory.
  • Tuning:
    • Choose top-K results and/or a similarity threshold to filter noisy matches.
    • A threshold set too low admits loosely related content (false positives); one set too high omits relevant passages.
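The top-K and threshold knobs compose naturally. A minimal sketch, assuming retrieval has already produced (chunk, similarity) pairs:

```python
def top_k_above_threshold(scored, k=3, threshold=0.5):
    """scored: list of (chunk, similarity) pairs.
    Keep the k highest-scoring matches that clear the threshold."""
    kept = [(c, s) for c, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical retrieval results for a "time off" query.
results = [("pto policy", 0.91), ("holiday schedule", 0.74),
           ("dress code", 0.32), ("parking rules", 0.12)]
print(top_k_above_threshold(results, k=2, threshold=0.5))
# -> [('pto policy', 0.91), ('holiday schedule', 0.74)]
```

Raising the threshold to 0.8 here would drop "holiday schedule" — a concrete instance of the false-positive vs. missed-passage tension described above.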
Chunking and overlap
  • Rather than embedding whole documents, split into chunks (paragraphs, sliding windows) and embed each chunk.
  • Overlap windows (e.g., 200-token chunks with 50-token overlap) preserve context across boundaries.
  • Trade-offs:
    • Smaller chunks → more precise matches, but more vectors to store and search.
    • Larger chunks → fewer vectors but higher risk of mixing topics and lowering retrieval precision.
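A sliding-window chunker with overlap is a few lines of code. This sketch operates on a pre-tokenized list (real pipelines would use a model-specific tokenizer):

```python
def chunk_tokens(tokens, size=200, overlap=50):
    """Split a token list into windows of `size` overlapping by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)
            if tokens[i:i + size]]

tokens = [f"tok{i}" for i in range(500)]
chunks = chunk_tokens(tokens, size=200, overlap=50)
print(len(chunks))     # -> 4
print(chunks[1][:1])   # -> ['tok150']  (each window starts 150 tokens later)
```

Each window shares its last 50 tokens with the start of the next, so a sentence that straddles a boundary still appears intact in at least one chunk.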
Comparison: SQL vs Vector-based retrieval
| Characteristic | SQL / Keyword Search | Vector / Embedding Search |
| --- | --- | --- |
| Query type | Keyword/pattern matching | Natural language / semantic |
| Robustness to paraphrase | Low | High |
| Setup complexity | Low | Medium–High (embeddings, chunking, indexes) |
| Best for | Exact matches, structured queries | Unstructured documents, LLM context retrieval |
Operational considerations and trade-offs
  • Design choices: embedding model, dimensionality, chunk size, overlap, similarity metric, ANN configuration, scoring thresholds.
  • Costs: storage for vectors, runtime cost of embedding generation, and compute for ANN queries.
  • Monitoring: track retrieval precision, false positives, and downstream LLM output quality to iterate on knobs.
  • Integration: vector DBs pair well with LangChain-style retrieval chains and RAG pipelines for chatbots and assistants.
[Figure: hand-drawn diagram comparing a traditional SQL database (rows, user burden) on the left with a vector database workflow on the right — retrieval feeding into scoring and chunk overlap, then into an LLM, annotated "no training required".]
Resources and next steps
  • Prototype: use a small subset of documents with an open-source embedding model and Chroma/FAISS to validate chunk size and overlap choices.
  • Production: evaluate managed options (Pinecone, managed Milvus) for indexing, scaling, and operational support.
  • RAG integration: feed retrieved chunks into your LLM prompt or LangChain retrieval chain and measure answer quality.
When designing a vector-backed retrieval system, treat embedding creation, chunking strategy, similarity metric, and scoring thresholds as tunable knobs. Proper configuration upfront reduces noise in retrieval and improves downstream LLM responses.