LangChain

Performing Retrieval

Welcome back! In this lesson, we’ll dive into retrieval techniques and explore how retrieval-augmented generation (RAG) enhances your LLM applications with external context.

Why Retrieval Matters

Large language models rely on context—external data and background—to generate accurate, factual responses. Without relevant context, models often “hallucinate,” producing plausible but incorrect information.

| Data Source          | Example Systems                        | Use Case                         |
| -------------------- | -------------------------------------- | -------------------------------- |
| Relational Databases | MySQL, PostgreSQL, SQLite              | Structured queries via SQL       |
| NoSQL Databases      | MongoDB                                | Flexible document storage        |
| Full-Text Search     | Elasticsearch                          | Keyword-based document retrieval |
| Vector Databases     | Pinecone, Chroma, Milvus, Weaviate     | Semantic search with embeddings  |
| Document Stores      | PDF, Word, HTML, CSV (e.g., Amazon S3) | Unstructured file retrieval      |
| APIs & Web Search    | REST endpoints, real-time web scraping | Live data fetching               |

The image displays logos of various database systems like MySQL, SQLite, PostgreSQL, and MongoDB, along with icons representing different file types and formats such as PDF, HTML, DOC, and API.

Note

Retrieving only the most relevant passages keeps your prompt within the LLM’s context window (e.g., 4,096 tokens) and reduces hallucinations.
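As a rough sketch of that idea, retrieved passages (ranked most-relevant first) can be kept only while they fit a budget. Here word count stands in for real tokenization, and all names are illustrative:

```python
def select_passages(passages, budget=100):
    """Greedily keep ranked passages while their total word count fits the budget."""
    selected, used = [], 0
    for passage in passages:
        cost = len(passage.split())  # crude stand-in for a token count
        if used + cost > budget:
            break
        selected.append(passage)
        used += cost
    return selected

ranked = ["short fact " * 10, "longer passage " * 40, "tail " * 5]
kept = select_passages(ranked, budget=60)
print(len(kept))  # only the passages that fit the budget survive
```

A production system would use the model's actual tokenizer for the budget, but the trimming logic is the same.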

What Is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a two-step framework:

  1. Retrieve relevant context from external data sources.
  2. Augment the LLM prompt with those facts before generation.

The image illustrates the concept of Retrieval Augmented Generation (RAG), showing a flow between external data sources and a large language model (LLM) to provide additional facts.

Why Not Send the Entire Document?

  • Context window limits: LLMs can’t process an entire 100-page PDF in one go.
  • Efficiency: Fetching only necessary chunks saves on compute and token usage.
  • Explainability: You can cite the exact source of each fact.
  • Privacy & Control: Only selected passages are exposed to the model.

RAG Workflow Overview

A typical RAG pipeline consists of five steps:

  1. User Query: The user asks a question.
  2. Search: The system queries databases or document stores.
  3. Context Injection: Retrieved passages are inserted into the prompt.
  4. LLM Call: The augmented prompt is sent to the LLM.
  5. Response: The model generates an answer, which is returned to the user.
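The five steps above can be sketched as plain functions. The search is a naive keyword match and the LLM call is stubbed out; every name here is hypothetical:

```python
def search(query, documents, k=2):
    """Step 2: naive keyword scoring standing in for a real retriever."""
    scored = [(sum(word in doc.lower() for word in query.lower().split()), doc)
              for doc in documents]
    scored.sort(reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(query, passages):
    """Step 3: inject retrieved context into the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def call_llm(prompt):
    """Step 4: placeholder for the actual LLM call."""
    return f"(answer grounded in {prompt.count('- ')} passage(s))"

docs = ["RAG reduces hallucinations.", "Vector stores index embeddings.",
        "Bananas are yellow."]
query = "How does RAG help with hallucinations?"            # Step 1
answer = call_llm(build_prompt(query, search(query, docs)))  # Steps 2-5
print(answer)                                                # Step 5
```

Only the relevant document reaches the prompt; the unrelated ones are filtered out before the model ever sees them.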

The image illustrates a RAG (Retrieval-Augmented Generation) workflow, showing the interaction between a user, a chatbot, a large language model (LLM), and databases/documents for search and retrieval.

Two Phases of RAG

Phase 1: Indexing

  1. Load unstructured documents (PDFs, HTML, JSON, images).
  2. Split each document into manageable chunks (sentences, paragraphs, or token windows).
  3. Embed each chunk via an embeddings model, producing numerical vectors.
  4. Store these vectors in a vector database (e.g., Chroma, Milvus, Weaviate).

Tip

Choosing the right chunk size (e.g., 500 tokens) balances retrieval precision and recall.
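To see what chunk size and overlap mean concretely, here is a character-level sketch (real splitters work on tokens or recursive separators; the sizes are illustrative):

```python
def split_with_overlap(text, chunk_size=20, overlap=5):
    """Slide a window of chunk_size characters, each step reusing `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Retrieval-augmented generation grounds answers in your data."
chunks = split_with_overlap(text)
for chunk in chunks:
    print(repr(chunk))
# Adjacent chunks share 5 characters, so content near a split
# point is never lost from both neighbors at once.
```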

from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Load documents
loader = UnstructuredPDFLoader("path/to/doc.pdf")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 3. Create embeddings and store
embeddings = OpenAIEmbeddings()
vector_db = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db")
vector_db.persist()

The image illustrates "RAG – Phase 1" with a flowchart showing four stages: Load, Split, Embed, and Store, with arrows indicating the process flow. The final stage, Store, contains numerical data arrays.

Phase 2: Retrieval

  1. Embed the user’s query using the same embeddings model.
  2. Perform a semantic search against stored vectors in the vector database.
  3. Retrieve the top-k matching chunks.
  4. Augment the original prompt with those chunks.
  5. Generate the final answer via the LLM.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Create a retriever over the vector store built in Phase 1
retriever = vector_db.as_retriever(search_kwargs={"k": 5})

# Build a RAG chain ("stuff" packs all retrieved chunks into a single prompt)
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4"),  # gpt-4 is a chat model, so use ChatOpenAI
    chain_type="stuff",
    retriever=retriever
)

# Ask a question
response = qa_chain.run("What are the main benefits of RAG?")
print(response)

The image illustrates the RAG (Retrieval-Augmented Generation) Phase 2 process, showing a flow from a question to retrieval, context, prompt, and then to a large language model (LLM).

Building a RAG Pipeline with LangChain

LangChain offers modular components to assemble a complete RAG workflow:

| Component         | Description                                                       |
| ----------------- | ----------------------------------------------------------------- |
| Document Loaders  | Fetch external data (PDFs, web pages, APIs)                       |
| Text Splitters    | Break documents into searchable chunks                            |
| Embeddings Models | Convert text chunks into semantic vectors                         |
| Vector Stores     | Store and index vectors for efficient similarity search           |
| Retrievers        | Query vector stores to retrieve the most relevant document chunks |
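At the heart of the Embeddings Model + Vector Store pairing is nearest-neighbor search over vectors. A toy version with hand-made 3-dimensional embeddings (real models produce hundreds or thousands of dimensions, and real stores use approximate indexes rather than a full scan):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical store: each chunk is already embedded as a vector.
store = {
    "RAG injects retrieved facts into prompts": [0.9, 0.1, 0.0],
    "Chroma persists vectors on disk":          [0.1, 0.9, 0.1],
    "LLMs hallucinate without context":         [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # embedding of the user's question

best = max(store, key=lambda chunk: cosine(query_vec, store[chunk]))
print(best)
```

The chunk whose vector points in the most similar direction wins, which is exactly what `retriever.get_relevant_documents` does at scale.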

The image is about "RAG With LangChain" and shows a diagram with icons representing document loaders and external data sources, including a web search and PDF.

When these modules are connected, you get a robust RAG pipeline. Next, we’ll build a Q&A application that ingests both a PDF and a web page.

The image shows a graphic of a computer screen with a cube icon and a disc, labeled "RAG Workflow," with the text "RAG With LangChain" at the top.

Let’s jump into the demo!
