RAG with PDFs

In this guide, you’ll learn how to load PDFs, chunk the documents, generate embeddings, and query a vector store to build a Retrieval-Augmented Generation (RAG) pipeline with LangChain.

Warning

Make sure your OPENAI_API_KEY is set in the environment before running any of the code snippets.
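If the key isn’t already exported in your shell, one common pattern is to set it from Python at runtime; a minimal sketch using only the standard library:

import getpass
import os

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")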


1. Install Dependencies

pip install langchain langchain-community langchain-openai langchain-chroma pypdf

(pypdf is the parser that PyPDFLoader relies on.)

2. Import Required Modules

We’ll use the following modules:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

3. Load and Chunk the PDF

3.1 Load the PDF

loader = PyPDFLoader("data/handbook.pdf")
pages = loader.load()  # one Document per page; we split explicitly in the next step
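PyPDFLoader produces one Document per page, each carrying source and page metadata. A quick sanity check (output depends on your PDF; data/handbook.pdf is the sample path used above):

print(len(pages))                    # number of pages loaded
print(pages[0].metadata)             # e.g. {'source': 'data/handbook.pdf', 'page': 0}
print(pages[0].page_content[:200])   # preview the first page's text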

3.2 Split into Overlapping Chunks

Note

Adjust chunk_size and chunk_overlap based on your document length and the LLM’s context window.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
chunks = text_splitter.split_documents(pages)
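Before embedding, it’s worth confirming the splitter produced sensible chunks:

print(len(chunks))             # total number of chunks
print(chunks[0].page_content)  # inspect the first chunk; page metadata is preserved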

4. Generate Embeddings & Build Vector Store

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)
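You can query the store directly to verify that retrieval surfaces relevant text before wiring up the full chain (the query string below is just an example):

results = vectorstore.similarity_search("sick leave policy", k=2)
for doc in results:
    print(doc.metadata, doc.page_content[:100])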

5. Configure Retriever & Formatter

Convert the vector store to a retriever and define a helper to merge relevant chunks:

retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
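By default as_retriever() returns the top 4 matches per query; you can tune this with search_kwargs. Since retrievers are runnables in recent LangChain versions, you can also test one directly with invoke() (the query is illustrative):

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top-4 chunks per query
docs = retriever.invoke("sick leave policy")
print(format_docs(docs)[:300])  # preview the merged context string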

6. Define the LLM and Prompt Template

We’ll define a prompt that instructs the model to answer only from the retrieved context:

llm = ChatOpenAI()
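ChatOpenAI() falls back to the library’s default chat model; to make runs more reproducible you can pin a model and set the temperature to 0 (the model name here is illustrative):

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)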

template = """
SYSTEM: You are a question-answering assistant.
Use only the provided context to answer.
If you don’t know, respond with “I don’t know.”
QUESTION: {question}
CONTEXT:
{context}
"""
prompt = PromptTemplate.from_template(template)
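You can render the template with placeholder values to inspect the exact prompt the model will receive:

print(prompt.format(question="What's the sick leave policy?", context="<retrieved chunks go here>"))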

7. Assemble the RAG Chain

Chain steps:

  1. Retrieve relevant chunks
  2. Format them into a context string
  3. Inject into the prompt template
  4. Generate a factual answer
  5. Parse output to string

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

8. Query the Chain

chain.invoke("How many sick leaves are allowed in a year?")
chain.invoke("How many unpaid personal leaves are allowed in a year?")
chain.invoke("What's the sick leave policy?")
# 'You are eligible for 10 days of paid sick leave per year, which can be used for any illness or injury that prevents you from working.'
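Because the assembled chain is an LCEL runnable, it also supports token streaming out of the box:

for token in chain.stream("What's the sick leave policy?"):
    print(token, end="", flush=True)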

Summary of Steps

| Step | Action | Description |
|------|--------|-------------|
| 1 | Install & Import | Install langchain; import loaders, splitters, embeddings, etc. |
| 2 | Load PDF | Use PyPDFLoader |
| 3 | Chunk Documents | Split pages with RecursiveCharacterTextSplitter |
| 4 | Embed & Store Vectors | Generate embeddings and index with Chroma.from_documents |
| 5 | Create Retriever | Convert the vector store into a retriever |
| 6 | Format Context | Define the format_docs helper |
| 7 | Define Prompt & LLM | Set up PromptTemplate and ChatOpenAI |
| 8 | Build & Invoke Chain | Assemble the LCEL chain and query with chain.invoke() |
