RAG with PDFs

In this guide, you’ll learn how to load PDFs, chunk the documents, generate embeddings, and query a vector store to build a Retrieval-Augmented Generation (RAG) pipeline with LangChain.

Warning

Make sure your OPENAI_API_KEY is set in the environment before running any of the code snippets.
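If the key isn’t already exported in your shell, one common pattern is to set it from Python at runtime; a minimal sketch using only the standard library:

import getpass
import os

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")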


1. Install Dependencies

pip install langchain langchain-community langchain-openai langchain-chroma pypdf

(pypdf is the parser that PyPDFLoader relies on.)

2. Import Required Modules

We’ll use the following modules:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

3. Load and Chunk the PDF

3.1 Load the PDF

loader = PyPDFLoader("data/handbook.pdf")
pages = loader.load()  # one Document per page; we split explicitly in the next step
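PyPDFLoader produces one Document per page, each carrying source and page metadata. A quick sanity check (output depends on your PDF; data/handbook.pdf is the sample path used above):

print(len(pages))                    # number of pages loaded
print(pages[0].metadata)             # e.g. {'source': 'data/handbook.pdf', 'page': 0}
print(pages[0].page_content[:200])   # preview the first page's text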

3.2 Split into Overlapping Chunks

Note

Adjust chunk_size and chunk_overlap based on your document length and the LLM’s context window.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
chunks = text_splitter.split_documents(pages)
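Before embedding, it’s worth confirming the splitter produced sensible chunks:

print(len(chunks))             # total number of chunks
print(chunks[0].page_content)  # inspect the first chunk; page metadata is preserved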

4. Generate Embeddings & Build Vector Store

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)
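You can query the store directly to verify that retrieval surfaces relevant text before wiring up the full chain (the query string below is just an example):

results = vectorstore.similarity_search("sick leave policy", k=2)
for doc in results:
    print(doc.metadata, doc.page_content[:100])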

5. Configure Retriever & Formatter

Convert the vector store to a retriever and define a helper to merge relevant chunks:

retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
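By default as_retriever() returns the top 4 matches per query; you can tune this with search_kwargs. Since retrievers are runnables in recent LangChain versions, you can also test one directly with invoke() (the query is illustrative):

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top-4 chunks per query
docs = retriever.invoke("sick leave policy")
print(format_docs(docs)[:300])  # preview the merged context string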

6. Define the LLM and Prompt Template

We’ll define a prompt that instructs the model to answer only from the retrieved context:

llm = ChatOpenAI()
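ChatOpenAI() falls back to the library’s default chat model; to make runs more reproducible you can pin a model and set the temperature to 0 (the model name here is illustrative):

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)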

template = """
SYSTEM: You are a question-answering assistant.
Use only the provided context to answer.
If you don’t know, respond with “I don’t know.”
QUESTION: {question}
CONTEXT:
{context}
"""
prompt = PromptTemplate.from_template(template)
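You can render the template with placeholder values to inspect the exact prompt the model will receive:

print(prompt.format(question="What's the sick leave policy?", context="<retrieved chunks go here>"))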

7. Assemble the RAG Chain

Chain steps:

  1. Retrieve relevant chunks
  2. Format them into a context string
  3. Inject into the prompt template
  4. Generate a factual answer
  5. Parse output to string

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

8. Query the Chain

chain.invoke("How many sick leaves are allowed in a year?")
chain.invoke("How many unpaid personal leaves are allowed in a year?")
chain.invoke("What's the sick leave policy?")
# 'You are eligible for 10 days of paid sick leave per year, which can be used for any illness or injury that prevents you from working.'
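Because the assembled chain is an LCEL runnable, it also supports token streaming out of the box:

for token in chain.stream("What's the sick leave policy?"):
    print(token, end="", flush=True)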

Summary of Steps

| Step | Action | Description |
|------|--------|-------------|
| 1 | Install & Import | Install langchain; import loaders, splitters, embeddings, etc. |
| 2 | Load PDF | Use PyPDFLoader |
| 3 | Chunk Documents | Split pages with RecursiveCharacterTextSplitter |
| 4 | Embed & Store Vectors | Generate embeddings and index with Chroma.from_documents |
| 5 | Create Retriever | Convert the vector store into a retriever |
| 6 | Format Context | Define the format_docs helper |
| 7 | Define Prompt & LLM | Set up PromptTemplate and ChatOpenAI |
| 8 | Build & Invoke Chain | Assemble the LCEL chain and query with chain.invoke() |
