A tutorial demonstrating a reusable PDFParser that extracts selectable text, splits it into RAG-friendly chunks, emits metadata, and provides usage examples and OCR guidance.
Earlier we built a Python script to parse and chunk DOCX files. In this lesson we’ll add a reusable PDFParser to the toolkit so your retrieval-augmented generation (RAG) pipeline can handle PDFs as well.
Why this matters for RAG:
A useful knowledge base must handle multiple document types.
PDFs vary widely: some contain selectable text, others are scanned images that require OCR.
This parser extracts selectable text (when present), splits it into context-aware chunks, and emits structured metadata to feed an embedding / vector store pipeline.
Install the required packages in your virtual environment:
pip install PyPDF2 langchain
PyPDF2 extracts selectable text from PDF files. If your PDF is a scanned image (no selectable text), add an OCR preprocessing step such as Tesseract with pytesseract before feeding the text into the chunker.
This module provides a compact, reusable class that:
extracts selectable text from PDFs,
splits text into chunks optimized for RAG ingestion,
returns structured metadata for each chunk,
optionally writes a human-readable dump for inspection.
Save the file as pdf_parser.py.
# pdf_parser.py
from typing import List, Dict, Any
from pathlib import Path

import PyPDF2
# The splitter lives in langchain.text_splitter (singular); recent LangChain
# releases also ship it as the separate langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter


class PDFParser:
    """Parse PDF files and prepare them for RAG ingestion."""

    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        """
        Initialize the PDF parser.

        Args:
            chunk_size: Maximum size of each text chunk in characters.
            chunk_overlap: Number of characters to overlap between chunks.
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Separators are tried in order: paragraphs, lines, sentences,
        # then spaces, then individual characters as a last resort.
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", ". ", " ", ""],
        )

    def extract_text_from_pdf(self, pdf_path: str) -> str:
        """
        Extract all selectable text from a PDF file.

        Args:
            pdf_path: Path to the PDF file.

        Returns:
            Extracted text as a single string.

        Raises:
            FileNotFoundError: If the PDF file doesn't exist.
            Exception: If there's an error reading the PDF.
        """
        path = Path(pdf_path)
        if not path.exists():
            raise FileNotFoundError(f"PDF file not found: {pdf_path}")

        text_parts: List[str] = []
        try:
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                for page in reader.pages:
                    # extract_text() may yield no text for some pages; skip those
                    page_text = page.extract_text()
                    if page_text:
                        text_parts.append(page_text)
        except Exception as e:
            # Chain the original error so the root cause stays in the traceback
            raise Exception(f"Error reading PDF: {e}") from e

        # Join pages with a double newline to preserve some structure
        return "\n\n".join(text_parts).strip()

    def chunk_text(self, text: str) -> List[str]:
        """
        Split text into chunks suitable for RAG ingestion.

        Args:
            text: The text to chunk.

        Returns:
            List of text chunks.
        """
        if not text:
            return []
        return self.text_splitter.split_text(text)

    def parse_pdf(self, pdf_path: str) -> List[Dict[str, Any]]:
        """
        Parse a PDF and return chunked text with metadata.

        Args:
            pdf_path: Path to the PDF file.

        Returns:
            List of dictionaries containing chunk text and metadata.
        """
        text = self.extract_text_from_pdf(pdf_path)
        chunks = self.chunk_text(text)
        source = str(Path(pdf_path).resolve())
        total_chunks = len(chunks)

        results: List[Dict[str, Any]] = []
        for idx, chunk in enumerate(chunks):
            results.append({
                "chunk_id": idx,
                "text": chunk,
                "source": source,
                "chunk_size": len(chunk),
                "total_chunks": total_chunks,
            })
        return results

    def parse_pdf_to_file(self, pdf_path: str, output_path: str) -> None:
        """
        Parse a PDF and save the chunked results to a text file for inspection.

        Args:
            pdf_path: Path to the PDF file.
            output_path: Path to write the output text file.
        """
        results = self.parse_pdf(pdf_path)
        with open(output_path, "w", encoding="utf-8") as f:
            f.write("PDF Parser Results\n")
            f.write(f"Source: {pdf_path}\n")
            f.write(f"Total Chunks: {len(results)}\n")
            f.write("-" * 80 + "\n\n")
            for result in results:
                f.write(f"Chunk {result['chunk_id'] + 1}/{result['total_chunks']}\n")
                f.write(f"Size: {result['chunk_size']} characters\n")
                f.write("-" * 80 + "\n")
                f.write(result["text"].strip() + "\n\n")
                f.write("=" * 80 + "\n\n")
Key design notes:
The parser raises a clear FileNotFoundError if the PDF path is invalid.
PyPDF2's page.extract_text() may return no text (None or an empty string) for pages without selectable content; those pages are skipped to avoid errors.
The text splitter prioritizes paragraph and sentence boundaries before falling back to spaces, producing readable chunks for embeddings.
For scanned/image-only PDFs, add OCR beforehand (see the warning below).
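To make the separator priority in the notes above concrete, here is a simplified sketch of the idea in plain Python. It is an illustration only, not LangChain's implementation: the function name recursive_split is hypothetical, overlap is left out, and the separator is dropped wherever a cut is made.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Simplified illustration of separator-priority splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is still too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer. It has two sentences."
for chunk in recursive_split(sample, 30, ["\n\n", "\n", ". ", " ", ""]):
    print(repr(chunk))
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long, so it is cut at the sentence boundary rather than in the middle of a word.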
Table: PDFParser public methods
Method | Purpose | Returns
extract_text_from_pdf(pdf_path) | Extract selectable text from PDF pages | str (joined page text)
chunk_text(text) | Split a long text string into RAG-friendly chunks | List[str]
parse_pdf(pdf_path) | Full pipeline: extract, chunk, and return metadata-enhanced results | List[Dict[str, Any]]
parse_pdf_to_file(pdf_path, output_path) | Save a human-readable dump of chunks to a text file | None
This is an example of the console output when running python main.py with a small sample PDF that produces 3 chunks.
Example 1: Parsing PDF to structured data
--------------------------------------------------------------------------------
Successfully parsed: sample.pdf
Total chunks created: 3
First chunk preview:
The Fable of Fiona the Fussy Feline Fiona, a tuxedo cat of considerable fluff and questionable temper, believed the world revolved entirely around the timely provision of tuna in springwater, not brin...
Chunks saved to: chunks.json
================================================================================
Example 2: Parsing PDF to text file
--------------------------------------------------------------------------------
Chunks saved to: output.txt
================================================================================
Example 3: Custom chunking parameters
--------------------------------------------------------------------------------
Chunks with size=500: 7
Chunks with size=2000: 2
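The driver script is not reproduced in this section. As a rough sketch, a helper along the following lines could produce the first two examples in the output above. The function name run_examples, the 200-character preview length, and the out_dir parameter are assumptions made for illustration; the parser argument is expected to offer parse_pdf and parse_pdf_to_file as defined in pdf_parser.py.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def run_examples(parser: Any, pdf_path: str, out_dir: str = ".") -> List[Dict[str, Any]]:
    """Parse one PDF, print a short summary, and write chunks.json and output.txt."""
    out = Path(out_dir)

    # Example 1: structured data, saved as JSON for the embedding step.
    results = parser.parse_pdf(pdf_path)
    print(f"Successfully parsed: {pdf_path}")
    print(f"Total chunks created: {len(results)}")
    if results:
        print("First chunk preview:")
        print(results[0]["text"][:200] + "...")  # preview length is an assumption

    json_path = out / "chunks.json"
    json_path.write_text(
        json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    print(f"Chunks saved to: {json_path}")

    # Example 2: human-readable dump for manual inspection.
    txt_path = out / "output.txt"
    parser.parse_pdf_to_file(pdf_path, str(txt_path))
    print(f"Chunks saved to: {txt_path}")
    return results
```

With the class from pdf_parser.py imported, a call such as run_examples(PDFParser(), "sample.pdf") would exercise both paths; passing PDFParser(chunk_size=500) or PDFParser(chunk_size=2000) instead would reproduce the kind of comparison shown in Example 3.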
This parser processes selectable text only. For scanned PDFs you must run OCR before parsing.
Tune chunk_size and chunk_overlap to your model and embedding limits: smaller chunks increase recall but carry less context each, while larger chunks carry more context but use more embedding tokens per chunk.
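When comparing settings, a back-of-the-envelope estimate of the chunk count can help. The sketch below assumes a plain sliding window that advances by chunk_size minus chunk_overlap characters; the recursive splitter cuts at separators, so real counts will differ somewhat. The function name estimate_chunk_count is hypothetical.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Estimate chunk count for a fixed-stride sliding window (approximation only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((text_length - chunk_size) / stride)


# Compare a few settings for a 5,000-character document.
for size, overlap in [(500, 100), (1000, 200), (2000, 200)]:
    print(size, overlap, estimate_chunk_count(5000, size, overlap))
```

Under this approximation the default settings (1000 and 200) give 6 chunks for 5,000 characters, because the first chunk covers 1,000 characters and each later chunk adds 800 new ones.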
The chunks.json file is ready to be embedded and stored in your vector DB for RAG use cases.
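Most vector stores take parallel lists of IDs, texts, and metadata, so a small loading step is usually needed first. The sketch below assumes chunks.json contains the list returned by parse_pdf, serialized as-is; the function name load_chunks_for_embedding and the ID format are illustrative choices, and the embedding and insert calls themselves depend on the store you use.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Tuple


def load_chunks_for_embedding(
    json_path: str,
) -> Tuple[List[str], List[str], List[Dict[str, Any]]]:
    """Load chunks.json and return parallel lists of ids, texts, and metadata."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))

    ids: List[str] = []
    texts: List[str] = []
    metadatas: List[Dict[str, Any]] = []
    for rec in records:
        text = rec["text"].strip()
        if not text:
            continue  # nothing to embed
        # Build a stable ID from the file name and chunk index (assumed scheme).
        ids.append(f"{Path(rec['source']).name}-{rec['chunk_id']}")
        texts.append(text)
        # Everything except the text itself travels along as metadata.
        metadatas.append({k: v for k, v in rec.items() if k != "text"})
    return ids, texts, metadatas
```

Keeping the source path and chunk index in the metadata makes it possible to trace a retrieved chunk back to its document later.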
If a PDF contains no selectable text (for example, scanned pages), PyPDF2 will not extract the content. In that case, perform OCR (Tesseract + pytesseract or a cloud OCR service) to produce text before using this parser. Consider adding an automatic OCR fallback for production pipelines.
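One way to structure such a fallback is to keep the OCR engine behind a callable, so the parser logic stays testable without Tesseract installed. The sketch below is an assumption-laden outline: fill_pages_with_ocr, the ocr_page callable, and the min_chars threshold of 20 are all hypothetical, and the threshold would need tuning on real documents.

```python
from typing import Callable, List, Optional


def fill_pages_with_ocr(
    page_texts: List[Optional[str]],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> List[str]:
    """Return one text per page, calling ocr_page(index) for sparse pages."""
    filled: List[str] = []
    for index, page_text in enumerate(page_texts):
        text = (page_text or "").strip()
        if len(text) < min_chars:
            # Too little selectable text: treat the page as image-only.
            text = ocr_page(index).strip()
        filled.append(text)
    return filled
```

Here page_texts would be the per-page results of extract_text(), and ocr_page could, for instance, render the page at the given index to an image and pass it to pytesseract.image_to_string. The rendering step is left out because it depends on which PDF-to-image tool you choose.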
If you want to extend this parser, consider:
Adding page-level metadata (page numbers, section headings if extractable).
Detecting image-only pages and automatically running OCR.
Integrating directly with an embeddings pipeline (e.g., to automatically insert chunks into a vector store).
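As a starting point for the first extension, page numbers can be attached by chunking each page separately instead of joining all pages first. This is a minimal sketch under that assumption; the function name chunk_pages and the "page" key are hypothetical, and split stands in for any callable that turns a string into a list of chunks, such as the parser's chunk_text method.

```python
from typing import Any, Callable, Dict, List, Optional


def chunk_pages(
    page_texts: List[Optional[str]],
    split: Callable[[str], List[str]],
    source: str,
) -> List[Dict[str, Any]]:
    """Chunk each page on its own so every chunk records its 1-based page number."""
    records: List[Dict[str, Any]] = []
    for page_number, page_text in enumerate(page_texts, start=1):
        for chunk in split(page_text or ""):
            records.append({
                "chunk_id": len(records),  # running index across the whole file
                "page": page_number,
                "text": chunk,
                "source": source,
                "chunk_size": len(chunk),
            })
    return records
```

The trade-off is that chunks no longer span page boundaries, so a sentence split across two pages ends up in two chunks. Whether that matters depends on how your documents are laid out.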