- A reusable
DocumentChunkerimplementation (complete file below). - Practical examples and CLI invocations for each chunking strategy.
- Guidance on combining structural and size-limiting approaches for best results.
- Links to tokenizers and parsing libraries for production use.

Overview of chunking strategies
Below are the key strategies demonstrated by theDocumentChunker class. Each method trades off semantic alignment, chunk size control, and continuity across boundaries.
| Strategy | Best for | Notes / Typical options |
|---|---|---|
| Line-by-line | Fixed-record formats, logs | --max-lines <n> groups N lines per chunk. Low semantic awareness. |
| Fixed-size with overlap | Token-limited models | --chunk-size <chars> and --overlap <chars> preserve some context across boundaries. |
| Sliding-window | Smooth overlap for retrieval | --window-size <chars> and --step-size <chars> produce overlapping windows. |
| Sentence-based | Linguistic atomicity | --max-sentences <n> preserves sentence boundaries; sensitive to punctuation. |
| Paragraph-based | Natural semantic grouping | --max-paragraphs <n> relies on blank-line separators. |
| Page-based | PDFs / DOCX with page structure | --lines-per-page <n> or direct page extraction via parser. |
| Section / Heading-based | Structured docs (Markdown) | --heading-pattern '<regex>' splits by headings (e.g., ^\s*#{1,6}\s+). |
| Token-based | Model-aware chunking | --max-tokens <n> requires a tokenizer (e.g., tiktoken, HF tokenizers). |
- tiktoken: https://github.com/openai/tiktoken
- Hugging Face tokenizers: https://huggingface.co/docs/tokenizers/index
- PDF parsing libraries: PyPDF2, pdfplumber, PyMuPDF
- DOCX parsing: python-docx
1) Line-by-line chunking
Line-by-line chunking groups a fixed number of lines into each chunk. This method is reliable when records are line-oriented (logs, structured exports), but it has no semantic awareness — chunks can split sentences or paragraphs arbitrarily. Example CLI:- When input data maps to fixed-record line blocks.
- As a low-level primitive combined with semantic grouping.

2) Fixed-size chunking with overlap
Fixed-size chunking splits text into character-range chunks of a fixed length. Overlap preserves context when a semantic unit spans a boundary, which helps retrieval and question-answering tasks. Example CLI:- When you must guarantee a maximum chunk size for model input.
- Use overlap to reduce information loss across chunk boundaries.
3) Sliding-window chunking
Sliding-window chunking creates overlapping windows of a fixed size and advances by a step. Compared to naive fixed-size with overlap, sliding windows are often easier to reason about because overlap is controlled by the step size. Example CLI:- Boundaries (e.g., headings, TOC entries) will appear in multiple chunks, enabling robust retrieval or reranking.
4) Sentence-based chunking
Sentence-based chunking splits text into sentences and groups a fixed number of sentences per chunk. This yields linguistically coherent chunks but depends on accurate sentence splitting. Example CLI (max 3 sentences per chunk):- Short sentences produce small chunks; consider grouping more sentences or applying a minimum character or token threshold.
- For production, swap the simple regex splitter with a robust sentence tokenizer (e.g., spaCy).

5) Paragraph chunking
Paragraph-based chunking groups text at blank-line boundaries. Paragraphs are often good semantic units for many narrative or documentation-style sources. Example CLI:- This method requires clear paragraph delimiters. Preprocessing is sometimes needed if paragraphs are not separated by blank lines (e.g., in OCR output).
6) Page chunking (useful for PDFs and DOCX)
Page chunking uses page boundaries or approximated line ranges to preserve page-level layout. This is essential when headers, footers, or figures belong to a specific page. Usage notes:- The demo includes a 10-page DOCX. The chunker reports pages sequentially.
- Example metadata for the last page:
- Legal, academic, or scanned documents where page context matters.
- Combine with per-page OCR or page-aware parsing for better fidelity.
7) Section / Heading-based chunking (Markdown example)
Heading-based chunking splits by headings and preserves logical document structure — ideal for manuals, specs, and Markdown content. Example CLI (Markdown headings):- Adjust the
--heading-patternregex for your document format (e.g., HTML headers, reStructuredText). - Use this method first, then apply sentence, paragraph, or token-based limits inside sections.
Combining strategies & production tips
There is no one-size-fits-all chunker. Common, practical strategies include:- Run heading/section-based splitting first to preserve logical units, then apply token-based or fixed-size chunking within each section to respect model input limits.
- Use sliding windows or fixed overlap to ensure context continuity across boundaries when retrieval quality matters.
- Replace the naive token splitter with a production tokenizer:
tiktokenfor OpenAI models or Hugging Face tokenizers for other models.
- Tokenizers: tiktoken, Hugging Face tokenizers
- PDF parsing: pdfplumber, PyMuPDF
- DOCX parsing: python-docx
Best practice: combine structural chunking (sections/headings/pages) with size-limiting chunking (token or fixed-size with overlap). This preserves semantics while respecting model input limits.
document_chunker.py file shown above and the sample documents used in these examples.