This guide covers how to prepare text for LLM ingestion by comparing chunking techniques, explaining why chunking matters, and showing how to pick the best approach for retrieval-augmented generation (RAG) systems.
Why chunk at all?
LLMs have finite context windows (commonly ~4k to ~128k tokens depending on the model). Entire knowledge bases or very long documents rarely fit into one prompt. Even when they do, large documents slow retrieval and dilute relevant signals with noise, making it harder for the model to find the right context.
Practical benefits of chunking:
  • Respect model context limits (tokens).
  • Speed up retrieval and ranking (smaller units are faster to search).
  • Reduce noise and improve recall by limiting irrelevant text inside any one context window.
Anatomy of a “good” chunk
A chunk is a tradeoff: small enough to be searchable and to fit in your target context window, but large enough to preserve semantic meaning. There is no universal “perfect” chunk, only the right choice for your use case. A strong chunk usually exhibits:
  • Coherence: self-contained and semantically complete where possible (avoid cutting mid-sentence or mid-thought).
  • Appropriate size: fits your target token limit while retaining required concepts.
  • Overlap: intentional overlap with adjacent chunks (commonly ~10–20% or 1–2 sentences) so multi-sentence concepts don’t disappear across boundaries.
  • Natural boundaries: prefer paragraph, heading, code block, or section boundaries when available.
  • Provenance metadata: include source, document id, section/title, page number, timestamps, etc., so retrieved chunks can be traced to their origin.
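The provenance fields listed above can travel with each chunk as a small record. The following is a minimal sketch, not a prescribed schema: the `Chunk` class, its field names, and the sample values are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """A piece of text plus the metadata needed to trace it back to its source."""
    text: str
    source: str                      # e.g. file path or URL of the original document
    doc_id: str                      # stable identifier for the document
    chunk_index: int                 # position of this chunk within the document
    section: Optional[str] = None    # nearest heading or title, if known
    page: Optional[int] = None       # page number, for paginated sources

    def citation(self) -> str:
        """Build a short human-readable reference for this chunk."""
        parts = [self.source]
        if self.section:
            parts.append(f"section '{self.section}'")
        if self.page is not None:
            parts.append(f"p. {self.page}")
        return ", ".join(parts)


# Hypothetical sample values, for illustration only.
chunk = Chunk(
    text="Chunking splits documents into retrievable units.",
    source="guides/chunking.md",
    doc_id="doc-001",
    chunk_index=0,
    section="Why chunk at all?",
)
print(chunk.citation())
```

Storing the metadata next to the text means a retrieved chunk can be cited or traced without a second lookup.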
[Figure: "Anatomy of a Perfect Chunk", showing four elements with brief descriptions: Coherence, Size (Tokens), Overlap, and Natural Boundaries.]
Typical chunk sizes by use case
| Use case | Typical chunk size (tokens) | Notes |
| --- | --- | --- |
| Q&A / retrieval | 256–512 | Good for short factual lookups and fast ranking. |
| Summarization / single-document synthesis | 1k–4k | Larger chunks preserve longer context for summarization. |
| Long documents / book-scale | 512–1k | Combine chunking with retrieval or multi-hop approaches. |
Always pick sizes informed by your model’s context window and retrieval needs. Use the same tokenizer as your model to measure token counts accurately.
Core chunking strategies
  1. Fixed-size chunking
  • What it is: Split text into fixed N-character or N-token chunks, optionally with a fixed overlap.
  • When to use: Quick ingestion pipelines, or when document structure is unavailable and speed/predictability is critical.
  • Pros: Simple, predictable, and fast to compute.
  • Cons: Semantically blind — may break sentences and reduce coherence.
Example (fixed-size chunking with overlap):
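chunk_size = 1000  # characters or tokens, depending on your splitter
overlap = 100      # units shared between consecutive chunks (10% here)
chunks = split_fixed(text, chunk_size, overlap)  # split_fixed stands for whatever fixed-size splitter you use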
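The example above does not define `split_fixed`, so here is one way such a helper could be written, as a minimal sketch. The signature and the character-based windowing are assumptions; a token-based version would slice a list of token ids in the same way.

```python
def split_fixed(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap          # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                        # this window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_fixed(text, chunk_size=10, overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Note how each chunk begins with the last two characters of the previous one. The splitter is blind to word and sentence boundaries, which is exactly the weakness described above.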
[Figure: slide on "Fixed-Size Chunking". Advantages: simplicity and fast processing. Disadvantages: destroys coherence and carries a high risk of breaking sentences or words mid-way.]
  2. Context-aware splits (paragraph / sentence splitting)
  • What it is: Use natural textual delimiters — paragraphs, sentences, and line breaks — to create chunks.
  • Pros: Preserves linguistic boundaries and improves coherence compared to fixed-size splits. Easy to implement with standard NLP tools.
  • Cons: Chunk sizes vary; long paragraphs may still exceed token limits and require further splitting.
Best practices:
  • Combine paragraph/sentence splitting with a tokenizer check to ensure chunks fit your token budget.
  • Apply small overlaps (1–2 sentences) to prevent loss of cross-boundary context.
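The two practices above (a size check plus a small sentence overlap) can be combined as follows. This is a sketch under stated assumptions: the regex sentence splitter is deliberately naive, and word counts stand in for token counts, which should come from your model's tokenizer in practice. The function names are illustrative.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text: str, max_words: int, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly max_words words.

    Each new chunk starts with the last `overlap_sentences` sentences of the previous one.
    A single sentence longer than max_words is kept whole and needs further splitting.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        size = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and size > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Chunking matters. It keeps context small. Retrieval gets faster. Noise goes down."
for chunk in chunk_sentences(text, max_words=7, overlap_sentences=1):
    print(chunk)
```

Because the overlap is carried before the next sentence is added, a chunk can slightly exceed the budget. A stricter version would re-check the size after carrying the overlap.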
[Figure: context-aware splitting with natural delimiters, comparing raw text, a paragraph split, and a sentence split.]
[Figure: slide on "Context-Aware Splits". Advantages: better coherence and respect for linguistic boundaries. Disadvantage: variable chunk sizes.]
  3. Recursive split (recommended default)
  • What it is: Multi-pass splitting that preserves the largest natural units first and only splits deeper when chunks exceed size constraints.
  • Typical pass order: headings/sections → paragraphs → lines → sentences → words.
  • Why use it: Balances coherence and size constraints by adapting to document structure and avoiding blind truncation.
  • When to use: Default for most RAG pipelines; as a rough rule of thumb, it is a strong choice for about 80% of use cases.
Conceptual recursive splitting flow:
  • Try to split by paragraph/section boundaries.
  • If a chunk still exceeds token limits, split by line breaks.
  • If still too large, split into sentences.
  • As a last resort, split by words or fixed token slices.
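The flow above can be sketched in a few lines. This is an illustrative implementation, not a specific library's algorithm: the separator list, the character-based size limit, and the function name are assumptions, and a production version would measure tokens instead of characters.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")  # paragraphs, lines, sentences, words


def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones only when needed."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: fixed-size slices.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate              # keep merging small pieces together
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too large: split it with the next finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


text = ("Intro paragraph.\n\n"
        "Second paragraph line one.\n"
        "Second paragraph line two is a bit longer.")
chunks = recursive_split(text, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

In this example the first paragraph and the first line of the second paragraph survive intact, and only the overlong final line is broken at word level. One simplification to be aware of: the separator itself is dropped at chunk boundaries, so a chunk that ends at a sentence split loses its trailing period.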
[Figure: "Recursive Split" strategy with four levels: split by paragraphs, then lines, then sentences, then words, descending only while chunks remain too large.]
  4. Header / Markdown splitting
  • What it is: Use document structure (Markdown H1/H2/H3, or other structured headings) to keep headings and their content together.
  • Pros: Excellent for technical docs, API references, and knowledge bases — preserves hierarchy and section context.
  • Cons: Fails on unstructured prose and depends on well-formatted source documents.
When working with Markdown knowledge bases, prefer header-based splitting first and then apply recursive or sentence-level splitting inside large sections.
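A header-based first pass might look like the sketch below. It assumes ATX-style headings (`#`, `##`, `###`) and returns each section with its heading trail. The function name and the output shape are illustrative choices, and oversized sections would still need a second, finer split.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")  # matches Markdown H1-H3 lines


def split_by_headers(markdown: str) -> list[dict]:
    """Group each heading with the body text that follows it.

    Returns dicts with the heading trail (e.g. "Guide > Install") and the section text.
    Note: this sketch does not skip '#' lines inside fenced code blocks.
    """
    sections, trail, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": " > ".join(trail), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                          # close the previous section
            level = len(match.group(1))
            del trail[level - 1:]            # drop headings at this level or deeper
            trail.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API."
for section in split_by_headers(doc):
    print(section["path"], "->", section["text"])
```

Keeping the heading trail with each section preserves hierarchy, so a chunk from "Install" still carries the context that it belongs to "Guide".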
  5. Semantic / topic-shift splitting (advanced)
  • What it is: Use embeddings to detect semantic similarity and place boundaries where similarity drops below a threshold.
  • Process:
    1. Embed sentences or small units (e.g., with SentenceTransformers).
    2. Compute cosine similarities between adjacent embeddings.
    3. Insert a chunk boundary when similarity falls under a tuned threshold that indicates a topic shift.
  • Pros: Produces highly coherent, concept-aligned chunks — ideal for long content with shifting themes.
  • Cons: Computationally expensive and sensitive to embedding-model quality and threshold tuning.
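The three-step process above can be illustrated without any model dependency. In this sketch, `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model, and the threshold value is arbitrary. Both are assumptions for demonstration; in practice you would pass your own `embed` and `similarity` functions and tune the threshold on your data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector. Swap in a real model in practice."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_split(sentences: list[str], threshold: float,
                   embed=toy_embed, similarity=cosine) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if similarity(prev, cur) < threshold:
            chunks.append([sentence])        # similarity dropped: treat as a topic shift
        else:
            chunks[-1].append(sentence)
    return chunks


sentences = [
    "Cats are small pets.",
    "Cats and dogs are common pets.",
    "Python is a programming language.",
    "Python code is easy to read.",
]
for group in semantic_split(sentences, threshold=0.3):
    print(group)
```

With these toy vectors the boundary falls between the pet sentences and the Python sentences, because those two neighbours share no words. Real embeddings capture meaning rather than word overlap, which is why the threshold must be tuned per model.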
[Figure: slide titled "Core Strategy: Advanced Splitting Techniques". Advantages: semantic coherence and natural topic boundaries. Disadvantages: computational expense and slower processing.]
Choosing the right strategy
A practical workflow:
  1. Analyze your sources — are they structured (reports, Markdown) or unstructured (books, articles)?
  2. Default to recursive splitting — it provides a strong balance of coherence, size control, and simplicity.
  3. Build a ground-truth question set (queries with expected answers) and measure retrieval recall against top-k results.
  4. If recursive splitting underperforms:
    • Use header/Markdown splitting for structured documents, or
    • Use semantic/topic-shift splitting for long content with frequent theme changes.
  5. Consider hybrid approaches, e.g., header splitting to isolate sections, then recursive or semantic splitting inside each section.
Testing and metrics
Create representative queries with known answers and evaluate:
  • Retrieval recall: do the top-k retrieved chunks contain the correct evidence?
  • Answer quality: does the LLM produce useful answers when given retrieved context?
Design metrics that tolerate paraphrase and semantic variation rather than requiring exact string matches. RAG systems are not fully deterministic, so evaluate against a ground-truth dataset in two stages: start with recall-focused tests (does the correct chunk appear in the top-k results?), then measure end-to-end answer quality under your prompt and LLM configuration.
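A recall-focused test can be as simple as the following sketch. It scores a query as a hit when any expected chunk id appears in the top-k results. The function name, the chunk ids, and the sample queries are hypothetical, and your retriever's output format may differ.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, set[str]], k: int) -> float:
    """Fraction of queries whose top-k retrieved chunk ids include at least one expected id."""
    if not expected:
        return 0.0
    hits = sum(
        1 for query, relevant in expected.items()
        if relevant & set(results.get(query, [])[:k])
    )
    return hits / len(expected)


# Hypothetical ground truth: query -> ids of chunks that contain the answer.
expected = {
    "What is chunk overlap?": {"c3"},
    "Why use recursive splitting?": {"c7", "c8"},
}
# Hypothetical retriever output: query -> ranked chunk ids.
results = {
    "What is chunk overlap?": ["c1", "c3", "c9"],
    "Why use recursive splitting?": ["c2", "c4", "c7"],
}
print(recall_at_k(results, expected, k=2))  # 0.5
print(recall_at_k(results, expected, k=3))  # 1.0
```

Running the same ground-truth set against each chunking strategy gives a direct, comparable number before any LLM is involved.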
Always use the tokenizer that matches your embedding/model to measure token counts accurately. Token counting mismatches are a common source of overflow and unexpected behavior.
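A budget check along these lines can catch overflows before indexing. This is a sketch: `whitespace_tokens` is only a stand-in counter so the example runs without dependencies, and the function names are illustrative. In practice, pass a `count_tokens` function backed by the tokenizer that matches your model.

```python
from typing import Callable


def whitespace_tokens(text: str) -> int:
    """Stand-in counter: whitespace-separated words. Real tokenizers often count more tokens."""
    return len(text.split())


def over_budget(chunks: list[str], max_tokens: int,
                count_tokens: Callable[[str], int] = whitespace_tokens) -> list[int]:
    """Return the indices of chunks whose token count exceeds max_tokens."""
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]


chunks = ["short chunk", "this chunk has quite a few more words than the budget allows"]
print(over_budget(chunks, max_tokens=8))  # [1]
```

Any index returned here marks a chunk that should be split further before it is embedded or placed in a prompt.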
Key takeaways
  • Chunking is fundamental to RAG pipelines — poor chunking propagates errors downstream.
  • No single best strategy fits every document — choose based on document structure, query patterns, and compute constraints.
  • Start with recursive splitting as a default. It’s simple, fast, and effective for most cases.
  • For long or thematically shifting content, consider semantic splitting (with higher compute cost) or hybrid strategies.
  • Always test: build ground-truth queries, measure retrieval performance and answer quality, and iterate.
[Figure: key takeaways slide: chunking is essential, tailor the strategy to the document, start with recursive splitting, and move to semantic splitting when possible.]