- Respect model context limits (tokens).
- Speed up retrieval and ranking (smaller units are faster to search).
- Reduce noise and improve recall by limiting irrelevant text inside any one context window.
- Coherence: self-contained and semantically complete where possible (avoid cutting mid-sentence or mid-thought).
- Appropriate size: fits your target token limit while retaining required concepts.
- Overlap: intentional overlap with adjacent chunks (commonly ~10–20% or 1–2 sentences) so multi-sentence concepts don’t disappear across boundaries.
- Natural boundaries: prefer paragraph, heading, code block, or section boundaries when available.
- Provenance metadata: include source, document id, section/title, page number, timestamps, etc., so retrieved chunks can be traced to their origin.

| Use case | Typical chunk size (tokens) | Notes |
|---|---|---|
| Q&A / retrieval | 256–512 | Good for short factual lookups and fast ranking. |
| Summarization / single-document synthesis | 1k–4k | Larger chunks preserve longer context for summarization. |
| Long documents / book-scale | 512–1k | Combine chunking with retrieval or multi-hop approaches. |
- Fixed-size chunking
- What it is: Split text into fixed N-character or N-token chunks, optionally with a fixed overlap.
- When to use: Quick ingestion pipelines, or when document structure is unavailable and speed/predictability is critical.
- Pros: Simple, predictable, and fast to compute.
- Cons: Semantically blind — may break sentences and reduce coherence.

- Context-aware splits (paragraph / sentence splitting)
- What it is: Use natural textual delimiters — paragraphs, sentences, and line breaks — to create chunks.
- Pros: Preserves linguistic boundaries and improves coherence compared to fixed-size splits. Easy to implement with standard NLP tools.
- Cons: Chunk sizes vary; long paragraphs may still exceed token limits and require further splitting.
- Combine paragraph/sentence splitting with a tokenizer check to ensure chunks fit your token budget.
- Apply small overlaps (1–2 sentences) to prevent loss of cross-boundary context.


- Recursive split (recommended default)
- What it is: Multi-pass splitting that preserves the largest natural units first and only splits deeper when chunks exceed size constraints.
- Typical pass order: headings/sections → paragraphs → lines → sentences → words.
- Why use it: Balances coherence and size constraints by adapting to document structure and avoiding blind truncation.
- When to use: Default for most RAG pipelines — a strong choice for ~80% of use cases.
- Try to split by paragraph/section boundaries.
- If a chunk still exceeds token limits, split by line breaks.
- If still too large, split into sentences.
- As a last resort, split by words or fixed token slices.

- Header / Markdown splitting
- What it is: Use document structure (Markdown H1/H2/H3, or other structured headings) to keep headings and their content together.
- Pros: Excellent for technical docs, API references, and knowledge bases — preserves hierarchy and section context.
- Cons: Fails on unstructured prose and depends on well-formatted source documents.
- Semantic / topic-shift splitting (advanced)
- What it is: Use embeddings to detect semantic similarity and place boundaries where similarity drops below a threshold.
- Process:
- Embed sentences or small units (e.g., with SentenceTransformers).
- Compute cosine similarities between adjacent embeddings.
- Insert a chunk boundary when similarity falls under a tuned threshold that indicates a topic shift.
- Pros: Produces highly coherent, concept-aligned chunks — ideal for long content with shifting themes.
- Cons: Computationally expensive and sensitive to embedding-model quality and threshold tuning.
- SentenceTransformers: https://www.sbert.net/

- Analyze your sources — are they structured (reports, Markdown) or unstructured (books, articles)?
- Default to recursive splitting — it provides a strong balance of coherence, size control, and simplicity.
- Build a ground-truth question set (queries with expected answers) and measure retrieval recall against top-k results.
- If recursive splitting underperforms:
- Use header/Markdown splitting for structured documents, or
- Use semantic/topic-shift splitting for long content with frequent theme changes.
- Consider hybrid approaches, e.g., header splitting to isolate sections, then recursive or semantic splitting inside each section.
- Retrieval recall: do the top-k retrieved chunks contain the correct evidence?
- Answer quality: does the LLM produce useful answers when given retrieved context?
Measure retrieval and end-to-end answer quality using a ground-truth dataset. Start with recall-focused tests (does the correct chunk appear in top-k?) and then measure final answer quality with your prompt/LLM setup.
Always use the tokenizer that matches your embedding/model to measure token counts accurately. Token counting mismatches are a common source of overflow and unexpected behavior.
- Chunking is fundamental to RAG pipelines — poor chunking propagates errors downstream.
- No single best strategy fits every document — choose based on document structure, query patterns, and compute constraints.
- Start with recursive splitting as a default. It’s simple, fast, and effective for most cases.
- For long or thematically shifting content, consider semantic splitting (with higher compute cost) or hybrid strategies.
- Always test: build ground-truth queries, measure retrieval performance and answer quality, and iterate.

- RAG patterns and best practices: https://platform.openai.com/docs/guides/retrieval
- SentenceTransformers (semantic splitting): https://www.sbert.net/
- Tokenizers and byte-pair encoding: https://huggingface.co/docs/tokenizers/usage
- Evaluation guidance for RAG: experiment with recall/top-k and end-to-end answer scoring; consider fuzzy matching and semantic similarity metrics.