This lesson continues the document ingestion series for retrieval-augmented generation (RAG) systems and focuses on parsing Microsoft Word DOCX files with Python. We’ll build a robust DOCX ingestion pipeline that:
  • extracts text and core metadata from DOCX files,
  • splits large documents into intelligent, overlapping chunks suitable for embedding,
  • preserves paragraph and sentence boundaries when possible to improve retrieval quality for downstream LLM usage.
What you’ll get
  • A minimal, production-friendly DocxParser implementation (parsing, chunking, and orchestration).
  • A chunking strategy that prefers paragraph and sentence boundaries, with word-boundary fallback.
  • A small example script to run and inspect chunks before embedding into a vector store.
Contents
  • Setup
  • Quick note about files
  • Imports and the core class (complete code you can save as main.py)
  • How the parser works (summary)
  • Output example and next steps
  • Links and references
Setup
Run these commands to create the project file and a Python virtual environment, then install the DOCX parsing dependency:
# Create project and virtual environment
touch main.py
python3 -m venv venv
source venv/bin/activate

# Install the DOCX parser dependency
pip install python-docx
Quick note about files
Run these commands in the directory where you’ll keep your DOCX files (for example, the same directory as main.py). The example below expects a DOCX file named Sample.docx (or whichever filename you pass to the script) in that directory.
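Before running the parser, it can help to confirm that the input really is a DOCX package. The sketch below uses only the standard library; looks_like_docx is a hypothetical helper name, and it assumes the main document part is stored at word/document.xml, which is the usual layout but is not checked against the package relationships.

```python
import zipfile
from pathlib import Path


def looks_like_docx(file_path: str) -> bool:
    """Return True if file_path exists and has the basic shape of a DOCX package."""
    path = Path(file_path)
    if not path.is_file() or path.suffix.lower() != ".docx":
        return False
    # A DOCX file is a ZIP archive, so anything else can be rejected early.
    if not zipfile.is_zipfile(path):
        return False
    # Assumption: the main body lives at word/document.xml (the common layout).
    with zipfile.ZipFile(path) as archive:
        return "word/document.xml" in archive.namelist()


if __name__ == "__main__":
    print(looks_like_docx("Sample.docx"))
```

A check like this only rules out obviously wrong inputs (a missing file, a renamed PDF); python-docx may still reject a damaged file that passes it.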
Imports and core class
Below is a consolidated implementation of the DOCX ingestion pipeline. Save the code as main.py. This single file contains:
  • parse_docx — read paragraphs and extract basic core properties into metadata.
  • chunk_text — split text into overlapping chunks while preferring paragraph and sentence boundaries, and falling back to word boundaries.
  • process_document — orchestrator that parses and chunks, injecting document metadata into each chunk.
  • _generate_doc_id — convenience helper to create a deterministic document ID.
from docx import Document
from pathlib import Path
from typing import List, Dict, Optional
import hashlib


class DocxParser:
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        """
        Args:
            chunk_size: maximum number of characters per chunk
            chunk_overlap: number of characters to overlap between chunks
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def parse_docx(self, file_path: str) -> Dict:
        """
        Parse a DOCX file and extract content with metadata.

        Returns:
            dict: {
                'content': str,
                'paragraphs': List[str],
                'metadata': Dict[str, Optional[str]]
            }
        """
        path = Path(file_path)
        doc = Document(file_path)

        # Extract paragraphs (skip empty paragraphs)
        paragraphs: List[str] = []
        for para in doc.paragraphs:
            text = para.text.strip()
            if text:
                paragraphs.append(text)

        # Join paragraphs with double newline to preserve paragraph boundaries
        content = "\n\n".join(paragraphs)

        # Core properties may be None
        core_props = doc.core_properties
        created = None
        modified = None
        try:
            if core_props.created:
                created = core_props.created.isoformat()
        except Exception:
            # Some versions or properties may not be datetime; convert to str as fallback
            created = str(core_props.created) if core_props.created else None

        try:
            if core_props.modified:
                modified = core_props.modified.isoformat()
        except Exception:
            modified = str(core_props.modified) if core_props.modified else None

        metadata = {
            "filename": path.name,
            "file_path": str(path.absolute()),
            "title": core_props.title or "Untitled",
            "author": core_props.author or "Unknown",
            "created": created,
            "modified": modified,
        }

        return {"content": content, "paragraphs": paragraphs, "metadata": metadata}

    def chunk_text(self, text: str) -> List[Dict]:
        """
        Split text into overlapping chunks.

        The method prioritizes:
          1. Paragraph boundaries ('\n\n')
          2. Sentence endings ('.', '!', '?')
          3. Word boundaries (whitespace)

        Returns:
            List of chunk dictionaries with metadata:
            {
                'chunk_id': int,
                'text': str,
                'start_char': int,
                'end_char': int,
                'chunk_length': int
            }
        """
        chunks: List[Dict] = []
        start = 0
        chunk_id = 0
        text_length = len(text)

        sentence_end_chars = {".", "!", "?"}

        while start < text_length:
            # Default end (cap at text_length)
            end = min(start + self.chunk_size, text_length)

            if end < text_length:
                # Search for paragraph break within the overlap zone
                search_start = max(start, end - self.chunk_overlap)
                found = False

                # Look for paragraph break '\n\n'
                for i in range(end, search_start - 1, -1):
                    if text[i : i + 2] == "\n\n":
                        end = i + 2  # include the paragraph break
                        found = True
                        break

                # Look for sentence ending if no paragraph break found
                if not found:
                    sentence_search_start = max(start, end - max(100, self.chunk_overlap // 2))
                    for i in range(end - 1, sentence_search_start - 1, -1):
                        if text[i] in sentence_end_chars:
                            end = i + 1  # include the sentence-ending punctuation
                            found = True
                            break

                # Finally, fallback to word boundary (whitespace)
                if not found:
                    for i in range(end - 1, search_start - 1, -1):
                        if text[i].isspace():
                            end = i
                            found = True
                            break

                # If still not found, keep the original end (hard cut)
            # Extract the chunk text
            chunk_text = text[start:end].strip()
            if chunk_text:
                chunks.append(
                    {
                        "chunk_id": chunk_id,
                        "text": chunk_text,
                        "start_char": start,
                        "end_char": end,
                        "chunk_length": len(chunk_text),
                    }
                )
                chunk_id += 1

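            # Stop once the end of the text has been reached
            if end >= text_length:
                break
            # Step back by the overlap, but always move forward by at least one
            # character so a large chunk_overlap cannot cause an infinite loop
            start = max(end - self.chunk_overlap, start + 1)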

        return chunks

    def process_document(self, file_path: str) -> List[Dict]:
        """
        Complete pipeline: parse DOCX and create chunks, attaching document metadata to each chunk.

        Returns:
            List[Dict]: chunks with attached 'document_metadata' and 'document_id'
        """
        # Parse the document
        doc_data = self.parse_docx(file_path)

        # Chunk the full content
        chunks = self.chunk_text(doc_data["content"])

        # Add document metadata to each chunk
        document_id = self._generate_doc_id(str(file_path))
        for chunk in chunks:
            chunk["document_metadata"] = doc_data["metadata"]
            chunk["document_id"] = document_id

        return chunks

    def _generate_doc_id(self, file_path: str) -> str:
        """Generate a deterministic document ID (MD5 of the file path)."""
        return hashlib.md5(file_path.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Ensure you have a file named 'Sample.docx' in the same directory!
    parser = DocxParser(chunk_size=1000, chunk_overlap=200)

    # Process the document and get our chunks
    chunks = parser.process_document("Sample.docx")

    # Print the results for inspection
    for chunk in chunks:
        print(f"---- Chunk {chunk['chunk_id']} (Length: {chunk['chunk_length']}) ----")
        print(chunk["text"])
        print(f"Source: {chunk['document_metadata']['title']} (ID: {chunk['document_id']})")
        print("________________________________________________________________\n")
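Printing every chunk is fine for a small file, but for longer documents a short summary is easier to scan before embedding. The following is a minimal sketch, assuming chunks shaped like the dictionaries returned by chunk_text; summarize_chunks is a hypothetical helper and not part of DocxParser.

```python
from typing import Dict, List


def summarize_chunks(chunks: List[Dict]) -> Dict:
    """Summarize chunk lengths and the character overlap between neighbours."""
    if not chunks:
        return {"count": 0, "min_length": 0, "max_length": 0, "overlaps": []}
    lengths = [c["chunk_length"] for c in chunks]
    # Overlap = how far the previous chunk's end extends past the next chunk's start.
    overlaps = [
        prev["end_char"] - nxt["start_char"]
        for prev, nxt in zip(chunks, chunks[1:])
    ]
    return {
        "count": len(chunks),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "overlaps": overlaps,
    }
```

An overlap value that is zero or negative, or a max_length above chunk_size, would suggest that the chunking parameters deserve a second look.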
How the parser works (summary)
  • parse_docx
    • Uses python-docx to read paragraph texts.
    • Removes empty paragraphs and joins paragraphs with a double-newline (\n\n) so that paragraph boundaries are preserved and available to the chunking logic.
    • Extracts core properties (title, author, created, modified) when available and returns them as metadata.
  • chunk_text
    • Slides a window of size chunk_size across the full text.
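    • Near the end of each window it looks backward for a break point, in this order:
      1. Paragraph boundary (\n\n), searched within the last chunk_overlap characters
      2. Sentence-ending punctuation (., !, ?), searched within the last max(100, chunk_overlap // 2) characters
      3. Word boundary (whitespace), searched within the last chunk_overlap characters
    • If none of these is found, it performs a hard cut at chunk_size.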
    • Produces overlapping chunks by advancing start to end - chunk_overlap.
  • process_document
    • Orchestrates parsing and chunking, attaches document_metadata and a deterministic document_id to each chunk. Chunks are ready for embedding and storage in a vector database.
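The boundary preference described above can be isolated into a small standalone function. This is a simplified sketch rather than the exact logic in chunk_text: find_split_point is a hypothetical name, and it uses a single search window for all three boundary types, whereas the class uses a narrower window for sentence endings.

```python
def find_split_point(text: str, start: int, end: int, window: int) -> int:
    """Pick a cut position in (start, end], preferring paragraph, sentence, then word boundaries."""
    lo = max(start + 1, end - window)
    # 1. Paragraph break: cut just after the last "\n\n" inside the window.
    idx = text.rfind("\n\n", lo, end)
    if idx != -1:
        return idx + 2
    # 2. Sentence end: cut just after the last '.', '!' or '?'.
    for i in range(end - 1, lo - 1, -1):
        if text[i] in ".!?":
            return i + 1
    # 3. Word boundary: cut at the last whitespace character.
    for i in range(end - 1, lo - 1, -1):
        if text[i].isspace():
            return i
    # 4. Nothing suitable in the window: hard cut at the original end.
    return end
```

Because the search never goes below start + 1, the returned position is always past start, so a caller that slices text[start:cut] gets a non-empty span.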
DocxParser configuration and outputs
Parameter | Purpose | Example
--- | --- | ---
chunk_size | Maximum characters per chunk | 1000
chunk_overlap | Overlap characters between consecutive chunks | 200
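To get a feel for how these two parameters interact, it helps to estimate how many chunks a document will produce. After the first chunk, each additional chunk covers roughly chunk_size - chunk_overlap new characters. The sketch below (estimate_chunk_count is a hypothetical helper) assumes hard cuts at exactly chunk_size; because the parser moves cuts back to paragraph, sentence, or word boundaries, the real count is usually the same or slightly higher.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the number of chunks for a sliding window with a fixed overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new characters covered per extra chunk
    return 1 + math.ceil((text_length - chunk_size) / stride)


if __name__ == "__main__":
    # 5000 characters, 1000-character chunks, 200-character overlap
    print(estimate_chunk_count(5000, 1000, 200))  # prints 6
```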
Chunk dictionary schema (each chunk returned by process_document):
Key | Description | Example
--- | --- | ---
chunk_id | Integer chunk index | 0
text | The chunked text string | "...paragraph text..."
start_char | Start offset in original text | 0
end_char | End offset in original text | 966
chunk_length | Length of the text field | 966
document_metadata | Document-level metadata (see below) | See below
document_id | Deterministic ID for provenance | MD5 hash
Document metadata keys produced by parse_docx:
Key | Description | Example
--- | --- | ---
filename | File name | Sample.docx
file_path | Absolute path | /home/user/project/Sample.docx
title | Document title (or Untitled) | My Report
author | Author (or Unknown) | Jane Doe
created | Created timestamp (ISO format) | 2023-05-01T12:00:00
modified | Modified timestamp (ISO format) | 2023-05-02T08:30:00
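The two schemas above can also be written down as types, which makes it easier to catch a missing key before chunks reach a vector store. This is one possible sketch using typing.TypedDict (Python 3.8+); the class names and the missing_chunk_keys helper are hypothetical and do not appear in main.py.

```python
from typing import Optional, TypedDict


class DocumentMetadata(TypedDict):
    filename: str
    file_path: str
    title: str
    author: str
    created: Optional[str]   # ISO-format string, or None when the property is missing
    modified: Optional[str]  # ISO-format string, or None when the property is missing


class Chunk(TypedDict):
    chunk_id: int
    text: str
    start_char: int
    end_char: int
    chunk_length: int
    document_metadata: DocumentMetadata
    document_id: str


REQUIRED_CHUNK_KEYS = set(Chunk.__annotations__)


def missing_chunk_keys(chunk: dict) -> set:
    """Return the schema keys that are absent from a chunk dictionary."""
    return REQUIRED_CHUNK_KEYS - set(chunk)
```

TypedDict annotations are not enforced at runtime, so the explicit key check is what actually guards the data; the type definitions mainly help editors and static type checkers.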
Example output (sample)
When you run python main.py against a small DOCX file (Sample.docx), you’ll see printed chunks similar to:
---- Chunk 0 (Length: 966) ----
The Fable of Fiona the Fussy Feline

Fiona, a tuxedo cat of considerable fluff and questionable temper, believed the world revolved entirely around the timely provision of tuna in springwater, not brine. She lived with a human named Bernard, a kind but perpetually confused soul who often purchased the wrong variety.

"Mrow?" Fiona would inquire, delicately batting a paw at the offensive can, a dramatic sigh escaping her tiny pink nose. The sound was less a plea and more a deeply felt existential critique of Bernard's life choices.

The Sardine Incident
...
Source: The Fable of Fiona the Fussy Feline (ID: 1a2b3c4d5e6f...)
________________________________________________________________

---- Chunk 1 (Length: 936) ----
Title: Untitled

Bernard's Excuse: "They were on a special, Fiona! And they have extra Omega-3s!"
...
Source: The Fable of Fiona the Fussy Feline (ID: 1a2b3c4d5e6f...)
________________________________________________________________
Next steps and recommendations
  • Embed each chunk’s text with your embedding model and persist the vectors along with document_metadata and document_id in your vector database. Use the metadata to provide provenance during retrieval and answer generation.
  • Extend parse_docx to extract headings, tables, footnotes, and other structured content to improve chunk semantics and retrieval precision.
  • Tune chunk_size/chunk_overlap for your embedding model and retrieval latency: larger chunks reduce the number of vectors but may reduce relevance granularity.
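As a rough illustration of the first step, the sketch below turns chunks into records that carry a vector plus the provenance fields. It is deliberately store-agnostic: build_vector_records and the record layout are hypothetical, and embed stands in for whatever function wraps your embedding model (no particular embedding library or vector database API is assumed).

```python
from typing import Callable, Dict, List, Sequence


def build_vector_records(
    chunks: List[Dict],
    embed: Callable[[str], Sequence[float]],  # hypothetical: wraps your embedding model
) -> List[Dict]:
    """Pair each chunk's embedding with the metadata needed for provenance."""
    records = []
    for chunk in chunks:
        records.append(
            {
                # document_id plus chunk_id gives a stable ID that is unique per chunk
                "id": f"{chunk['document_id']}-{chunk['chunk_id']}",
                "vector": list(embed(chunk["text"])),
                "text": chunk["text"],
                "metadata": {
                    **chunk["document_metadata"],
                    "document_id": chunk["document_id"],
                    "chunk_id": chunk["chunk_id"],
                    "start_char": chunk["start_char"],
                    "end_char": chunk["end_char"],
                },
            }
        )
    return records
```

Keeping start_char and end_char in the metadata lets you map a retrieved chunk back to its position in the parsed text when showing sources alongside an answer.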
Best practices
  • Keep paragraph breaks (\n\n) intact when possible to make chunks more semantically meaningful.
  • Store document_id and filename with vectors for traceability.
  • For multi-document ingestion, compute a file hash or use a content hash for deduplication.
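For the deduplication point, note that _generate_doc_id hashes the path string, so two copies of the same file under different names get different IDs. A content hash avoids that. The following is a minimal sketch using SHA-256 over the raw file bytes; file_content_hash and is_duplicate are hypothetical helpers, and the choice of SHA-256 is an assumption rather than something the parser above uses.

```python
import hashlib
from pathlib import Path


def file_content_hash(file_path: str, block_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in blocks."""
    digest = hashlib.sha256()
    with Path(file_path).open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_duplicate(file_path: str, seen_hashes: set) -> bool:
    """Record the file's hash in seen_hashes; return True if it was already present."""
    content_hash = file_content_hash(file_path)
    if content_hash in seen_hashes:
        return True
    seen_hashes.add(content_hash)
    return False
```

Hashing the raw bytes only detects byte-identical files. Re-saving a DOCX may change its bytes even when the visible text is unchanged, so hashing the extracted content string from parse_docx is an alternative worth considering.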
Links and references
This parser is a simple, extensible baseline for DOCX ingestion in RAG pipelines and integrates well with embedding models and vector stores for scalable retrieval.
