Demo Ingesting CSV

This lesson shows how to convert a plain CSV file into RAG-ready documents suitable for a retriever. The aim is to produce one clean, searchable document per CSV row with:

A human-readable text field (concatenated non-empty columns),
A metadata object that preserves the original CSV fields and source information,
An optional JSON export that you can ingest into a vector store or retriever pipeline.

We’ll provide a single, self-contained Python script that:

reads a CSV file,
converts each row into a document with id, text, and metadata,
optionally writes the list of documents to a JSON file for ingestion.

CSV quirks to watch for: headers may include leading/trailing whitespace, values may be empty, and fields can contain commas or newlines. The script below handles common cases (skips empty values, uses UTF-8). For inconsistent or complex CSV sources, add normalization (trim headers, unify casing) or a preprocessor that handles quoting and encodings.

Setup (optional virtual environment)

Create and activate a virtual environment if you prefer to isolate dependencies:

python3 -m venv venv
source venv/bin/activate

main.py — CSV → RAG document converter

Create a file named main.py and paste the following code. This single script reads sample_data.csv (or another CSV you set) and creates a list of documents ready for ingestion.

# main.py
import csv
import json
from pathlib import Path
from typing import List, Dict, Optional

def parse_csv_for_rag(csv_file_path: str, output_file_path: Optional[str] = None) -> List[Dict]:
    """
    Parse a CSV file into a list of documents suitable for RAG ingestion.
    Each document contains:
      - id: a generated id (doc_{row_index})
      - text: a concatenated text representation of non-empty fields
      - metadata: includes source, row_number, and the original CSV fields
    """
    documents: List[Dict] = []

    try:
        with open(csv_file_path, 'r', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            fieldnames = reader.fieldnames or []
            print(f"Found columns: {fieldnames}")

            for idx, row in enumerate(reader):
                # Build a readable text representation from non-empty fields
                text_parts = []
                for key, value in row.items():
                    if value is not None:
                        value_str = value.strip()
                        if value_str:
                            text_parts.append(f"{key}: {value_str}")

                text = " | ".join(text_parts)

                # Copy row fields into metadata (preserve as strings)
                row_metadata = {k: (v if v is not None else "") for k, v in row.items()}

                document = {
                    "id": f"doc_{idx}",
                    "text": text,
                    "metadata": {
                        "source": str(csv_file_path),
                        "row_number": idx,
                        **row_metadata
                    }
                }

                documents.append(document)

    except FileNotFoundError:
        print(f"Error: {csv_file_path} not found!")
        return []
    except Exception as e:
        print(f"Error reading {csv_file_path}: {e}")
        return []

    print(f"Processed {len(documents)} documents")

    # Optionally save to JSON file
    if output_file_path:
        try:
            with open(output_file_path, 'w', encoding='utf-8') as outfile:
                json.dump(documents, outfile, indent=2, ensure_ascii=False)
            print(f"Saved documents to {output_file_path}")
        except Exception as e:
            print(f"Error writing {output_file_path}: {e}")

    return documents


def main():
    # Input and output file paths (adjust as needed)
    csv_file = "sample_data.csv"
    output_file = "rag_documents.json"

    if not Path(csv_file).exists():
        print(f"Error: {csv_file} not found!")
        return

    documents = parse_csv_for_rag(csv_file, output_file)

    if documents:
        # Print a sample document and the total number of chunks
        print("\nSample document:")
        print(json.dumps(documents[0], indent=2, ensure_ascii=False))
        total = len(documents)
        print(f"\nTotal Chunks: {total}")


if __name__ == "__main__":
    main()

Document schema (what each parsed item contains)

Field	Type	Description	Example
`id`	string	Generated identifier for the document (`doc_{row_index}`)	`doc_0`
`text`	string	Human-readable concatenation of non-empty `key: value` pairs	`id: 1	first_name: Victor	…`
`metadata`	object	Contains `source`, `row_number`, and all original CSV fields as string values	`{ "source": "sample_data.csv", "row_number": 0, "id": "1", "first_name": "Victor" }`

How the parser works (summary)

Uses csv.DictReader to map header names to values per row.
For each row:
- Builds a text string by concatenating non-empty key: value pairs separated by | for easy searchability.
- Copies all original CSV fields into metadata (preserving string values).
- Adds source and row_number to metadata.
- Generates a top-level document id using the doc_{row_index} pattern.
Appends each document to a list and optionally writes the full list to a JSON file for ingestion.

Run the script

Place your CSV (for example, sample_data.csv) in the same directory as main.py. Then run:

python main.py

On success you should see the columns found, the processed document count, and a sample output. Example console output:

Found columns: ['id', 'first_name', 'last_name', 'email', 'ip_address', 'favorite_animal']
Processed 1000 documents
Saved documents to rag_documents.json

Sample document:
{
  "id": "doc_0",
  "text": "id: 1 | first_name: Victor | last_name: Heliet | email: vheliet@pagesperso-orange.fr | ip_address: 225.97.9.194 | favorite_animal: Panthera leo persica",
  "metadata": {
    "source": "sample_data.csv",
    "row_number": 0,
    "id": "1",
    "first_name": "Victor",
    "last_name": "Heliet",
    "email": "vheliet@pagesperso-orange.fr",
    "ip_address": "225.97.9.194",
    "favorite_animal": "Panthera leo persica"
  }
}

Total Chunks: 1000

Notes, tips, and next steps

One CSV row maps to one RAG “chunk” (document). If you need different chunking strategies (e.g., split long text fields or combine rows), update parse_csv_for_rag accordingly.
Header names are used as-is. Normalize column names (trim whitespace, lower-case, replace spaces) if you have inconsistent sources.
The script generates a top-level id and will also include any id column from the CSV inside metadata. Avoid naming collisions if your downstream store expects a single unique id.
For larger CSVs (tens of thousands of rows), consider streaming rows and writing documents incrementally (or sending them directly to your vector store) to avoid high memory use.
After generating rag_documents.json, ingest it into your vector database/retriever using the connector or ingestion script supported by your vector store.

If your CSV contains nested JSON, fields with embedded commas, or multi-line values, ensure fields are properly quoted. For complex inputs, use a robust CSV library or preprocessor that properly handles quoting, escaping, and multi-line fields to avoid corrupted records.

Links and references

This parser offers a straightforward, reproducible method to convert tabular CSV data into RAG-ready documents with both human-readable text and full metadata preservation.

Introduction

Document Processing and Chunking

Keyword Search & Retrieval

Semantic Search & Embeddings

Vector Databases

Building the RAG Pipeline

Demo Ingesting CSV

Setup (optional virtual environment)

main.py — CSV → RAG document converter

Document schema (what each parsed item contains)

How the parser works (summary)

Run the script

Notes, tips, and next steps

Links and references

Watch Video

​Setup (optional virtual environment)

​main.py — CSV → RAG document converter

​Document schema (what each parsed item contains)

​How the parser works (summary)

​Run the script

​Notes, tips, and next steps

​Links and references

Watch Video

Setup (optional virtual environment)

main.py — CSV → RAG document converter

Document schema (what each parsed item contains)

How the parser works (summary)

Run the script

Notes, tips, and next steps

Links and references