- A human-readable `text` field (concatenated non-empty columns),
- A `metadata` object that preserves the original CSV fields and source information,
- An optional JSON export that you can ingest into a vector store or retriever pipeline.

The script:
- reads a CSV file,
- converts each row into a document with `id`, `text`, and `metadata`,
- optionally writes the list of documents to a JSON file for ingestion.
CSV quirks to watch for: headers may include leading/trailing whitespace, values may be empty, and fields can contain commas or newlines. The script below handles common cases (skips empty values, uses UTF-8). For inconsistent or complex CSV sources, add normalization (trim headers, unify casing) or a preprocessor that handles quoting and encodings.
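To make the normalization advice concrete, here is a minimal sketch of trimming and unifying header names before rows are turned into documents. The helper names `normalize_header` and `read_rows_normalized` are hypothetical and not part of the original script; the underscore convention is just one reasonable choice.

```python
import csv
import io


def normalize_header(name: str) -> str:
    """Trim whitespace, lower-case, and join inner words with underscores."""
    return "_".join(name.strip().lower().split())


def read_rows_normalized(handle):
    """Yield rows as dicts with normalized keys and stripped string values."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield {
            normalize_header(key): (value or "").strip()
            for key, value in row.items()
            # DictReader stores surplus cells under the key None; skip them here.
            if key is not None
        }


# Small in-memory sample with padded headers and a quoted comma.
sample = io.StringIO(' First Name ,City\n Victor ,"Paris, France"\n')
rows = list(read_rows_normalized(sample))
print(rows)
```

If two raw headers normalize to the same name, the later column would silently win in this sketch, so it may be worth checking for duplicates on real data.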
Setup (optional virtual environment)
Create and activate a virtual environment if you prefer to isolate dependencies.
main.py: CSV → RAG document converter
Create a file named `main.py` and paste the converter code into it. This single script reads `sample_data.csv` (or another CSV you set) and creates a list of documents ready for ingestion.
Document schema (what each parsed item contains)
| Field | Type | Description | Example |
|---|---|---|---|
| `id` | string | Generated identifier for the document (`doc_{row_index}`) | `doc_0` |
| `text` | string | Human-readable concatenation of non-empty `key: value` pairs | `id: 1 \| first_name: Victor \| …` |
| `metadata` | object | Contains `source`, `row_number`, and all original CSV fields as string values | `{ "source": "sample_data.csv", "row_number": 0, "id": "1", "first_name": "Victor" }` |
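For readers who like type hints, the schema in the table could be written down as a `TypedDict`. This is only an illustrative sketch: the class name `RagDocument` is hypothetical, and the example `text` is trimmed to the two fields visible in the table.

```python
from typing import Dict, TypedDict, Union


class RagDocument(TypedDict):
    """Shape of one parsed document, mirroring the schema table."""

    id: str        # generated, e.g. "doc_0"
    text: str      # "key: value" pairs joined with " | "
    # CSV fields stay strings; row_number is an integer.
    metadata: Dict[str, Union[str, int]]


example: RagDocument = {
    "id": "doc_0",
    "text": "id: 1 | first_name: Victor",
    "metadata": {
        "source": "sample_data.csv",
        "row_number": 0,
        "id": "1",
        "first_name": "Victor",
    },
}
```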
How the parser works (summary)
- Uses `csv.DictReader` to map header names to values per row.
- For each row:
  - Builds a `text` string by concatenating non-empty `key: value` pairs separated by ` | ` for easy searchability.
  - Copies all original CSV fields into `metadata` (preserving string values).
  - Adds `source` and `row_number` to `metadata`.
  - Generates a top-level document `id` using the `doc_{row_index}` pattern.
- Appends each document to a list and optionally writes the full list to a JSON file for ingestion.
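The steps above can be sketched roughly as follows. This is a minimal reconstruction from the summary and schema, not necessarily the exact script: the `output_json` parameter name, the use of the file's base name as `source`, and the handling of missing cells are assumptions.

```python
import csv
import json
from pathlib import Path


def parse_csv_for_rag(csv_path, output_json=None):
    """Convert each CSV row into an {"id", "text", "metadata"} document."""
    documents = []
    source = Path(csv_path).name  # assumption: store only the file name
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        for row_index, row in enumerate(reader):
            # Keep named columns only; missing cells become empty strings.
            fields = {k: (v or "") for k, v in row.items() if k is not None}
            # Join non-empty "key: value" pairs with " | ".
            text = " | ".join(
                f"{k}: {v}" for k, v in fields.items() if v.strip()
            )
            metadata = {"source": source, "row_number": row_index, **fields}
            documents.append(
                {"id": f"doc_{row_index}", "text": text, "metadata": metadata}
            )
    if output_json is not None:
        # Optional export of the full list for later ingestion.
        with open(output_json, "w", encoding="utf-8") as out:
            json.dump(documents, out, ensure_ascii=False, indent=2)
    return documents
```

Calling `parse_csv_for_rag("sample_data.csv", "rag_documents.json")` would return the list and also write it to `rag_documents.json`. In this sketch, a CSV column named `source` or `row_number` would overwrite the generated metadata keys, so rename such columns first if they occur.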
Run the script
Place your CSV (for example, `sample_data.csv`) in the same directory as `main.py`. Then run the script with `python main.py`.
Notes, tips, and next steps
- One CSV row maps to one RAG “chunk” (document). If you need different chunking strategies (e.g., split long text fields or combine rows), update `parse_csv_for_rag` accordingly.
- Header names are used as-is. Normalize column names (trim whitespace, lower-case, replace spaces) if you have inconsistent sources.
- The script generates a top-level `id` and will also include any `id` column from the CSV inside `metadata`. Avoid naming collisions if your downstream store expects a single unique id.
- For larger CSVs (tens of thousands of rows), consider streaming rows and writing documents incrementally (or sending them directly to your vector store) to avoid high memory use.
- After generating `rag_documents.json`, ingest it into your vector database/retriever using the connector or ingestion script supported by your vector store.
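For the large-file case mentioned in the notes, one option is to yield documents one at a time and write them incrementally. The sketch below swaps the single JSON list for JSON Lines (one document per line), which is a different output format from `rag_documents.json`; the function names `iter_rag_documents` and `write_jsonl` are hypothetical.

```python
import csv
import json


def iter_rag_documents(csv_path):
    """Yield one document per CSV row without building the full list."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row_index, row in enumerate(csv.DictReader(handle)):
            fields = {k: (v or "") for k, v in row.items() if k is not None}
            text = " | ".join(
                f"{k}: {v}" for k, v in fields.items() if v.strip()
            )
            yield {
                "id": f"doc_{row_index}",
                "text": text,
                # Here source is the path exactly as passed in.
                "metadata": {"source": csv_path, "row_number": row_index, **fields},
            }


def write_jsonl(csv_path, jsonl_path):
    """Write documents as JSON Lines and return how many were written."""
    count = 0
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for doc in iter_rag_documents(csv_path):
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because only one row is held in memory at a time, the same generator could also feed batches directly to a vector store client instead of a file.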
If your CSV contains nested JSON, fields with embedded commas, or multi-line values, ensure fields are properly quoted. For complex inputs, use a robust CSV library or preprocessor that properly handles quoting, escaping, and multi-line fields to avoid corrupted records.
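As a quick illustration of why quoting matters, the sketch below feeds Python's standard `csv` module a properly quoted field that contains both a line break and a comma. The sample data is made up for the example.

```python
import csv
import io

# One record whose "note" field spans two lines and contains a comma.
raw = 'id,note\n1,"line one\nline two, with a comma"\n2,plain\n'

# newline="" leaves line endings untouched so the csv module can
# interpret line breaks inside quoted fields itself.
rows = list(csv.DictReader(io.StringIO(raw, newline="")))

for row in rows:
    print(repr(row["id"]), repr(row["note"]))
```

The quoted field stays in a single record. Without the surrounding quotes, the same text would be split into extra columns and rows, which is the kind of corruption the paragraph above warns about.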