Transforming Data Combining

You’ve ingested and cleaned separate source files, but the business questions you need to answer require a single, enriched table. This lesson shows how to combine multiple tables, resolve column conflicts, compute derived fields, and produce summary tables for reporting and analysis using pandas. What you’ll learn:

Rename conflicting fields to avoid ambiguity when merging datasets.
Merge datasets (orders, products, customers) into a single, analyzable table.
Compute derived metrics (line totals) and produce aggregation summaries (top products/customers).

The image features a man speaking next to a presentation slide with tips on managing datasets, accompanied by a cartoon wolf character.

Prerequisites:

Working in a Jupyter Notebook or similar environment.
pandas installed: https://pandas.pydata.org/
Data files loaded or available in a data folder.

Scenario

It’s Oct 1 and September orders have arrived in your data folder. To answer:
- Top 3 products by revenue — requires orders + products.
- Top 3 customers by spend — requires orders + products + customers.
Goal: produce a single orders table with customer_id and product_id replaced by human-readable customer_name and product_name, plus a line_total column computed as quantity × price.

High-level process

Locate incoming files and consult the ingest log to prevent duplicate processing.
Clean raw orders (drop invalid rows, fix date formats) and save the cleaned output.
Rename conflicting columns in reference tables (products, customers).
Merge product and customer reference data into cleaned orders.
Compute line_total and create aggregation summaries for reporting.

File discovery and ingest-log check

Start by ensuring your folder structure exists and check whether the orders file was already ingested. This prevents duplicate processing and accidental reprocessing.

# python
import os
import pandas as pd
from datetime import datetime

data_folder = "data"
archive_folder = os.path.join(data_folder, "archive")
insights_folder = "insights"
logs_folder = "logs"
log_path = os.path.join(logs_folder, "ingest_log.csv")

# Ensure folders exist
for folder in [data_folder, archive_folder, insights_folder, logs_folder]:
    os.makedirs(folder, exist_ok=True)
print("Folder structure and paths set up.")

# Find the orders file in the data folder
files = os.listdir(data_folder)
file_name = next((f for f in files if "orders" in f.lower()), None)

if not file_name:
    print("❌ No orders file found.")
else:
    file_path = os.path.join(data_folder, file_name)
    file_id = os.path.splitext(file_name)[0]
    print(f"✅ Found file: '{file_name}'")

    # Check ingest log for duplicates
    if os.path.exists(log_path):
        log = pd.read_csv(log_path)
        if file_name in log["file_name"].values:
            print(f"❌ File '{file_name}' already ingested - skipping.")
        else:
            print(f"File '{file_name}' not ingested before - proceed to next step.")
    else:
        print("No ingest log found — proceed to ingestion.")

Resolve column conflicts before merging

When merging lookup/reference tables into fact tables, shared column names like name can cause overwrites or confusing duplicated columns. Rename these fields in reference tables first.

Rename columns early (e.g., name → product_name / customer_name) so merges produce predictable columns and you avoid accidental overwrites.

The image shows a man standing in front of a computer screen displaying a Jupyter Notebook interface with a list of CSV files and a script for data transformation tasks.

Cleaning orders

Remove rows with invalid customer_id or product_id, record dropped rows for auditing, and save the cleaned orders file. This pattern preserves traceability and keeps your downstream joins reliable.

# python
# Assumptions: orders_raw, customers, products are pre-loaded DataFrames
output_folder = "output"
os.makedirs(output_folder, exist_ok=True)

# Work on a copy to preserve raw data
orders = orders_raw.copy()

# Identify invalid customer_id rows
valid_customer_ids = set(customers["customer_id"].unique())
invalid_customer_mask = ~orders["customer_id"].isin(valid_customer_ids)
invalid_customer_rows = orders[invalid_customer_mask]

# Identify invalid product_id rows
valid_product_ids = set(products["product_id"].unique())
invalid_product_mask = ~orders["product_id"].isin(valid_product_ids)
invalid_product_rows = orders[invalid_product_mask]

# Combine invalid rows (avoid duplicates if a row fails both checks)
dropped_rows = pd.concat([invalid_customer_rows, invalid_product_rows]).drop_duplicates(subset=["order_id"])

# Report and save dropped rows
if not dropped_rows.empty:
    dropped_path = os.path.join(output_folder, "orders_dropped.csv")
    dropped_rows.to_csv(dropped_path, index=False)
    print(f"✗ Removed {len(dropped_rows)} rows; saved details to: {dropped_path}")
else:
    print("✅ No rows were dropped due to invalid customer_id or product_id.")

# Remove invalid rows from orders
orders_clean = orders.loc[~orders["order_id"].isin(dropped_rows["order_id"])].reset_index(drop=True)

# (Optional) Drop helper columns, parse dates, etc.
if "order_date_parsed" in orders_clean.columns:
    orders_clean = orders_clean.drop(columns=["order_date_parsed"])

# Save cleaned orders
cleaned_path = os.path.join(output_folder, "orders_clean.csv")
orders_clean.to_csv(cleaned_path, index=False)
print(f"Saved cleaned orders to: {cleaned_path}")

Rename reference columns (examples)

Keep a consistent naming pattern to make downstream analysis and dashboards easier to read.

Original column	Recommended rename
`name` (in products)	`product_name`
`name` (in customers)	`customer_name`
`id` or `product_id`	`product_id` (keep consistent across datasets)

Merge, compute, and preview

Merge reference data into the cleaned orders with left joins so you retain all order rows and attach product and customer attributes. Then compute a line_total column.

# python
# Load cleaned orders (if not already in memory)
orders = pd.read_csv(cleaned_path)

# Rename 'name' column in products and customers to avoid conflict
products_renamed = products.rename(columns={"name": "product_name"})
customers_renamed = customers.rename(columns={"name": "customer_name"})

# Merge product info into orders (left join preserves all orders)
orders = orders.merge(
    products_renamed[["product_id", "product_name", "price"]],
    on="product_id",
    how="left"
)

# Merge customer info into orders
orders = orders.merge(
    customers_renamed[["customer_id", "customer_name"]],
    on="customer_id",
    how="left"
)

# Calculate line total
orders["line_total"] = orders["quantity"] * orders["price"]

# Preview selected columns (first 10 rows)
orders_preview = orders[["order_id", "customer_name", "product_name", "quantity", "price", "line_total"]].head(10)
print("Order Calculations (first 10 rows):")
print(orders_preview.to_string(index=False))

Aggregate to answer business questions

Use groupby + sum to compute total revenue by product and total spend by customer, then sort and select the top 3.

# python
# Top 3 products by total revenue
top_products = (
    orders.groupby("product_id", as_index=False)["line_total"]
    .sum()
    .rename(columns={"line_total": "total_revenue"})
    .merge(products_renamed[["product_id", "product_name"]], on="product_id", how="left")
    .sort_values("total_revenue", ascending=False)
    .head(3)
)

# Top 3 customers by total spend
top_customers = (
    orders.groupby("customer_id", as_index=False)["line_total"]
    .sum()
    .rename(columns={"line_total": "total_spend"})
    .merge(customers_renamed[["customer_id", "customer_name"]], on="customer_id", how="left")
    .sort_values("total_spend", ascending=False)
    .head(3)
)

print("\nTop 3 Products by Revenue:")
print(top_products[["product_name", "total_revenue"]].to_string(index=False))

print("\nTop 3 Customers by Spend:")
print(top_customers[["customer_name", "total_spend"]].to_string(index=False))

Best practices recap

Rename fields early to avoid column name collisions when joining data.
Use left joins to retain all orders while bringing in reference data.
Validate foreign keys (customer_id, product_id) and record dropped rows for traceability.
Use grouping and aggregation to summarize revenue and spend for reporting or visualization.
Keep consistent column naming across your pipeline for maintainability.

Always check the ingest log and archive processed files to prevent duplicate ingestions and ensure idempotent pipelines.

Next steps

Hands-on practice: run this workflow on a sample month of orders and verify the audit trail (dropped rows, ingest log).
Extend: add date-based rollups (daily/monthly revenue), or create a dashboard-ready summary table.

Links and references

pandas documentation
Jupyter
For more on data validation and pipeline design, consult platform-specific guides (Airflow, Prefect) and your organization’s ETL standards.

Introduction

Ingesting Data

Transforming Data - Cleaning

Transforming Data - Combining

Automation and Orchestration

Serving Data

Transforming Data Combining

Watch Video