Guide to cleaning and merging order, product, and customer tables with pandas to compute line totals and produce top product and customer revenue summaries
You’ve ingested and cleaned separate source files, but the business questions you need to answer require a single, enriched table. This lesson shows how to combine multiple tables, resolve column conflicts, compute derived fields, and produce summary tables for reporting and analysis using pandas.What you’ll learn:
Rename conflicting fields to avoid ambiguity when merging datasets.
Merge datasets (orders, products, customers) into a single, analyzable table.
Compute derived metrics (line totals) and produce aggregation summaries (top products/customers).
Prerequisites:
Working in a Jupyter Notebook or similar environment.
It’s Oct 1 and September orders have arrived in your data folder. To answer:
Top 3 products by revenue — requires orders + products.
Top 3 customers by spend — requires orders + products + customers.
Goal: produce a single orders table with customer_id and product_id replaced by human-readable customer_name and product_name, plus a line_total column computed as quantity × price.
High-level process
Locate incoming files and consult the ingest log to prevent duplicate processing.
Clean raw orders (drop invalid rows, fix date formats) and save the cleaned output.
Rename conflicting columns in reference tables (products, customers).
Merge product and customer reference data into cleaned orders.
Compute line_total and create aggregation summaries for reporting.
File discovery and ingest-log check
Start by ensuring your folder structure exists and check whether the orders file was already ingested. This prevents duplicate processing and accidental reprocessing.
# pythonimport osimport pandas as pdfrom datetime import datetimedata_folder = "data"archive_folder = os.path.join(data_folder, "archive")insights_folder = "insights"logs_folder = "logs"log_path = os.path.join(logs_folder, "ingest_log.csv")# Ensure folders existfor folder in [data_folder, archive_folder, insights_folder, logs_folder]: os.makedirs(folder, exist_ok=True)print("Folder structure and paths set up.")# Find the orders file in the data folderfiles = os.listdir(data_folder)file_name = next((f for f in files if "orders" in f.lower()), None)if not file_name: print("❌ No orders file found.")else: file_path = os.path.join(data_folder, file_name) file_id = os.path.splitext(file_name)[0] print(f"✅ Found file: '{file_name}'") # Check ingest log for duplicates if os.path.exists(log_path): log = pd.read_csv(log_path) if file_name in log["file_name"].values: print(f"❌ File '{file_name}' already ingested - skipping.") else: print(f"File '{file_name}' not ingested before - proceed to next step.") else: print("No ingest log found — proceed to ingestion.")
Resolve column conflicts before merging
When merging lookup/reference tables into fact tables, shared column names like name can cause overwrites or confusing duplicated columns. Rename these fields in reference tables first.
Rename columns early (e.g., name → product_name / customer_name) so merges produce predictable columns and you avoid accidental overwrites.
Cleaning orders
Remove rows with invalid customer_id or product_id, record dropped rows for auditing, and save the cleaned orders file. This pattern preserves traceability and keeps your downstream joins reliable.
# python# Assumptions: orders_raw, customers, products are pre-loaded DataFramesoutput_folder = "output"os.makedirs(output_folder, exist_ok=True)# Work on a copy to preserve raw dataorders = orders_raw.copy()# Identify invalid customer_id rowsvalid_customer_ids = set(customers["customer_id"].unique())invalid_customer_mask = ~orders["customer_id"].isin(valid_customer_ids)invalid_customer_rows = orders[invalid_customer_mask]# Identify invalid product_id rowsvalid_product_ids = set(products["product_id"].unique())invalid_product_mask = ~orders["product_id"].isin(valid_product_ids)invalid_product_rows = orders[invalid_product_mask]# Combine invalid rows (avoid duplicates if a row fails both checks)dropped_rows = pd.concat([invalid_customer_rows, invalid_product_rows]).drop_duplicates(subset=["order_id"])# Report and save dropped rowsif not dropped_rows.empty: dropped_path = os.path.join(output_folder, "orders_dropped.csv") dropped_rows.to_csv(dropped_path, index=False) print(f"✗ Removed {len(dropped_rows)} rows; saved details to: {dropped_path}")else: print("✅ No rows were dropped due to invalid customer_id or product_id.")# Remove invalid rows from ordersorders_clean = orders.loc[~orders["order_id"].isin(dropped_rows["order_id"])].reset_index(drop=True)# (Optional) Drop helper columns, parse dates, etc.if "order_date_parsed" in orders_clean.columns: orders_clean = orders_clean.drop(columns=["order_date_parsed"])# Save cleaned orderscleaned_path = os.path.join(output_folder, "orders_clean.csv")orders_clean.to_csv(cleaned_path, index=False)print(f"Saved cleaned orders to: {cleaned_path}")
Rename reference columns (examples)
Keep a consistent naming pattern to make downstream analysis and dashboards easier to read.
Original column
Recommended rename
name (in products)
product_name
name (in customers)
customer_name
id or product_id
product_id (keep consistent across datasets)
Merge, compute, and preview
Merge reference data into the cleaned orders with left joins so you retain all order rows and attach product and customer attributes. Then compute a line_total column.
# python# Load cleaned orders (if not already in memory)orders = pd.read_csv(cleaned_path)# Rename 'name' column in products and customers to avoid conflictproducts_renamed = products.rename(columns={"name": "product_name"})customers_renamed = customers.rename(columns={"name": "customer_name"})# Merge product info into orders (left join preserves all orders)orders = orders.merge( products_renamed[["product_id", "product_name", "price"]], on="product_id", how="left")# Merge customer info into ordersorders = orders.merge( customers_renamed[["customer_id", "customer_name"]], on="customer_id", how="left")# Calculate line totalorders["line_total"] = orders["quantity"] * orders["price"]# Preview selected columns (first 10 rows)orders_preview = orders[["order_id", "customer_name", "product_name", "quantity", "price", "line_total"]].head(10)print("Order Calculations (first 10 rows):")print(orders_preview.to_string(index=False))
Aggregate to answer business questions
Use groupby + sum to compute total revenue by product and total spend by customer, then sort and select the top 3.
# python# Top 3 products by total revenuetop_products = ( orders.groupby("product_id", as_index=False)["line_total"] .sum() .rename(columns={"line_total": "total_revenue"}) .merge(products_renamed[["product_id", "product_name"]], on="product_id", how="left") .sort_values("total_revenue", ascending=False) .head(3))# Top 3 customers by total spendtop_customers = ( orders.groupby("customer_id", as_index=False)["line_total"] .sum() .rename(columns={"line_total": "total_spend"}) .merge(customers_renamed[["customer_id", "customer_name"]], on="customer_id", how="left") .sort_values("total_spend", ascending=False) .head(3))print("\nTop 3 Products by Revenue:")print(top_products[["product_name", "total_revenue"]].to_string(index=False))print("\nTop 3 Customers by Spend:")print(top_customers[["customer_name", "total_spend"]].to_string(index=False))
Best practices recap
Rename fields early to avoid column name collisions when joining data.
Use left joins to retain all orders while bringing in reference data.
Validate foreign keys (customer_id, product_id) and record dropped rows for traceability.
Use grouping and aggregation to summarize revenue and spend for reporting or visualization.
Keep consistent column naming across your pipeline for maintainability.
Always check the ingest log and archive processed files to prevent duplicate ingestions and ensure idempotent pipelines.
Next steps
Hands-on practice: run this workflow on a sample month of orders and verify the audit trail (dropped rows, ingest log).
Extend: add date-based rollups (daily/monthly revenue), or create a dashboard-ready summary table.