Feature engineering is the process of transforming prepared dataset columns into representations that are more useful for a chosen machine learning algorithm. It sits between dataset preparation (cleaning, imputation) and model training. Good feature engineering can improve predictive performance, reduce training time, and make models more robust.
Slide: "Feature Engineering – Introduction" (KodeKloud, Theory).
We will show practical examples using pandas and scikit-learn and demonstrate how to scale these transformations using SageMaker Processing jobs. Typical feature-engineering tasks include encoding categorical variables, selecting or creating features, transforming skewed data, aggregating information, and scaling numeric inputs.
Slide: "Agenda" — Problem (prepared data may not be enough), Solution (apply feature engineering), Workflow (pandas, scikit-learn, SageMaker Processing jobs), Results (better model performance and faster training).

Why feature engineering matters

Even after basic cleaning (e.g., filling missing values), raw data often needs further transformation:
  • Irrelevant or redundant features slow training and may reduce model quality.
  • Noise or outliers can bias learning and harm generalization.
  • Categorical features must be encoded in ways that reflect their semantics (ordinal vs nominal). Poor choices can mislead models.
  • Strongly skewed numeric inputs can violate algorithmic assumptions and reduce performance.
Slide: "Problem: Raw Data May Be Inadequate for ML" — skewed, biased data feeding an ML model, with impacts such as violated assumptions and biased predictions.

Common feature-engineering activities

| Activity | Purpose | Typical methods / tools |
| --- | --- | --- |
| Feature selection | Remove irrelevant or redundant inputs to speed training and avoid overfitting | Correlation analysis, feature importance, recursive feature elimination |
| Categorical encoding | Represent categorical data numerically | One-hot, ordinal, target encoding, hashing, embeddings |
| Feature extraction | Create informative features from raw values | Date parts, text lengths, tokenization, embeddings |
| Feature transformation | Reduce skew and stabilize variance | log, sqrt, power transforms, Box-Cox |
| Feature interactions | Capture multiplicative or composite effects | Multiplication, concatenation, polynomial features |
| Aggregation / grouping | Add group-level statistics | groupby / pivot_table (mean, sum, count) |
| Scaling / normalization | Put features on comparable scales | StandardScaler, MinMaxScaler |
When features have very high cardinality (for example, postal codes or item IDs), prefer techniques that limit dimensionality (target encoding, hashing, or learned embeddings) instead of naive one-hot encoding, which explodes feature count.
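As a concrete illustration of the hashing approach, scikit-learn's FeatureHasher maps any number of distinct categories into a fixed number of columns. This is a minimal sketch; the `item_id` values are made up:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical high-cardinality column: item IDs
rows = [{"item_id": "SKU-10293"}, {"item_id": "SKU-88412"}, {"item_id": "SKU-10293"}]

# Hash each category into a fixed 8-dimensional space instead of one column per ID;
# string values are hashed as "item_id=<value>" with weight +/-1
hasher = FeatureHasher(n_features=8, input_type="dict")
hashed = hasher.transform(rows).toarray()

print(hashed.shape)  # (3, 8) regardless of how many distinct IDs exist
```

Identical categories always hash to the same column, so rows 0 and 2 get identical encodings; the trade-off is that unrelated categories can occasionally collide.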

Hands-on examples (pandas + scikit-learn)

Start by loading a small sample dataset into a pandas DataFrame for local experimentation:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    "house_size": [1500, 1800, 1200, np.nan, 2000],
    "num_bedrooms": [3, 4, 2, 3, np.nan],
    "city": ["New York", "San Francisco", "New York", "Chicago", "San Francisco"],
    "year_built": [2000, 1995, 2010, 2005, 1998],
    "price": [500000, 700000, 350000, 450000, 750000]
}

df = pd.DataFrame(data)
print("Original Data:\n", df)

Encoding categorical variables

  • One-hot encoding expands nominal categories into binary columns (useful for low-cardinality categorical variables).
  • Ordinal encoding maps categories to integers when a natural order exists.
  • For dates, extract year/month/dayofweek with dt accessor.
  • For text, use length, token counts, or embeddings depending on needs.
# One-hot encoding a nominal column
df_onehot = pd.get_dummies(df, columns=["city"], prefix="city")
print(df_onehot.head())

# Label encoding maps categories to integer codes. Use it only when the
# category order is meaningful; "city" is nominal, so this is purely for illustration
df["city_label"] = df["city"].astype("category").cat.codes
print(df[["city", "city_label"]].head())

# Date example (if you had a 'sale_date' column)
# df['sale_date'] = pd.to_datetime(df['sale_date'])
# df['sale_year'] = df['sale_date'].dt.year
# Text feature example (if you had 'description')
# df['description_length'] = df['description'].str.len()
Be careful with target encoding: if you encode categories using information from the target without proper cross-validation or out-of-fold strategies, you can leak label information and inflate evaluation metrics.
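One leakage-safe pattern is out-of-fold target encoding: compute each category's mean target on the training folds only, then apply it to the held-out fold. This is a minimal sketch with made-up data:

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": ["NY", "SF", "NY", "CHI", "SF", "NY"],
    "price": [500_000, 700_000, 350_000, 450_000, 750_000, 520_000],
})

df["city_target_enc"] = float("nan")
global_mean = df["price"].mean()
kf = KFold(n_splits=3, shuffle=True, random_state=0)

for train_idx, val_idx in kf.split(df):
    # Category means computed on the training fold only...
    fold_means = df.iloc[train_idx].groupby("city")["price"].mean()
    # ...applied to the held-out fold, falling back to the global mean
    # for categories unseen in the training fold
    df.loc[df.index[val_idx], "city_target_enc"] = (
        df.iloc[val_idx]["city"].map(fold_means).fillna(global_mean).values
    )

print(df[["city", "city_target_enc"]])
```

Each row's encoding never uses that row's own label, which is what prevents the inflated validation scores mentioned above.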

Feature transformations and interactions

  • Use log or sqrt transforms to reduce skew and moderate the influence of extreme values.
  • Create interaction features to capture multiplicative or combined effects.
# Log transform price to reduce skew (use log1p to handle zero safely)
df['log_price'] = np.log1p(df['price'])

# Square root transform (example on house_size)
df['sqrt_house_size'] = np.sqrt(df['house_size'])

# Interaction feature: house size multiplied by number of bedrooms
df['size_bed_interaction'] = df['house_size'] * df['num_bedrooms']

Aggregations and grouping

Group-level statistics can be powerful features (e.g., mean price by city). Use groupby or pivot_table and merge results back into the main DataFrame.
# Group by city and compute mean price
city_price_mean = (
    df.groupby('city')
      .agg({'price': 'mean'})
      .rename(columns={'price': 'mean_price_by_city'})
      .reset_index()
)

# Merge the aggregated feature back into df
df = df.merge(city_price_mean, on='city', how='left')
print(df[['city', 'mean_price_by_city']])
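When the aggregated statistic is only needed as a column in the same DataFrame, `groupby(...).transform(...)` achieves the same result without a separate merge step. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "SF", "NY", "SF"],
    "price": [500_000, 700_000, 350_000, 750_000],
})

# transform broadcasts the group statistic back onto the original rows,
# so the result aligns with df's index directly
df["mean_price_by_city"] = df.groupby("city")["price"].transform("mean")
print(df)
```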

Derived features, missing-value handling, dropping redundant columns, and scaling

Create derived features such as age, handle missing values with imputation strategies, drop original columns if redundant, compute ratios like price per square foot, and scale numeric columns for algorithms that require normalized inputs.
from sklearn.preprocessing import StandardScaler

# Derive house age
df["house_age"] = 2024 - df["year_built"]
print("\nAfter Adding House Age Feature:\n", df[["year_built", "house_age"]])

# Handle missing values by imputing with the median
# (assignment avoids the chained-assignment pitfalls of inplace=True on a column)
df["house_size"] = df["house_size"].fillna(df["house_size"].median())
df["num_bedrooms"] = df["num_bedrooms"].fillna(df["num_bedrooms"].median())

# Drop redundant original column if we keep 'house_age'
df.drop(columns=["year_built"], inplace=True)

# Create price per square foot
df["price_per_sqft"] = df["price"] / df["house_size"]

# Standardize a numeric column (zero mean, unit variance)
scaler_standard = StandardScaler()
df["house_size_standardized"] = scaler_standard.fit_transform(df[["house_size"]])

print("\nFinal DataFrame:\n", df.head())

Where to execute feature-engineering code at scale

  • Local development: pandas + scikit-learn on a developer laptop is ideal for prototyping and experiments.
  • Production-scale or repeatable pipelines: use managed compute and orchestration. SageMaker Processing is a common choice to run these transformations on managed instances and produce reproducible outputs.
SageMaker Processing jobs require:
  • A container (framework image, e.g., scikit-learn)
  • A processing script that reads inputs, transforms data, and writes outputs
  • Instance type and count
  • IAM role with appropriate permissions
Example: launch a scikit-learn-based SageMaker Processing job with the SageMaker SDK:
from sagemaker.sklearn import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Replace with an appropriate IAM role for SageMaker
sagemaker_role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1",    # scikit-learn container version
    role=sagemaker_role,
    instance_type="ml.m5.large",
    instance_count=1,
    base_job_name="feature-engineering-job"
)

# Run a processing job that executes 'preprocessing.py' in the provided source directory
sklearn_processor.run(
    code="preprocessing.py",  # your preprocessing script that reads input, transforms, writes output
    source_dir="src",         # directory containing preprocessing.py and dependencies
    inputs=[
        ProcessingInput(source="s3://your-bucket/input-data/", destination="/opt/ml/processing/input")
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output", destination="s3://your-bucket/output-data/")
    ],
    wait=True
)
| Parameter | Purpose | Example |
| --- | --- | --- |
| framework_version | Version of the SageMaker scikit-learn container | "1.2-1" |
| instance_type | Compute instance for processing | "ml.m5.large" |
| instance_count | Number of instances to run in parallel | 1 or more |
| code / source_dir | Script and code dependencies | preprocessing.py in src/ |
| inputs / outputs | S3 or local paths for input and output artifacts | S3 paths for raw and transformed data |
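The referenced preprocessing.py could look like the sketch below. The container mount paths come from the ProcessingInput/ProcessingOutput configuration above; the column names and transformations are illustrative placeholders:

```python
# preprocessing.py — sketch of a SageMaker Processing entry point
import os

import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Example feature-engineering steps; adapt to your schema
    df = df.copy()
    if "house_size" in df.columns:
        df["house_size"] = df["house_size"].fillna(df["house_size"].median())
    if "city" in df.columns:
        df = pd.get_dummies(df, columns=["city"], prefix="city")
    return df


def main(input_dir: str, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    # Process every CSV the job mounted into the input directory
    for name in os.listdir(input_dir):
        if name.endswith(".csv"):
            df = pd.read_csv(os.path.join(input_dir, name))
            transform(df).to_csv(os.path.join(output_dir, name), index=False)


if __name__ == "__main__" and os.path.isdir("/opt/ml/processing/input"):
    main("/opt/ml/processing/input", "/opt/ml/processing/output")
```

Anything written under /opt/ml/processing/output is uploaded to the ProcessingOutput S3 destination when the job completes.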
The SageMaker Processing console provides monitoring and logs for jobs and is useful for debugging and auditing runs.
Screenshot: the SageMaker console "Processing jobs" page, showing a listed job for monitoring and log access.

Summary and best practices

  • Feature engineering is essential to present the most informative inputs to a model and usually improves performance and convergence speed.
  • Choose encodings and transformations intentionally—consider domain knowledge, algorithm assumptions, and feature cardinality.
  • Prototype locally with pandas and scikit-learn; scale production jobs with SageMaker Processing or other managed services.
  • Always validate feature changes with proper cross-validation and monitor for data leakage (especially with target-based encodings).
