When preparing numerical data for machine learning, one of the fundamental preprocessing steps is scaling. Feature scaling makes numeric attributes comparable so that models don’t give undue importance to features that simply have larger numeric ranges. This guide covers the three most common scaling techniques:
  • Min–Max scaling (feature-wise)
  • Normalization (row-wise / unit norm)
  • Standardization (z-score, feature-wise)
We explain what each method does, when to use it, and include practical scikit-learn examples.

Why scale?
  • Algorithms that rely on distances (k-NN, k-means, SVM) or gradient-based optimization often perform better when features are on similar scales.
  • Without scaling, a feature with large numeric values (e.g., house size in square feet) can dominate another feature with smaller numeric ranges (e.g., number of bedrooms), even if both are equally informative.
Use case example: house price prediction with numeric features that differ in magnitude — house size (hundreds to thousands) and number of bedrooms (1–10). Scaling ensures no single feature dominates the model simply because of its numeric range.
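To make the dominance effect concrete, here is a minimal sketch using hypothetical house values and assumed feature ranges (size 500–3000 sq ft, bedrooms 1–10). It compares Euclidean distances between two houses before and after min–max scaling:

```python
import numpy as np

# Two hypothetical houses: [size_sqft, bedrooms]
a = np.array([1500.0, 3.0])
b = np.array([1520.0, 6.0])  # similar size, twice the bedrooms

# Unscaled distance is driven almost entirely by the size gap (20 sq ft),
# even though the bedroom gap (3) may be the more meaningful difference.
unscaled = np.linalg.norm(a - b)

# Min-max scale each feature by hand using the assumed ranges above;
# now both features contribute on comparable terms.
a_s = np.array([(1500 - 500) / 2500, (3 - 1) / 9])
b_s = np.array([(1520 - 500) / 2500, (6 - 1) / 9])
scaled = np.linalg.norm(a_s - b_s)

print(unscaled)  # ~20.22, dominated by square footage
print(scaled)    # ~0.33, dominated by the bedroom difference
```

After scaling, the bedroom difference drives the distance instead of being drowned out by raw square footage.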
A presentation slide titled "Scaling" showing the min–max scaling formula on the left. On the right are two feature panels for "House Size" and "No. of Bedrooms" with slider-like buttons showing normalized values 0.0, 0.25, 0.50, 0.75, and 1.0.

Min–Max Scaling (Feature-wise)

Min–Max scaling rescales each feature independently to a fixed range, typically [0, 1]. This preserves the relationships among the original values but bounds them.
  • Formula: x’ = (x − min(x)) / (max(x) − min(x))
  • Operates column-wise (feature-wise)
  • Use case: When you need bounded inputs or your algorithm is sensitive to absolute value ranges (k-NN, SVM, neural networks with activation functions)
Example (scikit-learn MinMaxScaler):
from sklearn.preprocessing import MinMaxScaler

data = [[1], [2], [3], [4], [5]]  # Example single-feature values
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

# Output:
# [[0.  ]
#  [0.25]
#  [0.5 ]
#  [0.75]
#  [1.  ]]
Use Min–Max when you want all features to share a common bounded scale (for instance, when feeding input into models or visualizations that assume 0–1 ranges).
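One practical caveat worth illustrating: fit the scaler on training data only, then reuse the learned min/max on new data. A small sketch with illustrative values:

```python
from sklearn.preprocessing import MinMaxScaler

train = [[1], [2], [3], [4], [5]]
test = [[2.5], [6]]  # 6 lies outside the training range

# Fit on training data only, so the test set cannot leak into the
# learned min/max statistics.
scaler = MinMaxScaler()
scaler.fit(train)

print(scaler.transform(test))
# A value above the training max maps above 1.0: MinMaxScaler does not
# clip by default (pass clip=True if strictly bounded output is required).
```

Here 2.5 maps to 0.375 and 6 maps to 1.25, showing that "bounded [0, 1]" only holds for values inside the fitted range.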

Normalization (Row-wise, Unit Norm)

Normalization rescales each sample (row) to unit length (norm = 1). This is a row-wise operation and is useful when vector direction matters more than magnitude — for example, cosine-similarity comparisons in NLP or when working with TF-IDF vectors.
  • Formula: x’ = x / ||x|| where ||x|| is typically the Euclidean (L2) norm
  • Operates row-wise (sample-wise)
  • Use case: Sparse data, text (TF-IDF), or any vector-space model where magnitude is irrelevant and only direction matters
A slide titled "Normalization Scaling" explaining that normalization scales each data row to length 1 and is used for sparse data (e.g., NLP, image pixels). It also shows the formula x' = x / ||x|| and notes dividing each value by the row's Euclidean norm.
How the Euclidean norm works:
  • For a row [200000, 3], the L2 norm is sqrt(200000^2 + 3^2) ≈ 200000.0000225, which is essentially 200000 because the bedroom term is negligible.
  • Dividing each element by that norm produces a unit-length vector (sum of squared components = 1)
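The arithmetic above can be reproduced by hand in a few lines:

```python
import math

# Normalize the row [200000, 3] by its own Euclidean (L2) norm.
row = [200000.0, 3.0]
norm = math.sqrt(sum(x ** 2 for x in row))  # ~200000.0000225
unit = [x / norm for x in row]

print(unit)                       # ~[0.99999..., 0.000015]: size dominates the direction
print(sum(x ** 2 for x in unit))  # squared components sum to 1 (unit length)
```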
Example (scikit-learn Normalizer):
import pandas as pd
from sklearn.preprocessing import Normalizer

# Example data: [house_price, num_bedrooms]
data = pd.DataFrame({
    'house_price': [200000, 300000, 400000, 500000, 600000],
    'num_bedrooms': [3, 4, 5, 6, 7]
})

normalizer = Normalizer()
normalized_data = normalizer.fit_transform(data)
normalized_df = pd.DataFrame(normalized_data, columns=data.columns)

print("Original data:\n", data)
print("\nNormalizer normalized data:\n", normalized_df)
Normalization is not the same as min–max scaling. Normalization rescales rows (samples) to unit length; min–max rescales features (columns) to a fixed range.
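The row-versus-column distinction is easy to see by running both transformers on the same small array (illustrative values):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

data = np.array([[200000.0, 3.0],
                 [400000.0, 5.0]])

# MinMaxScaler works per COLUMN: each feature's min maps to 0, its max to 1.
print(MinMaxScaler().fit_transform(data))
# [[0. 0.]
#  [1. 1.]]

# Normalizer works per ROW: each sample is divided by its own L2 norm,
# so every row ends up with unit length.
normalized = Normalizer().fit_transform(data)
print(np.linalg.norm(normalized, axis=1))  # [1. 1.]
```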

Standardization (Z-score, Feature-wise)

Standardization centers each feature on zero mean and scales to unit variance. This is a column-wise transformation and is often the default preprocessing for many statistical and machine learning algorithms.
  • Formula: z = (x − μ) / σ where μ is the feature mean and σ is the feature standard deviation
  • Operates column-wise (feature-wise)
  • Use case: Models that assume centered inputs or benefit from normalized variance (linear/logistic regression, SVM, PCA, gradient-based methods)
A presentation slide titled "Standardization Formula" showing the z-score equation z = (x − μ) / σ. It also lists that μ is the mean of the feature and σ is its standard deviation.
Why standardize?
  • Improves convergence for gradient-based optimizers
  • Prevents features with larger numeric scales from dominating models
  • Makes feature variances comparable
Example (scikit-learn StandardScaler):
from sklearn.preprocessing import StandardScaler

data = [[1], [2], [3], [4], [5]]  # Example single-feature values
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print("Standardized data:\n", standardized_data)
After standardization, most values are typically within a few standard deviations of the mean (not strictly limited to [-1, 1]).
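We can verify both properties numerically: the transformed feature has mean 0 and standard deviation 1, yet individual values extend beyond [-1, 1]:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[1], [2], [3], [4], [5]]
standardized = StandardScaler().fit_transform(data)

print(standardized.ravel())  # roughly [-1.41, -0.71, 0, 0.71, 1.41]
print(standardized.mean())   # ~0.0
print(standardized.std())    # ~1.0 (StandardScaler uses the population std)
```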
A slide titled "Understanding Standardization" showing two tables of house size and bedroom counts before and after standardization. The right table lists standardized z-scores and notes both features have mean 0 and standard deviation 1.
Standardization and the normal distribution:
  • For approximately normal features: ~68% of values fall within ±1σ, ~95% within ±2σ, and ~99.7% within ±3σ.
  • Centering helps many algorithms behave more predictably and improves numerical stability.
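The 68/95/99.7 empirical rule can be checked empirically by sampling from a standard normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)  # samples from a standard normal

# Fraction of samples inside each sigma band; should approach 68%, 95%, 99.7%.
for k in (1, 2, 3):
    frac = np.mean(np.abs(z) <= k)
    print(f"within ±{k}σ: {frac:.3f}")
```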
A bell-shaped normal distribution chart titled "Understanding Standardization" showing the ±1σ, ±2σ, and ±3σ intervals with segment percentages (34.1%, 13.6%, 2.1%, 0.1%). It highlights that about 68% of values fall within ±1σ of the mean.

Quick Comparison: Scaling vs Normalization vs Standardization

| Method | Goal | Focus | Output range | When to use |
| --- | --- | --- | --- | --- |
| Min–Max Scaling | Rescale features to a fixed range | Column-wise (feature) | Bounded, typically [0, 1] | When bounded inputs are required or for algorithms sensitive to absolute ranges (k-NN, SVM, neural nets) |
| Normalization (unit norm) | Scale each sample to unit length (&#124;&#124;x&#124;&#124; = 1) | Row-wise (sample) | Usually values in [0, 1] for non-negative data; interpreted as direction | For sparse or text data (TF-IDF) and when cosine similarity/direction matters |
| Standardization (z-score) | Center features to mean 0 and scale to unit variance | Column-wise (feature) | Unbounded, usually within a few σ of the mean | For algorithms that assume centered inputs or use variance information (linear/logistic regression, SVM, PCA) |
A slide titled "Comparison" showing a table that compares three preprocessing techniques—Scaling, Normalization, and Standardization—by their goal, range, focus (columns or rows), and when to use them. The table highlights differences like scaling to a specific range (e.g., 0–1), normalization giving row unit length, and standardization centering data to mean=0, std=1.

Summary & Recommendations

  • Choose Min–Max scaling when you need bounded features (0–1) or are feeding values to models sensitive to absolute ranges.
  • Use Normalization (unit norm) when working with sparse vectors or text (TF-IDF), and when only vector direction matters.
  • Prefer Standardization for algorithms that assume zero-mean or when stabilizing and speeding up gradient-based training.
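In practice, bundling the chosen scaler with the model in a scikit-learn Pipeline keeps the workflow leakage-free, because the scaler is fit only on training data. A minimal sketch, using a synthetic dataset in place of real house features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for real features of mixed scales.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline fits the scaler on the training data only, then applies
# the same learned statistics when scoring the test set.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The same pattern works with MinMaxScaler or Normalizer as the first step; swap in whichever transformer matches your data.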
This lesson covered min–max scaling, normalization, and standardization: how they work, why they matter, and practical examples using scikit-learn. Choose the technique that matches your data's characteristics and your algorithm's sensitivity to feature scales.
