Encoding is the final data-transformation technique covered here that relies on basic mathematical ideas. Encoding converts categorical (text) features into numeric values so most machine learning algorithms (which operate on numbers) can use them. For example, categorical color labels like “red” or “green” must be converted into numeric representations such as 0, 1, 2 before training. Why encoding matters:
  • Machine learning models require numeric input.
  • Different encoding strategies introduce different assumptions (e.g., order vs. independence).
  • Choosing the right encoding affects model performance, dimensionality, and risk of leakage.

Label encoding

Label encoding assigns a unique integer to every category in a feature. Consider a Neighborhood feature with values “Downtown”, “Suburbs”, and “Rural”. Label encoding might map these to 1, 2, and 3 respectively:
[Figure: "Label Encoded" diagram mapping the neighborhood categories Downtown, Suburbs, and Rural to the numeric labels 1, 2, and 3.]
Example mapping:
  • Downtown → 1
  • Suburbs → 2
  • Rural → 3
After encoding you can drop the original categorical column and keep the numeric labels for training. Pros:
  • Simple and compact (single column).
Cons:
  • Imposes an ordinal relationship (1 < 2 < 3) that may not be meaningful. Models could interpret the numeric order as a ranking or distance, biasing predictions.
When to use:
  • When the categorical variable is ordinal (has a meaningful order), or when the algorithm you use can handle nominal labels without misinterpreting ordering.
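A minimal sketch of label encoding with pandas. The explicit mapping below mirrors the Neighborhood example above; note that scikit-learn's LabelEncoder assigns integers alphabetically instead, so a hand-written mapping gives more control when the order is meant to be meaningful:

```python
import pandas as pd

# Illustrative data (values follow the Neighborhood example above)
df = pd.DataFrame({'Neighborhood': ['Downtown', 'Suburbs', 'Rural', 'Downtown']})

# An explicit mapping keeps control over which integer each category gets,
# which matters when the encoded order should reflect a real ordering
mapping = {'Downtown': 1, 'Suburbs': 2, 'Rural': 3}
df['NeighborhoodLabel'] = df['Neighborhood'].map(mapping)

print(df)
```

After this step the original Neighborhood column can be dropped and the numeric label kept for training.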

One-hot encoding

To avoid introducing a false order, one-hot encoding creates binary indicator columns (flags) for each category. For Neighborhood you would create Neighborhood_Downtown, Neighborhood_Suburbs, and Neighborhood_Rural. Each row has a 1 for the category it belongs to and 0 for the others. After adding the new columns you drop the original categorical column. Example using scikit-learn:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example data
data = pd.DataFrame({
    'Neighborhood': ['Downtown', 'Suburbs', 'Rural'],
    'Price': [300000, 200000, 150000]
})

# One-hot encoding
encoder = OneHotEncoder(sparse_output=False)  # or sparse=False for older scikit-learn
encoded = encoder.fit_transform(data[['Neighborhood']])

# Add encoded features back to the DataFrame
encoded_columns = encoder.get_feature_names_out(['Neighborhood'])
encoded_df = pd.DataFrame(encoded, columns=encoded_columns)
data = pd.concat([data, encoded_df], axis=1).drop(columns=['Neighborhood'])

print(data)
Key notes:
  • One-hot encoding removes implied order by creating independent binary features.
  • It expands the feature space — each unique category becomes a column.
Use one-hot encoding for nominal categorical variables (no natural order). Be mindful that one-hot can significantly increase dimensionality when the category count grows.

High cardinality and the downside of one-hot encoding

One-hot encoding works well for low-cardinality features, but when a categorical feature has many unique values (high cardinality) it produces a wide, sparse dataset. This can increase model complexity, memory usage, and training time without improving predictive power—especially for features such as postcodes, user IDs, or product SKUs.
[Figure: one-hot encoding of postcodes — indicator columns for postcodes 1001, 1002, and 1003 filled with 1s and 0s, producing a wide dataset with one column per unique postcode.]
When cardinality is high, consider compact encodings like target (mean) encoding or hashing tricks.
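A quick way to see the blow-up is to one-hot encode a simulated high-cardinality feature. The 500-postcode setup below is illustrative, not from the text:

```python
import numpy as np
import pandas as pd

# Simulate 1,000 rows drawn from 500 possible postcodes
rng = np.random.default_rng(0)
df = pd.DataFrame({'Postcode': rng.integers(1000, 1500, size=1000).astype(str)})

# One-hot encoding creates one indicator column per unique postcode
wide = pd.get_dummies(df, columns=['Postcode'])

print(df['Postcode'].nunique(), 'unique postcodes')
print('One-hot shape:', wide.shape)  # hundreds of mostly-zero columns
```

A single categorical column becomes hundreds of sparse indicator columns, which is exactly the situation where a compact encoding is preferable.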

Target encoding (mean encoding)

Target encoding replaces each category with a statistic of the target variable computed over that category—most commonly the mean target value. This reduces dimensionality while preserving the relationship between the categorical feature and the target. Typical steps:
  1. Group data by the categorical variable (e.g., postcode).
  2. Compute the mean (or other statistic) of the target for each group.
  3. Replace the categorical value with that statistic.
Benefits:
  • Dimensionality reduction: one numeric column instead of many indicator columns.
  • Preserves a relationship between the categorical feature and the target.
  • Efficient for high-cardinality features.
Example — replace postcodes with mean house price:
  • Compute the mean house price per postcode.
  • Substitute that mean as the encoded value for every row with that postcode.
[Figure: target encoding example — postcodes A1, B2, and C3 replaced with the mean house price for each postcode.]
Simple implementation with pandas:
import pandas as pd

# Example data
data = {
    'Postcode': ['A1', 'B2', 'A1', 'C3', 'B2'],
    'HousePrice': [300000, 250000, 320000, 150000, 270000]
}
df = pd.DataFrame(data)

# Global mean of house prices (fallback)
global_mean = df['HousePrice'].mean()

# Calculate the mean house price per postcode
postcode_means = df.groupby('Postcode')['HousePrice'].mean()

# Replace each postcode with its target mean and fill unseen with global mean
df['TargetEncodedPostcode'] = df['Postcode'].map(postcode_means).fillna(global_mean)

print(df)
Practical considerations:
  • Smoothing: combine per-category statistics with the global statistic to reduce variance for rare categories.
  • Handling unseen categories: use a global mean or a special fallback value.
  • Use regularization or weight by category size to prevent noisy estimates from small groups.
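One common smoothing scheme blends each category's mean with the global mean, weighted by the category's row count; a smoothing strength m (a tunable constant, not from the text) controls how strongly rare categories are pulled toward the global mean. A sketch using the same example data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Postcode': ['A1', 'B2', 'A1', 'C3', 'B2'],
    'HousePrice': [300000, 250000, 320000, 150000, 270000]
})

m = 2  # smoothing strength: higher m pulls rare categories toward the global mean
global_mean = df['HousePrice'].mean()

# Per-category mean and count
stats = df.groupby('Postcode')['HousePrice'].agg(['mean', 'count'])

# Smoothed encoding: (count * category_mean + m * global_mean) / (count + m)
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['SmoothedEncoding'] = df['Postcode'].map(smoothed)
print(df)
```

The rare postcode C3 (one row) ends up much closer to the global mean than its raw category mean, which is the point of smoothing.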
Target encoding can leak target information if applied naively (computing encodings on the full dataset). Prevent leakage using out-of-fold (cross-validated) encodings, train-only computations, or smoothing techniques.
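One leakage-safe pattern is out-of-fold encoding: each row is encoded using category statistics computed on the other folds only, so a row's encoding never sees its own target value. A minimal sketch with scikit-learn's KFold (data and column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'Postcode': ['A1', 'B2', 'A1', 'C3', 'B2', 'A1'],
    'HousePrice': [300000, 250000, 320000, 150000, 270000, 310000]
})

global_mean = df['HousePrice'].mean()
df['OofEncoded'] = float('nan')

# For each fold, compute category means on the OTHER folds only;
# categories unseen in the training folds fall back to the global mean
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, valid_idx in kf.split(df):
    fold_means = df.iloc[train_idx].groupby('Postcode')['HousePrice'].mean()
    df.loc[df.index[valid_idx], 'OofEncoded'] = (
        df.iloc[valid_idx]['Postcode'].map(fold_means).fillna(global_mean).values
    )

print(df)
```

At inference time, encodings for new data should be computed once from the full training set (never from test data).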

Quick comparison of encoding strategies

Encoding method | Best for | Pros | Cons
Label encoding | Ordinal categories | Compact, simple | Imposes an order that may be meaningless
One-hot encoding | Low-cardinality nominal categories | No implicit order, interpretable | Can explode the feature space for high cardinality
Target encoding | High-cardinality categories correlated with the target | Compact, captures target relationships | Risk of target leakage; needs smoothing/out-of-fold schemes

Summary / Practical checklist

  • Outliers: detect with methods like IQR; decide whether to drop, cap (Winsorize), or otherwise transform them.
  • Scaling: pick scaling appropriate to model type — standardization (zero mean, unit variance), min-max scaling, or normalization (e.g., L2) for distance-based methods.
  • Categorical data: convert categories to numeric values using an encoding strategy that matches the data and model:
    • Label encoding for ordinal categories.
    • One-hot encoding for low-cardinality nominal features.
    • Target encoding (with out-of-fold or smoothing) for high-cardinality features.
  • Avoid leakage: never compute encodings on the full dataset before splitting; use cross-validation/out-of-fold approaches.
  • Tools: use libraries like pandas, NumPy, and scikit-learn for preprocessing tasks.