In the previous lesson we introduced linear regression with a single feature. Here we extend that to a more realistic dataset with multiple features. Using the car-price example, we show how feature encoding, scaling, training, and hosting come together to produce predictions.

Multivariate inputs: moving to higher dimensions

Real datasets include many attributes: age, color, sunroof, mileage, alarm, and more. As you add features you move into higher-dimensional input space (five features ⇒ five dimensions). While visualization becomes difficult beyond three dimensions, the mathematics stays the same: the model combines numeric inputs with learned weights to make predictions. To train models, every input must be numeric. Categorical and boolean features must be encoded; continuous features may need scaling. The diagram below illustrates encoding multiple car features before feeding them to a learning algorithm.
A slide diagram titled "Training With Multiple Features" showing example car features (Age, Color, Sunroof, Mileage, Alarm) being encoded into numerical data and then used for training a model. Two features (Color and Sunroof) are annotated as not having a numerical value and need encoding.

Encoding and preprocessing — best practices

Common encodings and preprocessing choices:
Feature type            | Typical encoding                                 | When to use
Categorical (unordered) | One-hot encoding                                 | Nominal categories such as color
Categorical (ordered)   | Label / ordinal encoding                         | Only when categories have a natural order
Boolean                 | 0 / 1                                            | Binary flags (sunroof, alarm)
Numerical               | Scaling / normalization (standardize or min-max) | Keep feature magnitudes comparable (e.g., mileage)
Key points:
  • One-hot is preferred for unordered categories to avoid implying an order.
  • Boolean flags map cleanly to 0/1.
  • Scale numeric features (e.g., express mileage in thousands or standardize) so learned weights have reasonable magnitudes.
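The encodings above can be sketched with pandas; the column names and rows here are illustrative, not from a real dataset:

```python
import pandas as pd

# Illustrative raw car records (not real data)
cars = pd.DataFrame({
    "age": [5, 2, 8],
    "color": ["red", "blue", "red"],     # unordered category -> one-hot
    "sunroof": [True, False, True],      # boolean flag -> 0/1
    "mileage": [100_000, 20_000, 150_000],
    "alarm": [False, True, False],
})

encoded = pd.get_dummies(cars, columns=["color"])  # one-hot encode color
encoded[["sunroof", "alarm"]] = encoded[["sunroof", "alarm"]].astype(int)
encoded["mileage"] = encoded["mileage"] / 1_000    # scale to thousands

print(encoded.columns.tolist())
```

One-hot encoding adds one 0/1 column per color value, so the model never treats "blue" as numerically greater than "red".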

Feature symbols and weights

Assign a symbol to each input feature and a corresponding weight (coefficient) that indicates its influence:
  • X1 = age
  • X2 = color (encoded)
  • X3 = sunroof (0/1)
  • X4 = mileage
  • X5 = alarm (0/1)
Weights w1 … w5 can be positive or negative. Some weights are expected to be large (e.g., mileage), others small (e.g., color). The next illustration highlights how mileage strongly influences price.
A slide titled "Training With Multiple Features" showing car features on the left (Age, Color, Sunroof, Mileage, Alarm) with Mileage highlighted. To the right are two car illustrations labeled with different mileages (100,000 vs 20,000) and corresponding prices (cheaper vs much more expensive).

Linear model — combining features

A simple linear model predicts a target as a weighted sum of features plus a bias:
A slide titled "Training With Multiple Features" lists car-related features (Age, Color, Sunroof, Mileage, Alarm) paired with weights w1–w5. To the right is the linear model equation f(x) = w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + b, showing how features are combined to make a prediction.
Algebraically:
# Linear model (symbolic)
# f(x) = w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + b
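The symbolic form above can be made runnable; the weights and inputs below are placeholders, not trained values:

```python
def f(x, w, b):
    """Linear model: weighted sum of features plus a bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w = [0.5, -0.2, 1.0, -0.3, 0.1]  # w1..w5 (illustrative, untrained)
b = 25.0                         # bias
x = [3.0, 1.0, 1.0, 40.0, 0.0]   # x1..x5: age, color, sunroof, mileage, alarm

print(f(x, w, b))  # ≈ 15.3
```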

Training objective and loop

Training optimizes parameters (w1…w5 and b) to minimize a loss function. For regression, the mean squared error (MSE) or sum of squared errors is common. For a single sample:
# Squared error for a single example
# error = (f(x) - y) ** 2
Training proceeds iteratively:
  1. Initialize parameters (randomly or with sensible defaults).
  2. Compute predictions f(x) for training samples.
  3. Compute the loss (how far predictions differ from targets).
  4. Adjust parameters to reduce the loss (gradient descent or another optimizer).
  5. Repeat until convergence or another stopping condition.
The following diagram summarizes prediction, error computation, and parameter updates in a loop.
A diagram titled "Training Process" showing three steps: the model makes predictions (using weights w1…w5 and bias b), you compare the prediction to the actual value (error = (f(x)−y)^2), and you adjust parameters to minimize the error. A looped arrow and the caption "Repeat process" indicate iterating these steps.
Gradient descent intuition: imagine the loss surface as a valley — at each step compute the slope (gradient) and move the parameters downhill until you reach a (local) minimum.
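The loop above can be sketched for the one-feature case; the toy data and learning rate are illustrative:

```python
# Toy (x, y) pairs sampled from the line y = 2x + 1
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0          # step 1: initialize parameters
lr = 0.05                # learning rate (step size downhill)

for _ in range(2000):    # step 5: repeat until (approximate) convergence
    grad_w = grad_b = 0.0
    for x, y in data:
        pred = w * x + b            # step 2: compute prediction f(x)
        err = pred - y              # step 3: how far off is it?
        grad_w += 2 * err * x       # gradient of squared error w.r.t. w
        grad_b += 2 * err           # gradient of squared error w.r.t. b
    w -= lr * grad_w / len(data)    # step 4: move parameters downhill
    b -= lr * grad_b / len(data)

print(round(w, 2), round(b, 2))  # approaches w = 2, b = 1
```

Each pass computes the slope of the loss with respect to each parameter and steps against it, which is exactly the "walk downhill in the valley" picture.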

Numeric example

Here’s an example showing trained parameters and a prediction (mileage scaled to thousands):
# Example trained model parameters (numeric). Mileage scaled to thousands.
w1, w2, w3, w4, w5 = -1.1, 0.4, 1.1, -0.8, 0.2
b = 25.0  # baseline price in thousands

# Example feature vector (x1..x5). Mileage expressed in thousands (20.0 => 20,000 miles).
x = [5.0, 2.0, 1.0, 20.0, 0.0]  # age, encoded color, sunroof, mileage (thousands), alarm

# Prediction (price in thousands)
f_x = w1 * x[0] + w2 * x[1] + w3 * x[2] + w4 * x[3] + w5 * x[4] + b
# f_x ≈ 5.4, i.e. a predicted price of about $5,400
Reminder: consistent scaling between training and inference is crucial — many teams scale features (e.g., divide mileage by 1,000) to keep weights interpretable and numerically stable.

From training to hosting and inference

Training uses labeled data (inputs with known targets), enabling the model to compare predictions to ground truth and improve. After training, you deploy (host) the trained model on a compute platform (virtual machine, container, on-prem server, or managed service like SageMaker). The hosted model receives new input data (same features, without targets) and returns predictions by applying the learned function f(x).
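A minimal sketch of the hosted side, assuming the parameters have been saved after training (the values and the JSON stand-in for a model file are illustrative):

```python
import json

# Stand-in for loading a serialized model artifact from disk or a model store
saved = json.loads('{"w": [-1.1, 0.4, 1.1, -0.8, 0.2], "b": 25.0}')

def predict(features):
    """Apply the learned function f(x) to one preprocessed feature vector."""
    return sum(wi * xi for wi, xi in zip(saved["w"], features)) + saved["b"]

# A new car arrives with the same features (same encoding and scaling),
# but no target price
new_car = [5.0, 2.0, 1.0, 20.0, 0.0]
print(predict(new_car))  # predicted price in thousands, ≈ 5.4
```

A managed service such as SageMaker wraps essentially this step behind an endpoint: load the trained parameters, apply f(x) to each incoming request.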
A slide titled "Summary" that lists four ML inference steps: train a model with labeled data, host it for inference, use new data with the same features but no targets, and generate predictions. It mentions the model's learned function f(x).

Key takeaways

  • ML models learn numeric relationships between features and a target by tuning weights and biases.
  • Encode categorical and boolean features numerically before training (prefer one-hot for unordered categories).
  • Scale numerical features to keep weights at reasonable magnitudes and improve optimizer behavior.
  • Training minimizes a loss function (e.g., squared error) using optimization methods such as gradient descent.
  • Ensure the same preprocessing pipeline is applied to training and inference data to avoid serving errors.
In practice, package encoding and scaling into a single preprocessing pipeline and apply it identically at training and inference time; training–serving preprocessing mismatches are among the most common sources of production errors.
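One dependency-free way to keep training and serving consistent is to route both through the same preprocessing function (the category list and record layout are illustrative):

```python
COLORS = ["red", "blue", "green"]  # known unordered categories

def preprocess(age, color, sunroof, mileage, alarm):
    """Turn one raw car record into the numeric vector the model expects."""
    color_one_hot = [1.0 if color == c else 0.0 for c in COLORS]  # no implied order
    return [
        float(age),
        *color_one_hot,
        1.0 if sunroof else 0.0,   # boolean flag -> 0/1
        mileage / 1_000,           # scale mileage to thousands
        1.0 if alarm else 0.0,
    ]

train_row = preprocess(5, "blue", True, 20_000, False)
serve_row = preprocess(5, "blue", True, 20_000, False)
assert train_row == serve_row      # identical pipeline, identical features
print(train_row)
```

Libraries such as scikit-learn generalize this idea with `Pipeline`, which bundles fitted transformers with the model so both paths cannot drift apart.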

What's next

This completes the lesson. Future material will cover the full ML pipeline: ideation → data preparation → training → deployment → monitoring and inference.
