This lesson explains core machine learning concepts and the fundamentals of model training and inference. We’ll follow a practical example — a tabular dataset of London house prices — to show what happens during training, how simple models like linear regression work, how algorithms (for example, XGBoost or Linear Learner) interact with data, and how iterative optimization minimizes loss to produce a deployable model. We cover:
  • What training data must contain.
  • How an algorithm creates a model artifact from data.
  • How inference works on a hosted prediction endpoint.
  • The math behind linear regression and loss minimization.
  • How training generalizes to multiple features and the risk of overfitting.
To begin, supervised machine learning requires training examples that include both inputs (features) and the target value you want the model to predict. Our running example is a CSV table of London house sales. Each row contains features such as number of bedrooms, number of bathrooms, square footage, and postcode/area. The sale price is the target (the value we want the model to predict). During training the algorithm learns how combinations of features map to the target so the trained model can predict sale prices for new, unseen properties.
Your training dataset must include the target value for each training example. Supervised learning cannot learn the input→output mapping without that target column.
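A minimal sketch of what such a labeled training table looks like, using pandas (the column names and values are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical London house-sales table: every row carries input features
# plus the target column ("price") that supervised learning requires.
train = pd.DataFrame({
    "bedrooms":  [2, 3, 4, 3],
    "bathrooms": [1, 2, 2, 1],
    "sqft":      [650, 900, 1400, 1100],
    "postcode":  ["E1", "SW4", "N1", "SE15"],
    "price":     [320_000, 480_000, 710_000, 455_000],  # target
})

X = train.drop(columns=["price"])  # inputs (features) only
y = train["price"]                 # the target column
```

Separating `X` and `y` up front mirrors what every training API expects: features in, targets alongside.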
After preparing data you pick an algorithm (for example, XGBoost, LightGBM, linear models, k-NN). Algorithms are pre-built methods that extract patterns and produce a mathematical representation of the relationship between features and targets. Training runs the chosen algorithm against your labeled data and outputs a model artifact (for example, model.tar.gz or model.tgz) that encodes the learned parameters. Once training completes, host the model artifact on a prediction platform (a server, virtual machine, or a managed service such as Amazon SageMaker). An inference request provides the same input features used in training but omits the target; the model returns a predicted value (for example, “£320,000”).
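The train → artifact → inference cycle can be sketched with scikit-learn and joblib (the feature values and artifact filename are illustrative; a managed service would package and host the artifact for you):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy features: [bedrooms, bathrooms, sqft]; targets are sale prices in £.
X = np.array([[2, 1, 650], [3, 2, 900], [4, 2, 1400], [3, 1, 1100]])
y = np.array([320_000, 480_000, 710_000, 455_000])

model = LinearRegression().fit(X, y)   # training: learn weights and bias
joblib.dump(model, "model.joblib")     # the model artifact, saved to disk

# Inference: same feature layout as training, but no target supplied.
loaded = joblib.load("model.joblib")
prediction = loaded.predict([[3, 2, 1000]])
```

The same separation holds in production: the artifact is what gets deployed, and inference requests carry only features.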
[Figure: flowchart titled "Machine Learning Basics" showing training data and an algorithm feeding the training process to produce a trained model, which is hosted on a prediction platform that takes new (no-target) data and returns inference predictions.]
Although deep learning and LLMs receive a lot of attention, tabular problems (linear and logistic regression, tree-based models such as XGBoost) account for many industry use cases. Mastering them offers practical value for business forecasting, risk scoring, and many production ML roles. Common ML applications:
  • Classifying objects (e.g., fraudulent transaction vs. legitimate).
  • Forecasting trends (e.g., next-month sales).
  • Identifying non-obvious relationships for business intelligence and decision-making.
[Figure: slide titled "Machine Learning Basics" with three panels: "Classifying objects," "Forecasting trends," and "Identifying relationships," each illustrated with a simple icon.]
Linear regression — building intuition
  • We predict house price from a single input feature (for example, property size) to introduce the idea of fitting.
  • Plot the (x, y) points (size vs. price). A line that approximates these points is the model: it gives a rule to predict y from a new x.
[Figure: slide titled "Linear Regression — Understanding the Math" showing a scatter plot with blue data points and an orange best-fit line, plus the formula f(x) = ax + b with a labeled as the slope and b as the y-intercept.]
Mathematical form
  • A line is commonly written as f(x) = a x + b. In ML we usually write f(x) = w x + b where:
    • w is the weight (coefficient) and controls the slope.
    • b is the bias (intercept) and shifts the line vertically.
  • The line typically does not pass exactly through all points. The vertical distance from an observed point to the line is the residual (error).
To measure how well the line fits the data we square each residual and sum them across all training examples. This sum of squared residuals (ordinary least squares loss) prevents positive and negative residuals from canceling out. Training adjusts w and b to minimize this loss — searching for the line of best fit. Training is an optimization procedure: the algorithm updates parameters, recomputes the loss, and repeats until it reaches a minimum (local or global).
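The loss computation above is short enough to write out directly. A sketch, using made-up sizes and prices (in £1,000s) to show that a better-fitting line yields a lower sum of squared residuals:

```python
def predict(x, w, b):
    """Linear model f(x) = w*x + b."""
    return w * x + b

def sum_squared_residuals(xs, ys, w, b):
    # Squaring each residual keeps positive and negative errors from cancelling.
    return sum((y - predict(x, w, b)) ** 2 for x, y in zip(xs, ys))

sizes  = [650, 900, 1400, 1100]   # square footage
prices = [320, 480, 710, 455]     # sale price in £1,000s

loss_a = sum_squared_residuals(sizes, prices, w=0.5, b=0)  # a near fit
loss_b = sum_squared_residuals(sizes, prices, w=0.1, b=0)  # a poor fit
```

Training is the search for the (w, b) pair that drives this loss as low as possible.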
[Figure: slide titled "Role of Algorithm in Model Training" with an "Algorithm" box alongside panels noting that algorithms extract patterns from data, enable accurate predictions on unseen data, and reduce error during training.]
Multivariate (multiple features)
  • With several input features (for example, bedrooms, bathrooms, square footage, age), linear regression generalizes to a weighted sum:
    • f(x) = w1*x1 + w2*x2 + ... + wn*xn + b
  • Each feature xi gets its own weight wi. As the number of features grows (tens to hundreds), the parameter space becomes high-dimensional.
  • Larger models can capture more complexity but are more prone to overfitting (learning training set noise rather than general patterns). Practical training uses validation data and regularization to manage this trade-off.
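The weighted sum generalizes naturally to a dot product. A sketch with hypothetical learned weights for [bedrooms, bathrooms, sqft, age] (the values are invented for illustration):

```python
import numpy as np

# Hypothetical weights and bias a trained model might have learned:
w = np.array([25_000.0, 15_000.0, 400.0, -1_500.0])  # per-feature weights
b = 50_000.0                                          # bias (intercept)

def f(x):
    # Multivariate linear model: weighted sum of features plus bias.
    return np.dot(w, x) + b

x_new = np.array([3, 2, 1000, 10])  # a new, unseen property
price = f(x_new)
```

With n features the model has n + 1 parameters (n weights plus the bias), which is why the parameter space grows with feature count.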
[Figure: slide titled "Role of Algorithm in Model Training" showing a scatter plot of car age (years) versus sale price with a fitted trend line and arrows marking each point's residual, illustrating how predictions compare to actual values.]
Glossary (compact)
  • Weight (w): coefficient applied to a feature (e.g., in y = 2x + 4, w = 2).
  • Bias (b): intercept term that shifts predictions (e.g., in y = x + 2, b = 2).
[Figure: slide titled "Role of Algorithm in Model Training" explaining the linear model f(x) = wx + b with w, b, and x defined, next to a graph of example lines (y = x, y = 2x, y = 2x + 4, y = x + 2).]
Optimization and practical training tips
  • Many algorithms use gradient-based optimization (for example, gradient descent) which computes the gradient of the loss with respect to each parameter and updates parameters in the direction that reduces loss.
  • The learning rate controls the update step size:
    • Too large → risk of overshooting minima and unstable training.
    • Too small → slow convergence and high compute cost.
  • Stopping criteria: maximum number of iterations/epochs, minimum improvement threshold, or early stopping based on validation loss. These help prevent wasted compute and reduce overfitting.
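The update rule and both stopping criteria can be sketched for the single-feature model f(x) = w*x + b, using the sum-of-squared-residuals loss (the learning rate and tolerance are illustrative and tuned to this toy data scale):

```python
def gradient_descent(xs, ys, lr=1e-7, max_iters=10_000, tol=1e-9):
    """Fit f(x) = w*x + b by minimising the sum of squared residuals."""
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    for _ in range(max_iters):                 # stopping criterion 1: iteration cap
        residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
        loss = sum(r * r for r in residuals)
        if prev_loss - loss < tol:             # stopping criterion 2: minimum improvement
            break
        prev_loss = loss
        # Gradients of the loss with respect to each parameter:
        grad_w = -2 * sum(r * x for r, x in zip(residuals, xs))
        grad_b = -2 * sum(residuals)
        w -= lr * grad_w                       # step opposite the gradient
        b -= lr * grad_b
    return w, b

sizes  = [650, 900, 1400, 1100]   # square footage
prices = [320, 480, 710, 455]     # sale price in £1,000s
w, b = gradient_descent(sizes, prices)
```

Note how the learning rate had to be scaled to the feature magnitudes; with lr much larger the w updates would overshoot and diverge, exactly the failure mode described above.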
Algorithm selection at a glance
  • Linear models — when to use: simple relationships, interpretability; strengths: fast, low variance.
  • Tree-based (XGBoost, LightGBM) — when to use: tabular data with mixed feature types; strengths: high accuracy, handles heterogeneity.
  • k-NN — when to use: small datasets, non-parametric; strengths: simple, few assumptions.
  • Neural networks — when to use: complex interactions, large datasets; strengths: high capacity, requires more data.
Be cautious with many features or overly long training runs: they increase the risk of overfitting and unnecessary compute costs. Use held-out validation data, regularization, and early stopping to monitor generalization.
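One way to see regularization at work is to compare ordinary least squares with an L2 (ridge) penalty on synthetic data where most features are pure noise. A sketch (the data and penalty strength are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # 20 examples, 5 features
true_w = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # only the first feature matters
y = X @ true_w + 0.1 * rng.normal(size=20)    # noise features invite overfitting

def fit(X, y, alpha):
    # Closed-form linear regression with an L2 penalty alpha * ||w||^2:
    # solves (X^T X + alpha * I) w = X^T y.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

w_plain = fit(X, y, alpha=0.0)    # ordinary least squares
w_ridge = fit(X, y, alpha=10.0)   # regularised: weights shrink toward zero
```

The penalty shrinks the weight vector, discouraging the model from leaning on noise features; in practice the penalty strength is itself a hyperparameter tuned on validation data.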
Summary — common workflow
  1. Prepare training data: include features and the target column for each example.
  2. Choose an algorithm suitable for the problem and data modality (tabular, text, images).
  3. Train iteratively to minimize a loss function (for regression, often sum of squared errors).
  4. Validate and tune hyperparameters (learning rate, regularization, model complexity).
  5. Export the trained model artifact and host it on a prediction platform to serve inference requests (new inputs without targets).
  6. Monitor model performance in production and retrain as needed with new data.