In this lesson we’ll walk through how to train a model in Amazon SageMaker Studio and how to monitor training jobs. This is a central lesson: everything up to this point prepares you for controlled, reproducible model training at scale. We will cover:
  • Common challenges for ML model training (compute, algorithms, training code, and iteration).
  • The SageMaker solution: managed training jobs.
  • How to kick off and monitor training from a SageMaker Studio Jupyter notebook using the SageMaker Python SDK.
  • How training jobs produce optimized model artifacts and how compute resources are provisioned and released automatically.
This guide assumes you have a prepared dataset in S3 and a SageMaker execution role. If you need setup instructions, see the SageMaker getting started documentation.

Why training is hard (common problems)

Training at scale introduces operational, cost, and iteration challenges that slow down ML delivery:
  • Infrastructure: deciding where to store training data and where to run training jobs.
  • Experimentation scale: hundreds or thousands of permutations of algorithms, datasets, and hyperparameters.
  • Data movement: copying large datasets to local machines is slow, expensive, or restricted by policies.
  • Local compute limits: laptops and small workstations are often insufficient for larger models — leading to long training times and limited reproducibility.
[Slide: "Problem: ML Infrastructure and Iterations", with four panels: Infrastructure Complexity, Slow Experimentation, Data Processing Overhead, and Training Limitations.]
Beyond training, you must also consider deployment and production monitoring: hosting the model for inference, detecting model drift, tracking data distribution changes, and sizing resources to balance performance and cost. When iterating on many models you also need model lifecycle management—tracking which models are in training, approved for production, deployed, or flagged for retraining.
[Slide: "Problem: ML Infrastructure and Iterations", with four panels: Deployment Challenges, Monitoring and Debugging Issues, High Costs, and Workflow Inefficiencies.]

SageMaker managed training jobs: concept and benefits

SageMaker simplifies these concerns by encapsulating training in a managed training job. A training job ties together four core ingredients:
| Resource | Purpose | Example |
| --- | --- | --- |
| Compute infrastructure | Right-sized instances, optionally distributed | ml.c5.12xlarge, multi-instance training |
| Training dataset | Prepared features and labels stored in S3 | s3://my-bucket/data/processed.csv |
| Algorithm | Built-in or custom container for training | XGBoost, TensorFlow, custom Docker image |
| Training script | Orchestration that loads data, trains, and emits artifacts | Python training script that writes model.tar.gz |
A SageMaker training job provisions the compute, pulls the chosen container (or your custom image), runs the training script, stores inputs and outputs in S3, and tears down the compute when the job completes. This provides scalability, reproducibility, and cost control.

What is a training job?
  • A managed request to run training on temporary, dedicated compute.
  • Defined from a notebook or CI pipeline, but the heavy compute runs on managed instances (not on your notebook kernel).
  • Configured with instance type/count, algorithm image, S3 input locations, and S3 output for artifacts.
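Under the hood, the SDK assembles these ingredients into a request for the low-level CreateTrainingJob API (boto3's `sagemaker` client). The sketch below maps the four ingredients onto that request; the job name, role ARN, bucket, and image URI are placeholders for your own values.

```python
# Sketch of a CreateTrainingJob request; placeholders must be replaced
# with your own role ARN, bucket, and a region-specific image URI.
request = {
    "TrainingJobName": "demo-training-job-001",
    # Algorithm: a built-in algorithm container or your custom image
    "AlgorithmSpecification": {
        "TrainingImage": "<linear-learner image URI for your region>",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    # Training dataset: an S3 input channel
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-sagemaker-bucket/data/processed.csv",
                }
            },
            "ContentType": "text/csv",
        }
    ],
    # Where model artifacts are written when the job completes
    "OutputDataConfig": {"S3OutputPath": "s3://my-sagemaker-bucket/models/"},
    # Compute infrastructure: temporary, dedicated instances
    "ResourceConfig": {
        "InstanceType": "ml.c5.12xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}
# boto3.client("sagemaker").create_training_job(**request) would submit it.
```

You rarely write this request by hand; the Estimator shown later in this lesson builds it for you, but seeing the raw shape makes clear what a training job actually consists of.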
[Diagram: SageMaker training-job workflow, in which containers from Amazon Elastic Container Registry feed a SageMaker training job launched from a JupyterLab space, with S3 holding the input data (processed.csv) and the model output (model.tgz).]
Benefits of SageMaker training jobs:
  • Right-size compute and pay only for the time used.
  • Easily run distributed training across multiple instances.
  • Use built-in algorithm containers to reduce boilerplate.
  • Support for popular frameworks (TensorFlow, PyTorch, scikit-learn).
  • Managed hyperparameter tuning to accelerate experiments.
  • Integration with S3 for efficient data access and artifact storage.
  • Optionally use Spot instances for cost savings (with trade-offs).

Example: define and run a training job from a SageMaker Studio notebook

Below is a concise Python example using the SageMaker Python SDK (v2). Update the role, bucket, and S3 URIs for your environment.
# python
from sagemaker import Session
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator
from sagemaker import image_uris

# Set up session and names
session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # replace with your role ARN
bucket = "my-sagemaker-bucket"  # replace with your S3 bucket
region = session.boto_region_name

# Choose a built-in algorithm image (example: linear-learner)
training_image = image_uris.retrieve("linear-learner", region)

# S3 locations
input_s3_uri = f"s3://{bucket}/data/processed.csv"
output_s3_uri = f"s3://{bucket}/models/"

# Define the Estimator
estimator = Estimator(
    image_uri=training_image,
    role=role,
    instance_count=1,
    instance_type="ml.c5.12xlarge",
    volume_size=50,  # GB
    output_path=output_s3_uri,
    sagemaker_session=session,
    hyperparameters={
        "feature_dim": "10",
        "predictor_type": "regressor",
        "mini_batch_size": "100"
    },
)

# Specify training data (as a single file or channel)
train_input = TrainingInput(s3_data=input_s3_uri, content_type="text/csv")

# Start training (creates a managed SageMaker training job)
estimator.fit({"train": train_input})
Tips:
  • For framework containers (TensorFlow, PyTorch), you typically provide a training script and use a Framework estimator.
  • Increase instance_count and use framework-specific distributed configurations for multi-node training.
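With script-mode framework containers, the entry point you provide is an ordinary Python script. A minimal skeleton might look like the following; the hyperparameter names are illustrative, while SM_MODEL_DIR and SM_CHANNEL_TRAIN are environment variables SageMaker sets inside the training container.

```python
# train.py: skeleton of a script-mode training entry point (illustrative).
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    # SageMaker sets these environment variables inside the container;
    # the fallbacks match the standard container paths.
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAIN",
                                               "/opt/ml/input/data/train"))
    return parser.parse_args(argv)


def main():
    args = parse_args()
    # ... load data from args.train and fit the model here ...
    # Anything saved under args.model_dir is packaged by SageMaker
    # into model.tar.gz and uploaded to the job's S3 output path.


if __name__ == "__main__":
    main()
```

The key contract is the filesystem layout: read training data from the channel directory, write the finished model under the model directory, and SageMaker handles the S3 transfer on both sides.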

Monitoring progress and viewing logs

The SDK integrates with CloudWatch and can stream logs into your notebook session.
  • estimator.fit() prints logs to the notebook while the job runs (if invoked interactively).
  • You can describe a training job via the SageMaker API to poll status.
# python
import boto3
sm = boto3.client("sagemaker", region_name=region)

job_name = estimator.latest_training_job.name
resp = sm.describe_training_job(TrainingJobName=job_name)
status = resp["TrainingJobStatus"]  # InProgress, Completed, Failed, Stopped
print(f"Training job '{job_name}' status: {status}")
  • Stream logs programmatically or from the notebook:
# python
estimator.logs()  # streams logs until the job completes
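If you prefer explicit polling over streaming, a small helper can loop on the job status until it reaches a terminal state. This is a sketch: the `describe` argument is any zero-argument callable returning a DescribeTrainingJob response, for example `lambda: sm.describe_training_job(TrainingJobName=job_name)`. (boto3's sagemaker client also provides a `training_job_completed_or_stopped` waiter for the same purpose.)

```python
import time

# Terminal values of TrainingJobStatus.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, poll_seconds=30, sleep=time.sleep):
    """Poll `describe` until the job reaches a terminal state.

    Returns the final TrainingJobStatus string. `sleep` is injectable
    so the polling loop can be exercised without real delays.
    """
    while True:
        status = describe()["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

Usage against a live job would be `wait_for_training_job(lambda: sm.describe_training_job(TrainingJobName=job_name))`; for anything long-running, streaming logs or a CloudWatch alarm is usually more practical than a busy notebook cell.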
Operational notes:
  • Prepare and validate input data in S3 before launching training (for example: processed.csv created by your preprocessing pipeline).
  • Model artifacts are written back to S3 as a compressed archive (model.tar.gz). Use the artifact to create a SageMaker model for real-time endpoints or to run Batch Transform jobs for offline inference.
  • Use checkpointing and resume strategies for long-running jobs, especially when leveraging Spot instances.
  • Enable distributed training by increasing instance_count and configuring distributed strategies for your chosen framework.
Spot instances can provide large cost savings but are interruptible. If you use them for training, ensure your training code or framework supports checkpointing and automatic resumption, or be prepared to retry interrupted jobs.
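In the Python SDK (v2), Managed Spot Training is enabled with a few extra Estimator arguments. A sketch, with a placeholder bucket name:

```python
# Extra Estimator keyword arguments for Managed Spot Training (SageMaker SDK v2).
bucket = "my-sagemaker-bucket"  # placeholder: use your own bucket

spot_kwargs = {
    "use_spot_instances": True,
    "max_run": 3600,   # hard cap on training time, in seconds
    "max_wait": 7200,  # total wait for Spot capacity plus training; must be >= max_run
    # Checkpoints written here survive Spot interruptions, so the job can resume.
    "checkpoint_s3_uri": f"s3://{bucket}/checkpoints/",
}
# estimator = Estimator(image_uri=..., role=..., ..., **spot_kwargs)
```

Setting max_wait higher than max_run gives SageMaker room to wait for Spot capacity and to restart after interruptions without failing the job.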

Quick decision guide

| Question | SageMaker feature to use |
| --- | --- |
| Need repeatable, auditable training? | Managed training jobs + S3 artifacts + job metadata |
| Running many experiments? | Hyperparameter tuning jobs and multiple training jobs with consistent inputs |
| Need large compute or distributed training? | Larger instance types or a higher instance_count |
| Want to reduce costs? | Spot instances and tuned job duration/checkpoints |
| Need to monitor models in production? | SageMaker Model Monitor and CloudWatch metrics |

Summary

  • Use SageMaker training jobs to offload and scale training, reduce data movement, and manage compute costs.
  • Define training jobs from a notebook using the SageMaker SDK; the heavy compute runs on separate managed instances.
  • Monitor training via SDK methods, CloudWatch, or the SageMaker console and stream logs into notebooks for debugging.
  • Store prepared training data and model artifacts in S3 for reproducible runs, deployment, and model tracking.
