How do you create a SageMaker training job? Programmatically, using an estimator object from the SageMaker Python SDK. An estimator encapsulates the training-job configuration: compute resources (for example, ml.c5.24xlarge), input data locations (Amazon S3 paths), and the output location where the model artifact (a model.tar.gz archive) will be saved. When you call an estimator's fit() method, SageMaker provisions the requested instance(s), pulls the container image for the chosen algorithm, runs training, and tears down the instances when the job completes. Because SageMaker manages the underlying Amazon EC2 instances, you don't administer them directly and are billed only while the training resources are running.
A slide titled "Workflow: Estimator Object Class" that explains an estimator represents a machine learning training job. Three boxes note its responsibilities: sets up computing resources, manages data input and storage, and runs training on AWS SageMaker.

Estimator class and convenience subclasses

The Estimator base class represents generic training jobs. SageMaker also provides convenience subclasses for many built-in algorithms and frameworks (for example, LinearLearner, XGBoost wrappers, scikit-learn, PyTorch, TensorFlow). These subclasses automatically pick the correct container image for the algorithm, so you don't need to specify an image URI manually. Below is a concrete example using the LinearLearner estimator subclass for a regression task. It shows creating the estimator, specifying instance type/count, hyperparameters, and the S3 output location, then launching training with .fit().
Replace the example role ARN with your execution role (or call get_execution_role() in a SageMaker notebook), and make sure the S3 bucket exists and is accessible to that role.
# linear_learner_estimator.py
import numpy as np
import sagemaker
from sagemaker.amazon.linear_learner import LinearLearner

# SageMaker session and role
session = sagemaker.Session()
role = 'arn:aws:iam::123456789012:role/SageMakerRole'  # replace with your role or use get_execution_role()

# Define the S3 output location for model artifacts
s3_output_path = 's3://your-bucket/house-price-data/output/'

# Create a Linear Learner estimator; the first-party algorithm classes
# take their hyperparameters as keyword arguments
linear_estimator = LinearLearner(
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    predictor_type='regressor',  # 'regressor' for linear regression
    output_path=s3_output_path,
    sagemaker_session=session,
    epochs=20,
    optimizer='adam',
    learning_rate=0.01,
    wd=0.001,               # weight decay (L2 regularization)
    normalize_data=True,
    loss='absolute_loss'
)

# Prepare data input: the Amazon first-party estimators train on RecordSet
# objects; record_set() converts NumPy arrays to protobuf and uploads them to S3
features = np.load('house_features.npy').astype('float32')  # example file names
labels = np.load('house_prices.npy').astype('float32')
train_records = linear_estimator.record_set(features, labels=labels, channel='train')

# Launch the training job
linear_estimator.fit(train_records)

print(f"Model artifacts saved to: {s3_output_path}")
This simple example will:
  • Provision the instance(s),
  • Pull the LinearLearner container,
  • Read training data from S3,
  • Run training using the specified hyperparameters,
  • Write the model artifact (model.tar.gz) to the specified output path.
Using the SDK keeps your code compact and focused on the ML task rather than infrastructure plumbing.

Custom containers and the base Estimator

If you need a custom container image or an algorithm wrapper not available as a convenience class, use the Estimator base class and supply an image URI. The example below shows how to retrieve a SageMaker-provided XGBoost image URI and construct a base Estimator.
# custom_estimator_xgboost.py
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = 'arn:aws:iam::123456789012:role/SageMakerRole'  # replace as needed

# Retrieve the SageMaker XGBoost image for the current region and a specified version
xgboost_image_uri = sagemaker.image_uris.retrieve('xgboost', session.boto_region_name, version='1.0-1')

estimator = Estimator(
    image_uri=xgboost_image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://your-bucket/output/',
    sagemaker_session=session
)

# The built-in XGBoost container requires num_round; with the base Estimator,
# hyperparameters are supplied via set_hyperparameters()
estimator.set_hyperparameters(num_round=100, objective='reg:squarederror')

# Launch the training job; TrainingInput lets you declare the content type
estimator.fit({'train': TrainingInput('s3://your-bucket/input/train.csv', content_type='text/csv')})

print("Training completed. Model artifacts saved to S3.")

Hyperparameters — controlling training behavior

Hyperparameters are preset configuration values that control training behavior. They are passed to the training container and affect optimization, regularization, preprocessing, and loss computation. Many convenience estimator classes accept a hyperparameters dictionary; otherwise set them in your training script or container. Common hyperparameters and considerations:
  • epochs: number of full passes over the training dataset. Higher values can improve fit but may overfit; common ranges vary by dataset size (e.g., 10–100).
  • learning_rate: step size for weight updates. Too large can overshoot; too small slows convergence.
  • optimizer: optimization algorithm (e.g., 'adam', 'sgd'). Different optimizers converge differently; choose based on task and dataset.
  • batch_size (mini-batch): number of samples per parameter update. Affects memory footprint and convergence stability.
  • wd (weight decay): L2 regularization strength. Penalizes large weights to reduce overfitting.
  • normalize_data / normalize_label: preprocessing flags. Use only if your dataset hasn't been pre-normalized externally.
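To build intuition for how the epochs and learning_rate knobs interact, here is a small pure-Python sketch (illustrative only, not SageMaker code) of gradient descent on a one-parameter least-squares problem; the dataset and values are made up for demonstration.

```python
# Illustrative only: gradient descent on f(w) = mean((w*x - y)^2)
# for a tiny dataset, to show how learning_rate and epochs interact.

def train(learning_rate, epochs, x=(1.0, 2.0, 3.0), y=(2.0, 4.0, 6.0)):
    """Fit y ~ w*x by gradient descent; the true weight here is 2.0."""
    w = 0.0
    n = len(x)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / n
        w -= learning_rate * grad
    return w

# A moderate learning rate converges toward the true weight of 2.0
print(round(train(learning_rate=0.05, epochs=100), 3))

# A rate that is too small barely moves in the same number of epochs
print(round(train(learning_rate=0.0005, epochs=100), 3))
```

The same trade-off applies inside the training container: too few epochs or too small a learning rate underfits, while too large a rate can diverge.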
A presentation slide titled "Workflow: HyperParameters" that defines hyperparameters as preset configurations for a machine learning algorithm before training. It lists three points: algorithm-specific settings for model training; can be set explicitly for LinearLearner; and SageMaker uses defaults if not specified.

Regularization and preprocessing hyperparameters

Regularization helps prevent overfitting and improves generalization:
  • L1 regularization (sparsity): pushes some weights toward zero, which can effectively remove irrelevant features.
  • L2 regularization (weight decay): penalizes large weights to produce smoother models.
Preprocessing flags (for example, normalize_data or normalize_label) instruct the container to perform scaling/normalization before training. Use these only if your data pipeline hasn’t already standardized the features/labels.
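The two penalty terms can be written down directly. This small pure-Python sketch (illustrative, not SageMaker code) shows how each term scores a weight vector:

```python
def l1_penalty(weights, lam=0.001):
    """L1 term: lam * sum(|w|) -- encourages sparsity."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=0.001):
    """L2 (weight decay) term: lam * sum(w^2) -- discourages large weights."""
    return lam * sum(w * w for w in weights)

# A weight vector with one large weight
weights = [0.1, 0.1, 3.0]

# L2 punishes the single large weight far more than L1 does,
# which is why weight decay pushes weights toward uniformly small values
print(l1_penalty(weights))  # lam * (0.1 + 0.1 + 3.0)
print(l2_penalty(weights))  # lam * (0.01 + 0.01 + 9.0)
```

Adding either term to the training loss trades a little training accuracy for better generalization, with the lam factor (wd in LinearLearner's case for L2) controlling the strength.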

Loss function

The loss function describes what the training process minimizes. For regression, common choices include:
  • absolute loss (L1): sum of absolute residuals — more robust to outliers
  • squared loss (L2): sum of squared residuals — penalizes large errors more heavily
Selecting a loss function affects sensitivity to outliers and convergence dynamics.
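The outlier sensitivity mentioned above is easy to see numerically. The snippet below (plain Python, with illustrative residual values) compares the two losses on the same residuals with and without one large outlier:

```python
def absolute_loss(residuals):
    """L1 loss: mean of |residual|."""
    return sum(abs(r) for r in residuals) / len(residuals)

def squared_loss(residuals):
    """L2 loss: mean of residual^2."""
    return sum(r * r for r in residuals) / len(residuals)

clean = [1.0, -1.0, 2.0, -2.0]
with_outlier = clean + [20.0]  # one large prediction error

# Squared loss blows up on the outlier; absolute loss grows modestly
print(absolute_loss(clean), absolute_loss(with_outlier))  # 1.5 vs 5.2
print(squared_loss(clean), squared_loss(with_outlier))    # 2.5 vs 82.0
```

This is why absolute_loss is a reasonable choice for datasets with noisy labels or extreme values, while squared loss is preferred when large errors should be penalized aggressively.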

Automated hyperparameter tuning (SageMaker Hyperparameter Tuning)

Manually searching hyperparameters is time-consuming. SageMaker Hyperparameter Tuning automates this by launching multiple training jobs (trials) across a defined hyperparameter search space and selecting the best trial based on an objective metric (for example, validation RMSE or validation accuracy). You must define:
  • objective_metric_name and objective_type: the metric to optimize and whether to minimize or maximize it,
  • hyperparameter_ranges: continuous or discrete ranges for each hyperparameter,
  • max_jobs: total number of trials,
  • max_parallel_jobs: number of concurrent trials,
  • metric_definitions: regex patterns that extract the objective metric from training logs; required for custom containers (built-in algorithms publish their metrics automatically), and the regex must match the container's log format.
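Conceptually, a tuner samples candidate settings from these ranges and keeps the trial with the best objective value. A toy random-search sketch in plain Python (with a made-up objective function, not the SageMaker API) captures the idea:

```python
import random

def toy_objective(learning_rate, max_depth):
    """Stand-in for a validation metric; lower is better.
    Pretends the best settings are learning_rate=0.1, max_depth=6."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2

def random_search(max_jobs, seed=0):
    """Sample max_jobs candidates from the ranges and keep the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(max_jobs):
        params = {
            'learning_rate': rng.uniform(0.01, 0.2),  # continuous range
            'max_depth': rng.randint(3, 12),          # integer range
        }
        score = toy_objective(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

score, params = random_search(max_jobs=10)
print(params, score)
```

SageMaker's default strategy is Bayesian rather than purely random, so later trials are informed by earlier results, but the inputs you declare (ranges, objective, job counts) play the same roles.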
Example: building a HyperparameterTuner around an XGBoost estimator.
# hyperparameter_tuning_xgboost.py
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
role = 'arn:aws:iam::123456789012:role/SageMakerRole'  # replace as needed

# Create the XGBoost estimator (using the SageMaker-provided XGBoost container)
xgboost_image_uri = sagemaker.image_uris.retrieve('xgboost', session.boto_region_name, version='1.0-1')
xgboost_estimator = Estimator(
    image_uri=xgboost_image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://your-bucket/output/',
    sagemaker_session=session
)

# Static hyperparameters that are not part of the search space
xgboost_estimator.set_hyperparameters(num_round=100, objective='reg:squarederror')

# Define hyperparameter search space
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(0.01, 0.2),
    'max_depth': IntegerParameter(3, 12)  # max_depth is an integer parameter
}

# Define the tuner; the built-in XGBoost container publishes validation:rmse
# automatically, so no metric_definitions regex is needed here
tuner = HyperparameterTuner(
    estimator=xgboost_estimator,
    objective_metric_name='validation:rmse',  # metric to optimize
    objective_type='Minimize',
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2
)

# Launch the tuning job, passing training and validation data channels
tuner.fit({
    'train': TrainingInput('s3://your-bucket/train-data/', content_type='text/csv'),
    'validation': TrainingInput('s3://your-bucket/validation-data/', content_type='text/csv')
})
The tuner will run up to max_jobs training trials and return the best hyperparameter set according to the specified objective metric. If you tune a custom container instead of a built-in algorithm, also supply metric_definitions with a regex that matches the container's log output so SageMaker can parse the metric successfully.

Quick summary

  • Use estimator subclasses for built-in algorithms (LinearLearner, XGBoost wrappers, etc.) — the SDK chooses the correct container image.
  • Use the base Estimator to supply a custom container image.
  • Configure hyperparameters to control optimization, regularization, preprocessing, and loss.
  • Use SageMaker Hyperparameter Tuning to automatically search for the best hyperparameters; define search space, objective metric, and job counts.
  • Always ensure S3 data paths and IAM execution roles are correctly configured and permissioned.
