Managed training jobs speed up development and reduce cost in two ways: they automate hyperparameter search, and they let you right-size compute (instance type and instance count) for each job.

- SageMaker training jobs run your training code inside managed containers. Point the job at your training data (commonly in S3), choose a built-in algorithm or a framework, and, for framework jobs, supply an entry point script. SageMaker then pulls the container image that matches the specified framework or algorithm.
- The SageMaker Python SDK exposes an Estimator base class and framework-specific subclasses (e.g., TensorFlow, PyTorch). Instantiating an Estimator configures the container image, compute, and runtime behavior for your training job.
- Hyperparameter tuning can be automated with SageMaker hyperparameter tuning jobs. You provide hyperparameter ranges, an objective metric, the maximum number of total and concurrent training jobs, and a search strategy (e.g., grid, random, or Bayesian). SageMaker runs training jobs in parallel, evaluates the objective metric for each, and returns the best hyperparameter configuration.
- Compute sizing is an Estimator property. Use instance_type and instance_count to meet the scale of your workload (single-instance, multi-GPU, or distributed across instances).
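
The tuning loop described above can be illustrated locally. The following is a minimal sketch of the random strategy only, using the standard library; it is not SageMaker's implementation. The names `random_search` and `toy_validation_loss` are hypothetical, the toy objective stands in for the metric a real training job would report, and a thread pool stands in for concurrent training jobs. In the SageMaker Python SDK, the managed equivalent is configured through `HyperparameterTuner` in `sagemaker.tuner`, which takes `max_jobs`, `max_parallel_jobs`, and `strategy`.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical range format: name -> (low, high) bounds for a continuous value.
Ranges = Dict[str, Tuple[float, float]]
Params = Dict[str, float]


def toy_validation_loss(params: Params) -> float:
    """Stand-in objective; a real job would train a model and report a metric."""
    return (params["learning_rate"] - 0.01) ** 2 + (params["dropout"] - 0.3) ** 2


def random_search(
    objective: Callable[[Params], float],
    ranges: Ranges,
    max_jobs: int,
    max_parallel_jobs: int = 2,
    seed: int = 0,
) -> Tuple[Params, float, List[Tuple[float, Params]]]:
    """Sample max_jobs configurations, score them concurrently, return the best."""
    rng = random.Random(seed)
    # Random strategy: every candidate is drawn independently from the ranges.
    candidates = [
        {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        for _ in range(max_jobs)
    ]
    # At most max_parallel_jobs evaluations run at the same time.
    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        scores = list(pool.map(objective, candidates))
    trials = list(zip(scores, candidates))
    # The objective here is a loss, so the lowest score wins.
    best_score, best_params = min(trials, key=lambda trial: trial[0])
    return best_params, best_score, trials


if __name__ == "__main__":
    ranges = {"learning_rate": (0.001, 0.1), "dropout": (0.0, 0.5)}
    best, score, _ = random_search(toy_validation_loss, ranges, max_jobs=10)
    print(best, score)
```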

| Parameter | Purpose | Example |
|---|---|---|
| image_uri | ECR container image used to run your training code | 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest |
| role | IAM role used by SageMaker to access S3, ECR, CloudWatch, etc. | arn:aws:iam::123456789012:role/SageMakerRole |
| instance_type | Compute instance type for training (CPU/GPU) | ml.m5.xlarge, ml.p3.2xlarge |
| instance_count | Number of training instances (more than 1 for distributed training) | 1, 2, 4 |
| entry_point | Training script that runs inside the container | train.py |
| source_dir | Directory packaged and uploaded with the training job | src/ |
| hyperparameters | Dictionary of hyperparameters passed to your script | {'epochs': 10, 'learning_rate': 0.01} |

Replace placeholder values (ECR image URIs, IAM role ARNs, and S3 paths) with your environment-specific resources. Ensure the IAM role has permissions for S3, ECR, CloudWatch, and SageMaker actions.
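
To show how the parameters in the table fit together, here is a minimal sketch that collects them in one place and hands them to the generic `Estimator`. It assumes version 2 of the SageMaker Python SDK, where `Estimator` accepts these keyword arguments. `TrainingJobConfig` and `launch_training` are hypothetical helpers, not part of the SDK, and the `"training"` channel name is an assumption. The SDK import is deferred so that the configuration part runs without the package or AWS credentials.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class TrainingJobConfig:
    """Hypothetical helper (not part of the SageMaker SDK).

    Field names mirror the Estimator parameters listed in the table.
    """

    image_uri: str
    role: str
    instance_type: str = "ml.m5.xlarge"
    instance_count: int = 1
    entry_point: str = "train.py"
    source_dir: str = "src/"
    hyperparameters: Dict[str, Any] = field(default_factory=dict)

    def to_kwargs(self) -> Dict[str, Any]:
        """Check the sizing fields, then return keyword arguments for Estimator."""
        if not self.instance_type.startswith("ml."):
            raise ValueError(f"expected an ml.* instance type, got {self.instance_type!r}")
        if self.instance_count < 1:
            raise ValueError("instance_count must be at least 1")
        return asdict(self)


def launch_training(config: TrainingJobConfig, train_s3_uri: str):
    """Start a training job. Needs the sagemaker package and AWS credentials."""
    # Imported lazily so the config helper above works without the SDK installed.
    from sagemaker.estimator import Estimator

    estimator = Estimator(**config.to_kwargs())
    # "training" is an assumed channel name; train_s3_uri is your own S3 prefix.
    estimator.fit({"training": train_s3_uri})
    return estimator


if __name__ == "__main__":
    # Placeholder values copied from the table; replace them with your own.
    config = TrainingJobConfig(
        image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        hyperparameters={"epochs": 10, "learning_rate": 0.01},
    )
    print(config.to_kwargs())
```

Keeping the sizing fields (`instance_type`, `instance_count`) alongside the rest of the configuration makes it straightforward to rerun the same job on different compute by changing two values.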

- Try this end-to-end in SageMaker Studio to see training job logs, metrics, and artifacts in real time.
- For more details: