In this lesson we summarize several advanced Amazon SageMaker features visible in the console, organized around common problems and pragmatic solutions. The goal is to give a high‑level, actionable view of when to use foundation models, Bedrock, and HyperPod clusters — and how to choose between them. Topics covered:
  • Foundation models and fine‑tuning
  • Bedrock as an alternative for hosted foundation models
  • HyperPod clusters for large‑scale distributed training
  • How to decide between SageMaker and Bedrock for different needs

Problem → Solution: foundation models

Problem 1 — Why building models from scratch is expensive

Traditional ML workflows often require:
  • Large, well‑labeled datasets
  • Significant compute (often GPUs) for training
  • Specialized expertise to build, tune, and deploy models
  • Long development cycles before a model is production ready
These constraints can make building models from scratch slow and costly.
A presentation slide titled "Problem: Traditional Machine Learning Models" that lists four challenges: extensive labeled data for training, high computational resources, expertise to fine-tune and deploy models, and long development cycles before production.

Solution — Foundation models

A foundation model is a large, pre‑trained model that you can use as‑is or adapt for a specific task. SageMaker supports importing and hosting many foundation models (both open source and vendor artifacts), enabling rapid experimentation and deployment for NLP, vision, speech, code generation, and more. Availability for self‑hosting depends on licensing and vendor restrictions: open models can usually be imported and hosted in your SageMaker account, while some proprietary models are only offered via vendor APIs. There are two ways to use foundation models in SageMaker:
  • Host a model artifact in SageMaker and create an endpoint for inference.
  • Fine‑tune the model in SageMaker and then deploy the tuned artifact.
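The first option (hosting plus an inference endpoint) can be sketched in Python. The payload builder below follows the common Hugging Face text‑generation request convention; the endpoint name and parameter names are illustrative assumptions, and the actual schema depends on the model container you deploy.

```python
import json


def build_generation_request(prompt: str, max_new_tokens: int = 256,
                             temperature: float = 0.7) -> bytes:
    """Build a JSON inference payload for a hosted text-generation endpoint.

    The exact schema varies by model container; this shape follows the common
    Hugging Face text-generation convention (an assumption, not a guarantee).
    """
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature},
    }
    return json.dumps(payload).encode("utf-8")


def invoke(endpoint_name: str, prompt: str) -> dict:
    """Call a SageMaker real-time endpoint (requires AWS credentials)."""
    import boto3  # imported lazily so the payload helper stays dependency-free

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_generation_request(prompt),
    )
    return json.loads(response["Body"].read())
```

Keeping the payload construction separate from the network call makes the request format easy to unit‑test without AWS access.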
Fine‑tuning tip: use parameter‑efficient methods (for example LoRA or other PEFT techniques) so you only update a small subset of parameters, which reduces compute, storage, and cost compared to full‑parameter retraining. The workflow is otherwise standard: run a SageMaker training job, produce a custom artifact, and deploy it to an endpoint.
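To see why PEFT is cheaper, compare trainable parameter counts: a LoRA adapter of rank r on a d × k weight matrix trains r·(d + k) parameters instead of updating all d·k. A quick back‑of‑the‑envelope sketch (the layer dimensions are illustrative):

```python
def lora_trainable_params(d: int, k: int, rank: int) -> int:
    """Parameters in a rank-`rank` LoRA adapter (two low-rank factors) for a d x k weight."""
    return rank * (d + k)


def full_finetune_params(d: int, k: int) -> int:
    """Parameters updated by full fine-tuning of the same weight matrix."""
    return d * k


# Illustrative transformer projection layer: 4096 x 4096, LoRA rank 8.
d = k = 4096
lora = lora_trainable_params(d, k, rank=8)   # 8 * (4096 + 4096) = 65_536
full = full_finetune_params(d, k)            # 4096 * 4096 = 16_777_216
print(f"LoRA updates {lora / full:.2%} of this layer's weights")  # ~0.39%
```

The same ratio holds per adapted layer, which is why a LoRA artifact is often megabytes rather than the gigabytes of a fully retrained model.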
A presentation slide titled "Solution: Foundation Models" with a subheading "Models for text, vision, and speech." Below are three circular icons labeled Natural Language Processing (NLP), Image Generation, and Code Generation.
A presentation slide titled "Solution: Foundation Models" showing a "Ready-to-use models" label and icons for several AI models. The slide displays GPT-4, LLaMA, Claude, FalconGPT, Stable Diffusion, and Falcon.

Browsing and selecting models in SageMaker Studio

In SageMaker Studio you can browse providers and models, inspect metadata, and import model artifacts directly into your account. Provider cards indicate model counts and sometimes show whether a model is “Bedrock Ready,” helping you select an appropriate model for import or hosted inference.
A presentation slide titled "Solution: Foundation Models" showing a dashboard of provider cards for various AI model vendors (Hugging Face, Meta, Stability AI, Cohere, TensorFlow, PyTorch, etc.). Each card displays the provider logo, a short description, and the number of available models.

Bedrock — a hosted alternative for foundation models

Amazon Bedrock is a separate AWS service that provides managed access to Bedrock‑ready foundation models via a unified API. Key benefits:
  • Standardized API with pay‑per‑call pricing (no infra provisioning)
  • Automatic scaling and fully managed hosting
  • Ability to swap providers or models by changing only the model identifier, leaving client logic untouched
Bedrock is ideal when you want quick, hosted access to vendor models and do not need custom infra control or deep fine‑tuning.
A presentation slide titled "Solution: Bedrock Alternative for Foundation Models" with an Amazon Bedrock logo. It lists five benefits: consistent API access to Bedrock-ready models, auto-scaling pay-per-call, no infrastructure setup, no endpoint provisioning, and simple API calls as an alternative to SageMaker endpoints.
Bedrock and SageMaker solve different needs: choose Bedrock for simple, hosted model access with pay‑per‑call billing and minimal infrastructure work. Choose SageMaker if you require full control over model artifacts, training workflows, VPC isolation, or advanced fine‑tuning.
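A minimal sketch of the pay‑per‑call pattern, assuming a Claude model on Bedrock. The body builder follows the Anthropic Messages schema as exposed by Bedrock; verify the exact fields and the model ID against the current Bedrock documentation before relying on them.

```python
import json


def build_claude_messages_body(prompt: str, max_tokens: int = 512) -> str:
    """Build a request body for an Anthropic Claude model on Bedrock.

    Field names follow the Anthropic Messages API as surfaced by Bedrock;
    treat the exact schema as an assumption to verify against current docs.
    """
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })


def invoke_bedrock(model_id: str, prompt: str) -> dict:
    """One pay-per-call inference via the Bedrock runtime (requires AWS credentials)."""
    import boto3

    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(modelId=model_id, body=build_claude_messages_body(prompt))
    return json.loads(response["body"].read())
```

Note what is absent compared with the SageMaker path: no endpoint to provision, no instance type to pick, and nothing to scale down afterward.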

Decision table: SageMaker vs Bedrock

Resource  | Best for                                     | When to choose
Bedrock   | Managed, hosted model inference              | Quick prototyping or production inference without infra management; pay‑per‑call billing
SageMaker | Full control (training, hosting, networking) | Custom training/fine‑tuning, private deployments (VPC), specialized endpoints, or self‑hosting artifacts
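The table above condenses to a simple rule of thumb: any "full control" requirement points to SageMaker, otherwise Bedrock's managed hosting is simpler. This helper just encodes those two rows (a simplification; real decisions weigh cost, latency, and model availability too):

```python
def choose_service(needs_custom_training: bool = False,
                   needs_vpc_isolation: bool = False,
                   needs_self_hosted_artifacts: bool = False) -> str:
    """Rule of thumb from the decision table: any full-control need implies
    SageMaker; otherwise Bedrock's managed, pay-per-call hosting is simpler."""
    if needs_custom_training or needs_vpc_isolation or needs_self_hosted_artifacts:
        return "SageMaker"
    return "Bedrock"


print(choose_service())                           # Bedrock
print(choose_service(needs_vpc_isolation=True))   # SageMaker
```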
A presentation slide titled "Solution: Bedrock Alternative for Foundation Models" showing three teal-highlighted boxes with icons and captions: "Provision an endpoint," "Deploy the model," and "Manage scaling and costs." Each box contains a simple line icon (shield/hand, rocket, and money bags with a gear).
A presentation slide titled "Solution: Foundation Models" that lists three SageMaker use cases: full model control (deploy a custom Sonnet), heavy fine-tuning (more access than Bedrock), and private deployment (SageMaker endpoint for security/compliance).

Problem → Solution: HyperPod for massive distributed training

Problem — training very large models is hard

Training extremely large models often requires:
  • Huge compute and memory capacity (multi‑GPU, multi‑node)
  • Efficient, low‑latency, high‑bandwidth networking
  • Orchestration across hundreds or thousands of nodes
  • Fault tolerance and automatic recovery to protect long-running jobs
A presentation slide titled "Problem: Training Large-Scale Foundation Models" that lists requirements for training such models: massive compute resources with high memory and parallel processing, distributed multi‑GPU training, optimized cost‑efficient infrastructure, and seamless orchestration of thousands of compute instances.

Solution — HyperPod clusters

HyperPod clusters are a managed, high‑performance compute environment inside SageMaker built for coordinating very large distributed training jobs. Highlights:
  • SLURM‑based scheduling that orchestrates controller, login, and worker nodes
  • Job resilience with automatic restarts and node reallocation on failure
  • Network topology optimized for low latency and high bandwidth
  • Integration with shared storage (for example Amazon FSx for Lustre) and lifecycle scripts stored in Amazon S3
HyperPod is designed for large runs (hundreds of nodes, many GPUs). For single‑node or small multi‑GPU jobs, standard SageMaker training jobs are simpler and more cost‑effective.
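As a sketch, provisioning a HyperPod cluster programmatically uses the SageMaker CreateCluster API. The request shape below follows that API, but the instance types, group names, and lifecycle script name are illustrative assumptions to adapt to your account:

```python
def build_hyperpod_request(cluster_name: str, role_arn: str,
                           lifecycle_s3_uri: str, worker_count: int) -> dict:
    """Assemble a CreateCluster request for a SageMaker HyperPod cluster.

    Field names follow the SageMaker CreateCluster API; instance types, group
    names, and the lifecycle script name are illustrative assumptions.
    """
    lifecycle = {"SourceS3Uri": lifecycle_s3_uri, "OnCreate": "on_create.sh"}
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            {   # Controller node that runs the Slurm scheduler
                "InstanceGroupName": "controller",
                "InstanceType": "ml.m5.xlarge",
                "InstanceCount": 1,
                "LifeCycleConfig": lifecycle,
                "ExecutionRole": role_arn,
            },
            {   # GPU worker nodes that run the distributed training job
                "InstanceGroupName": "workers",
                "InstanceType": "ml.p4d.24xlarge",
                "InstanceCount": worker_count,
                "LifeCycleConfig": lifecycle,
                "ExecutionRole": role_arn,
            },
        ],
    }


def create_cluster(**kwargs) -> dict:
    """Submit the request (requires AWS credentials and service quotas)."""
    import boto3

    return boto3.client("sagemaker").create_cluster(**build_hyperpod_request(**kwargs))
```

The lifecycle scripts referenced from S3 are what bootstrap Slurm and shared storage on each node, matching the architecture in the diagram above.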
An AWS architecture diagram of an Amazon SageMaker HyperPod HPC cluster. It shows users connecting via AWS Systems Manager into a VPC with Amazon FSx for Lustre, controller/login/worker nodes running Slurm components, and lifecycle scripts stored in an S3 bucket.

When to use HyperPod

  • Use HyperPod for large foundation model training across many nodes when you need optimized networking and strong fault recovery.
  • Avoid HyperPod for short experiments or models that fit on a few GPUs — prefer regular SageMaker training jobs in those cases.

Summary and quick recommendations

  • Foundation models give immediate access to large pre‑trained models for NLP, vision, speech, and code tasks; you can self‑host in SageMaker or call vendor models via Bedrock.
  • Fine‑tuning with PEFT methods (for example LoRA) makes customizing foundation models practical without re‑training billions of weights.
  • Bedrock is a fully managed, pay‑per‑call service suitable for teams that want hosted model access with minimal infra management.
  • HyperPod clusters provide SLURM‑based orchestration and optimized networking for very large, distributed training jobs that require fault tolerance and high throughput.
Dive deeper into each feature in the AWS documentation and evaluate these architecture choices against your workload's scale, control, and cost requirements.
