Turning up Reliability on Management Services

Welcome back, Solutions Architects. In this lesson, we explore how to enhance reliability across AWS management and governance services. We’ll cover provisioning, observability, and various management tools, providing insights into their inherent design and how they support resilience without requiring additional configuration.

Provisioning

AWS CloudFormation is the native AWS provisioning service and the backbone of resource creation and management. Code-based tools such as the AWS Cloud Development Kit (CDK) version 2 ultimately synthesize CloudFormation templates, so understanding CloudFormation is essential even if you never write a template by hand. Third-party tools like Terraform are the exception: they call the AWS APIs directly rather than generating CloudFormation templates.

The image illustrates the process of using AWS CloudFormation for creating and configuring stack resources, highlighting steps from template creation to stack deployment. It notes that CloudFormation is primarily a provisioning tool with limited direct impact on reliability.
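The path from template to stack can be sketched in a few lines of Python. This is a minimal illustration, not a recommended deployment pipeline: the logical ID ExampleBucket, the helper names build_template and deploy, and the choice of a single S3 bucket are all assumptions made for the example. The deploy helper uses the boto3 CloudFormation client and needs AWS credentials, so it is defined but not called here.

```python
import json


def build_template(bucket_logical_id: str = "ExampleBucket") -> dict:
    """Return a minimal CloudFormation template declaring one S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal illustrative stack (hypothetical example)",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


def deploy(stack_name: str, template: dict) -> None:
    """Create the stack and wait until CloudFormation reports completion.

    Requires boto3 and AWS credentials, so the import is local and this
    function is not called in the sketch.
    """
    import boto3

    cfn = boto3.client("cloudformation")
    cfn.create_stack(StackName=stack_name, TemplateBody=json.dumps(template))
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)


# Render the template exactly as it would be submitted to CloudFormation.
template_body = json.dumps(build_template(), indent=2)
print(template_body)
```

Note that nothing in the template or the create call tunes the reliability of CloudFormation itself; the service handles that on its own.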

CloudFormation is built for inherent reliability: the service is fully managed, and you do not configure its availability or supply custom retry policies. Similarly, the AWS CDK, while providing a code-driven approach to creating CloudFormation templates, does not extend reliability features beyond those already available in CloudFormation.

The image illustrates the architecture of the AWS Cloud Development Kit (CDK) version 2, showing how constructs like Amazon SQS, AWS Lambda, Amazon S3, and Amazon DynamoDB are organized within stacks and deployed using AWS CloudFormation. It also mentions that the CDK is a code-based extension of CloudFormation with minimal reliability design considerations.

For operations teams, visibility into the generated resources is crucial for troubleshooting. Running cdk synth against a CDK application produces the CloudFormation template it will deploy, which lets the team track and manage exactly the resources the code creates.
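As a rough sketch of that kind of visibility, the Python below lists the logical ID and type of every resource in a synthesized template. The template excerpt is made up for illustration (the queue, function, and table names are hypothetical); in practice the JSON would come from the cdk.out directory that cdk synth writes.

```python
import json

# Hypothetical excerpt of a synthesized template. A real one would be read
# from the cdk.out directory after running `cdk synth`.
SAMPLE_TEMPLATE = """
{
  "Resources": {
    "OrdersQueue1A2B3C4D": {"Type": "AWS::SQS::Queue"},
    "OrdersHandler5E6F7A8B": {"Type": "AWS::Lambda::Function"},
    "OrdersTable9C0D1E2F": {"Type": "AWS::DynamoDB::Table"}
  }
}
"""


def list_resources(template_text: str) -> list[tuple[str, str]]:
    """Return (logical ID, resource type) pairs sorted by resource type."""
    resources = json.loads(template_text).get("Resources", {})
    pairs = [(logical_id, body["Type"]) for logical_id, body in resources.items()]
    return sorted(pairs, key=lambda pair: pair[1])


# Print a small inventory an operator could scan during troubleshooting.
for logical_id, resource_type in list_resources(SAMPLE_TEMPLATE):
    print(f"{resource_type:<24} {logical_id}")
```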

The image presents a question about which AWS CDK feature would enable visibility into resources for troubleshooting reliability issues, with four possible options listed.

Observability

Observability is key to managing application performance and ensuring operational reliability. Amazon CloudWatch provides a robust suite of monitoring features including logs, alarms, events, and insights. Designed for high availability, CloudWatch serves as a central hub for tracking metrics, troubleshooting issues, and guiding performance optimizations.
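To make the alarm feature concrete, here is a minimal sketch of the parameters for a CPU alarm on one EC2 instance. The helper names, the alarm naming scheme, and the 80 percent threshold are assumptions chosen for illustration; the keys themselves follow the CloudWatch PutMetricAlarm API as exposed by boto3. The create_alarm helper needs credentials and is not called here.

```python
def build_cpu_alarm(instance_id: str, threshold_percent: float = 80.0) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds covered by each datapoint
        "EvaluationPeriods": 3,  # number of most recent periods evaluated
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }


def create_alarm(params: dict) -> None:
    """Send the alarm definition to CloudWatch (needs boto3 and credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


alarm = build_cpu_alarm("i-0123456789abcdef0")
print(alarm["AlarmName"], alarm["Threshold"])
```

The alarm definition describes what to watch in your workload; there is still nothing to configure about the availability of CloudWatch itself.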

The image is a presentation slide about "Designing for Reliability – CloudWatch," featuring a screenshot of the Amazon CloudWatch interface and a description of its reliability for logging and metrics.

In production, when performance issues arise, CloudWatch brings metrics, logs, and events together in dashboards for in-depth analysis. Additionally, AWS Health and its dashboard (formerly the Personal Health Dashboard, now the AWS Health Dashboard) offer operational status insights in the same hub style as Security Hub or Migration Hub, without the need for extra reliability configuration.

The image shows a slide titled "Designing for Reliability – AWS Health or Personal Health Dashboard," featuring a screenshot of an AWS event log with various operational issues listed as closed. It also includes a note about the Personal Health Dashboard being a highly managed status board.

Other Management Services

AWS offers a broad range of management services that are inherently resilient and designed to function without manual reliability tweaks.

Managed Prometheus and Grafana

Amazon Managed Service for Prometheus scales and recovers automatically, offering a plug-and-play experience with no servers to configure. Integrating it with Amazon Managed Grafana gives you a unified dashboard view built on the same managed-reliability design.

The image is a diagram illustrating the setup for Amazon Managed Service for Prometheus, showing how metrics are ingested from sources such as the Grafana Cloud Agent and a Prometheus server, and then queried by Amazon Managed Grafana. It highlights the automatic scaling and recovery features of the service.

The image is a diagram illustrating the integration of Amazon Managed Grafana with an Amazon Redshift cluster, showing data flow and components like the Redshift data source plugin and web browser dashboard. It highlights features like autoscaling and autohealing.

For organizations seeking a single pane of glass for application metrics, Amazon Managed Grafana provides a highly dependable dashboard without requiring manual reliability adjustments.

The image presents a scenario where a company is migrating its application monitoring stack to AWS, requiring highly available dashboards. It lists four approaches to meet reliability requirements, including using Amazon Managed Grafana and deploying Grafana servers on EC2 instances.

Trusted Advisor and Compute Optimizer

AWS Trusted Advisor and AWS Compute Optimizer help improve fault tolerance and cost efficiency. Trusted Advisor's fault tolerance checks require a Business Support plan or higher, while Compute Optimizer recommends better-fitting resource configurations. Both are fully managed and expose no reliability settings of their own.
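A short sketch of how the fault tolerance checks could be pulled out programmatically follows. It assumes the response shape of the AWS Support API call describe_trusted_advisor_checks, where each check carries a category field; the sample response and its check IDs are hypothetical stand-ins so the filter can run without an account. The fetch_checks helper needs a qualifying support plan and credentials, so it is not called here.

```python
def fault_tolerance_checks(response: dict) -> list[str]:
    """Return the names of checks in the fault tolerance category, sorted."""
    return sorted(
        check["name"]
        for check in response.get("checks", [])
        if check.get("category") == "fault_tolerance"
    )


def fetch_checks() -> dict:
    """Fetch all Trusted Advisor checks (needs boto3 and credentials)."""
    import boto3

    # The AWS Support API is served from the us-east-1 endpoint.
    support = boto3.client("support", region_name="us-east-1")
    return support.describe_trusted_advisor_checks(language="en")


# Hypothetical sample mimicking the API response; the IDs are placeholders.
SAMPLE_RESPONSE = {
    "checks": [
        {"id": "hypothetical-1", "name": "Amazon RDS Multi-AZ", "category": "fault_tolerance"},
        {"id": "hypothetical-2", "name": "Low Utilization Amazon EC2 Instances", "category": "cost_optimizing"},
        {"id": "hypothetical-3", "name": "Amazon EBS Snapshots", "category": "fault_tolerance"},
    ]
}

print(fault_tolerance_checks(SAMPLE_RESPONSE))
```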

The image presents a question about which AWS tool provides the most comprehensive recommendations for improving fault tolerance, with options including AWS Trusted Advisor, Amazon CloudWatch Synthetics, AWS X-Ray, and AWS Compute Optimizer.

The image shows a dashboard from AWS Compute Optimizer with recommendations for EC2 instances, highlighting over-provisioned instances and suggesting optimized instance types. It also mentions that Compute Optimizer is used to enhance performance, scale, cost, and reliability.

The image presents a scenario where a company needs AWS service recommendations for optimizing EC2 resource configuration, highlighting AWS Compute Optimizer as the suitable service for this requirement.
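The over-provisioned findings mentioned above can be filtered out of a Compute Optimizer response with a few lines of Python. This is a sketch under assumptions: the field names follow the boto3 get_ec2_instance_recommendations response as best understood here, and the sample data (instance names and types) is invented for illustration. The fetch helper needs credentials and is not called.

```python
def overprovisioned(response: dict) -> list[tuple[str, str, str]]:
    """Return (instance name, current type, top-ranked suggested type)."""
    rows = []
    for rec in response.get("instanceRecommendations", []):
        if rec.get("finding", "").lower() != "overprovisioned":
            continue
        # Rank 1 is treated as the preferred option.
        options = sorted(rec.get("recommendationOptions", []), key=lambda o: o["rank"])
        if options:
            rows.append(
                (rec["instanceName"], rec["currentInstanceType"], options[0]["instanceType"])
            )
    return rows


def fetch_recommendations() -> dict:
    """Fetch EC2 recommendations (needs boto3 and credentials)."""
    import boto3

    return boto3.client("compute-optimizer").get_ec2_instance_recommendations()


# Hypothetical sample mimicking the API response.
SAMPLE_RESPONSE = {
    "instanceRecommendations": [
        {
            "instanceName": "web-1",
            "currentInstanceType": "m5.2xlarge",
            "finding": "Overprovisioned",
            "recommendationOptions": [
                {"instanceType": "m5.large", "rank": 2},
                {"instanceType": "m5.xlarge", "rank": 1},
            ],
        },
        {
            "instanceName": "db-1",
            "currentInstanceType": "r5.large",
            "finding": "Optimized",
            "recommendationOptions": [{"instanceType": "r5.large", "rank": 1}],
        },
    ]
}

for name, current, suggested in overprovisioned(SAMPLE_RESPONSE):
    print(f"{name}: {current} -> {suggested}")
```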

Organizations and Control Tower

While AWS Organizations is focused on account management rather than service reliability, AWS Control Tower automates account setup and applies best practices such as service control policies (SCPs) and centralized auditing. This automation inherently promotes resiliency across multiple accounts.
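For a sense of what a service control policy looks like, the sketch below builds one that denies tampering with CloudTrail logging. It is an illustrative policy written for this example, not the text of any control that Control Tower ships; the Sid, policy name, and helper names are assumptions. The create_scp helper uses the boto3 Organizations client and is not called here.

```python
import json


def build_audit_guardrail() -> dict:
    """Return an SCP document that blocks stopping or deleting CloudTrail trails."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProtectCloudTrail",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            }
        ],
    }


def create_scp(name: str, policy: dict) -> str:
    """Register the SCP in AWS Organizations and return its policy ID.

    Needs boto3 and management-account credentials; not called in this sketch.
    """
    import boto3

    org = boto3.client("organizations")
    response = org.create_policy(
        Name=name,
        Description="Illustrative audit guardrail",
        Type="SERVICE_CONTROL_POLICY",
        Content=json.dumps(policy),
    )
    return response["Policy"]["PolicySummary"]["Id"]


print(json.dumps(build_audit_guardrail(), indent=2))
```

A policy created this way has no effect until it is attached to an account or organizational unit.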

The image presents a question about which AWS service would help enforce policies and controls across multiple accounts, with four options: AWS Organizations, AWS Config, AWS CloudTrail, and AWS Trusted Advisor.

The image is a diagram illustrating the AWS Control Tower setup for designing reliability, showing workflows involving AWS services like EventBridge, Lambda, S3, CodePipeline, and CloudFormation. It highlights best practices for setting up AWS accounts with reliability considerations.

Systems Manager

AWS Systems Manager provides extensive operational capabilities to manage both cloud and on-premises resources efficiently. It is engineered to support critical tasks such as OS patching and compliance checks reliably without additional configuration.
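As one hedged example of the patching workflow, the sketch below builds a Run Command request that scans tagged instances for missing patches using the AWS-RunPatchBaseline document. The tag key and value, the concurrency and error limits, and the helper names are assumptions chosen for illustration. The run helper needs credentials and managed instances, so it is not called here.

```python
def build_patch_scan(tag_key: str, tag_value: str) -> dict:
    """Build keyword arguments for the Systems Manager SendCommand API."""
    return {
        "DocumentName": "AWS-RunPatchBaseline",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"Operation": ["Scan"]},  # report only, install nothing
        "MaxConcurrency": "10%",  # run on at most 10% of targets at a time
        "MaxErrors": "5%",        # stop sending once 5% of targets have failed
    }


def run(params: dict) -> str:
    """Send the command and return its ID (needs boto3 and credentials)."""
    import boto3

    response = boto3.client("ssm").send_command(**params)
    return response["Command"]["CommandId"]


scan = build_patch_scan("Environment", "production")
print(scan["Targets"])
```

Switching the Operation parameter from Scan to Install would apply the patches instead of only reporting on them.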

The image is a diagram illustrating the AWS Systems Manager for designing reliability, showing its integration with various AWS services and components like VPC, CloudWatch, and S3. It highlights the management of IoT devices, corporate data centers, and other cloud providers.

Service Catalog and License Manager

AWS Service Catalog allows you to create and manage portfolios of approved products while enforcing company policies. This standardization streamlines deployments, though it does not independently adjust reliability beyond that provided by the underlying services.

The image is a diagram titled "Designing for Reliability – Service Catalog," illustrating the process of creating and managing product portfolios, distributing them to organizations or users, and maintaining the catalog. It includes steps like obtaining products, organizing portfolios, managing constraints, and using a TagOption Library for consistent resource tagging.

Similarly, AWS License Manager centralizes software license tracking and compliance. It is a fully managed service with no reliability settings to tune.

The image is a slide titled "Designing for Reliability – License Manager," showing an AWS License Manager dashboard with license configurations and usage alerts. It mentions that AWS License Manager is already designed for reliability.

The image presents a scenario where a company needs to manage software license compliance across 100 EC2 instances, with AWS License Manager identified as the suitable service for this requirement.

Proton

AWS Proton is designed for container and serverless deployments, and its environment and service templates can build in reliability measures such as multi-AZ deployments and data backup/restore. Although Proton leaves a few reliability "knobs" to the templates (such as scheduled backups), its overall design prioritizes high resiliency out of the box.

The image illustrates a diagram of AWS Proton's design for reliability, showing development and production accounts with ECS clusters, Fargate services, and a CI/CD pipeline. It highlights AWS Proton's data backup and restore functions for containers and databases.

The image presents a scenario where a company needs to integrate reliable backups into an AWS architecture, offering four approaches: using AWS Proton components, adding a Lambda function, configuring native application backups, or using AWS Proton environment templates.

Tag Editor, Resource Explorer, Resource Groups, and Resource Access Manager

AWS provides several tools to organize and manage your resources effectively:

Tag Editor: Enables efficient tagging across services.
Resource Explorer and Resource Groups: Help in grouping and identifying related resources.
Resource Access Manager: Facilitates secure resource sharing.

Each tool is designed to operate reliably by default, without configurable reliability options.
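Grouping by tag can be sketched as a tag-based resource group definition. The group name, tag key, tag value, and helper names below are hypothetical; the request shape follows the Resource Groups CreateGroup API as exposed by boto3, where the query itself is passed as a JSON string. The create_group helper needs credentials and is not called here.

```python
import json


def build_group_request(group_name: str, tag_key: str, tag_value: str) -> dict:
    """Build keyword arguments for the Resource Groups CreateGroup API."""
    query = {
        "ResourceTypeFilters": ["AWS::AllSupported"],
        "TagFilters": [{"Key": tag_key, "Values": [tag_value]}],
    }
    return {
        "Name": group_name,
        "ResourceQuery": {
            "Type": "TAG_FILTERS_1_0",
            "Query": json.dumps(query),  # the API expects the query as a string
        },
    }


def create_group(request: dict) -> None:
    """Create the resource group (needs boto3 and credentials)."""
    import boto3

    boto3.client("resource-groups").create_group(**request)


request = build_group_request("orders-app", "Application", "orders")
print(request["ResourceQuery"]["Query"])
```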

The image shows a screenshot of a Tag Editor interface, listing various resources like EC2 instances and S3 buckets with their details. It includes a note stating that the Tag Editor is highly managed with no reliability elements.

The image is a slide titled "Designing for Reliability – Resource Group," showing a screenshot of an AWS Systems Manager interface for managing resource groups, specifically DynamoDB tables. It explains that a resource group is a collection of tags with low data sensitivity but high reliability.

Resilience Hub

AWS Resilience Hub assesses the resiliency of your applications by analyzing their configurations and suggesting best practices. You describe the application and its resiliency targets manually, and the tool responds with recommendations; it does not modify your resources to improve reliability itself.

The image shows a screenshot of the AWS Resilience Hub interface, detailing a workflow for assessing and improving application resiliency, alongside a description of its purpose.

Key Takeaway

Most AWS management and governance services are engineered to be inherently reliable. They come with built-in resilience, auto-healing, and integration with observability tools, so they rarely need manual reliability adjustments.

Conclusion

In summary, AWS management and governance services are built on robust, resilient foundations. Whether provisioning through CloudFormation/CDK, monitoring with CloudWatch, or leveraging tools like Trusted Advisor, Control Tower, and Systems Manager, these services are designed to function reliably without additional configuration. Embrace AWS best practices and utilize these tools as intended to ensure optimal performance and resiliency in your environment.

Michael Forrester thanks you for following along in this lesson. If you have any questions, please join the forums or reach out directly at [email protected]. Catch you in the next lesson.
