Understanding the Foundations of Reliability in any Design

Welcome, Solutions Architect Associates. In this lesson, we explore how reliability is built into any design—a comprehensive overview that also explains the shared responsibilities between you and AWS.

The Shared Responsibility Model

For the Solutions Architect Associate exam, it is important to understand that while AWS manages much of the underlying infrastructure, you are responsible for key aspects such as application architecture, change management, and failure response. How much AWS handles depends on the service. With EC2 (infrastructure as a service), AWS operates the physical hosts and the virtualization layer. With RDS (platform as a service), it also operates the database software for you. For instance, you can still adjust parameters on an RDS instance, whereas DynamoDB is fully managed and accessed purely through API calls, with no direct server access.

The image illustrates the AWS Shared Responsibility Model for reliability, dividing responsibilities between the customer (resilience "in" the cloud) and AWS (resilience "of" the cloud). It highlights areas like infrastructure testing, workload architecture, and AWS global infrastructure components.

When using managed services, AWS assumes most reliability tasks. However, you maintain a crucial role in designing your application's architecture to better mitigate potential failures.

Areas of Focus

Below are four critical areas to consider while designing resilient systems:

Infrastructure Architecture
Ensure that components like networking, storage, and connectivity are designed with redundancy and resilience in mind. Cloud services enforce service quotas (also called service limits) to prevent accidental overprovisioning. Understanding these limits for AWS and any third-party services you depend on is essential.
Resilient architectures commonly feature redundant communication paths and efficient IP address management, as outlined in the Well-Architected Framework. AWS storage services also often maintain multiple copies of your data. Aurora, for example, keeps six copies of your data across three Availability Zones to reinforce reliability.
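As a small illustration of the IP address management point, the sketch below uses Python's standard `ipaddress` module to flag overlapping CIDR blocks and to carve equal subnets out of a larger block, one per Availability Zone. The function names and the example address ranges are hypothetical, chosen only for illustration. It does not call any AWS API.

```python
import ipaddress
from itertools import combinations, islice


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks in the list that overlap each other."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for a, b in combinations(networks, 2) if a.overlaps(b)]


def split_across_zones(vpc_cidr, new_prefix, zone_count):
    """Carve the first zone_count subnets of the given prefix length out of a block."""
    subnets = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return [str(s) for s in islice(subnets, zone_count)]


if __name__ == "__main__":
    # Hypothetical address plan: the second block sits inside the first one.
    print(find_overlaps(["10.0.0.0/16", "10.0.1.0/24", "10.1.0.0/16"]))
    # One /20 subnet for each of three zones, taken from a /16 block.
    print(split_across_zones("10.0.0.0/16", 20, 3))
```

Checking a plan like this before creating networks helps avoid overlapping ranges, which are a common obstacle when networks later need to be connected to each other.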
Application (Service) Architecture
Design your applications to be distributed and resilient. A microservices architecture can ensure that a failure in one component does not bring down the entire system. Define clear service contracts through APIs, SLAs, and SLOs to formalize how components interact.
Strategies to improve reliability include:
- Implementing graceful degradation to maintain service during partial failures.
- Using idempotent operations so that retries are safe, and rate limiting so that overload in one component does not cascade to others.
- Applying automatic retry strategies with exponential backoff to handle temporary service endpoint failures.
Additionally, incorporating stateless design principles—where session and configuration data are managed externally—allows services to recover independently.
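The retry strategy mentioned above can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not a definitive implementation: `TransientError`, `backoff_delays`, and `call_with_retries` are hypothetical names, and the random jitter is an addition that is commonly paired with exponential backoff so that many clients do not retry at the same instant. The operation being retried should be idempotent, otherwise a retry may apply the same change twice.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(max_attempts, base_delay=0.1, max_delay=5.0):
    """Return the capped exponential delay ceiling used before each retry."""
    return [min(max_delay, base_delay * 2 ** i) for i in range(max_attempts - 1)]


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again up to max_attempts."""
    delays = backoff_delays(max_attempts, base_delay, max_delay)
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Jitter: sleep a random time between 0 and the exponential ceiling.
            sleep(random.uniform(0, delays[attempt]))
```

The `sleep` parameter exists only so that the waiting can be replaced in tests. In normal use you would call `call_with_retries(my_operation)` and leave the defaults in place.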
Change Management
Reliable systems rely on robust change management practices. This includes continuous monitoring of system metrics, automatically scaling resources based on demand, and setting up proactive notifications to flag unexpected changes. Regular load testing, isolating changes to individual components, and automating deployment, testing, and rollback processes are key.
For example, if your service experiences an unexpected surge, automated systems should detect the shift, scale resources accordingly, and notify the administrators. Periodic reviews and enhancements to your change process will further stabilize your system.
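The detect, scale, and notify loop in that example can be sketched as follows. This is a simplified, hypothetical model: it scales capacity in proportion to how far a metric is from its target and clamps the result to a minimum and maximum. It is not the algorithm used by any particular AWS scaling policy, and `desired_capacity`, `evaluate`, and the `notify` callback are names invented for this illustration.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size=1, max_size=10):
    """Scale proportionally so the per-instance metric moves toward the target."""
    if current < 1 or target_value <= 0:
        raise ValueError("current must be >= 1 and target_value must be > 0")
    wanted = math.ceil(current * metric_value / target_value)
    # Clamp to the configured bounds so a metric spike cannot scale without limit.
    return max(min_size, min(max_size, wanted))


def evaluate(current, metric_value, target_value, notify, **limits):
    """Return the new capacity and call notify() when it differs from the current one."""
    new = desired_capacity(current, metric_value, target_value, **limits)
    if new != current:
        notify(f"capacity change: {current} -> {new} (metric={metric_value})")
    return new
```

For instance, with four instances running at 90 percent utilization against a 60 percent target, `evaluate(4, 90, 60, print)` reports a change and returns 6.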
Failure Management
Even with robust AWS infrastructure, you must have a plan to handle application-level failures. This involves regular backups, automated restoration processes, and continuous testing of your disaster recovery plans. Techniques such as Chaos Engineering or "game days" can be effective for simulating failures and verifying recovery strategies.
Proactive testing, autoscaling, load testing, and chaos engineering can lengthen the mean time between failures (MTBF) and shorten the time it takes to recover when issues do occur.
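To see why both the time between failures and the speed of recovery matter, it can help to look at the standard steady-state availability formula. The numbers below are illustrative assumptions, not measurements from any real workload. MTTR here stands for mean time to recovery.

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}

% Illustrative values: one failure every 720 hours, 1 hour to recover
A = \frac{720}{720 + 1} \approx 0.9986 \quad (\text{about } 99.86\%)

% Same failure rate, recovery automated down to 0.1 hours
A = \frac{720}{720 + 0.1} \approx 0.99986 \quad (\text{about } 99.986\%)
```

In this example, cutting recovery time by a factor of ten improves availability without making failures any less frequent, which is why tested, automated recovery is as valuable as failure prevention.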

Remember:

Designing resilient systems requires balancing what AWS manages for you against your application's own configuration and response strategies.

Summary

The AWS Shared Responsibility Model clearly splits roles: AWS manages the underlying infrastructure, while you are accountable for your application's reliability through strategic design in infrastructure, application architecture, change management, and failure management. As the level of managed services increases, direct responsibility for reliability decreases, but understanding these components remains crucial for building resilient systems.

The image illustrates the AWS Shared Responsibility Model, dividing security responsibilities between the customer and AWS. It shows customer responsibilities for security "in" the cloud and AWS responsibilities for security "of" the cloud, including data, applications, infrastructure, and more.

The image is a summary slide explaining the AWS Shared Responsibility Model, highlighting customer and AWS responsibilities, and noting that more managed services reduce customer responsibility.

I'm Michael Forrester. If you have any questions, feel free to reach out at [email protected]. I look forward to connecting with you in the forums, and I hope you found this lesson insightful.
