AWS Solutions Architect Associate Certification

Designing for Reliability

Understanding the Foundations of Reliability in any Design

Welcome, Solutions Architect Associates. In this lesson, we explore how reliability is built into any design—a comprehensive overview that also explains the shared responsibilities between you and AWS.

The Shared Responsibility Model

For the Solutions Architect Associate exam, it is important to understand that while AWS manages the underlying infrastructure, you remain responsible for key aspects such as application architecture, change management, and failure response. How much AWS manages depends on the service. With EC2 (infrastructure as a service), AWS runs the hardware and virtualization layer while you manage the operating system and everything above it. With RDS (a managed platform service), AWS also operates the database host, and you tune the database through parameters rather than server access. Services like DynamoDB are fully managed: you work with them entirely through API calls and never touch a server.

The image illustrates the AWS Shared Responsibility Model for reliability, dividing responsibilities between the customer (resilience "in" the cloud) and AWS (resilience "of" the cloud). It highlights areas like infrastructure testing, workload architecture, and AWS global infrastructure components.

When using managed services, AWS assumes most reliability tasks. However, you maintain a crucial role in designing your application's architecture to better mitigate potential failures.

Areas of Focus

Below are four critical areas to consider while designing resilient systems:

  1. Infrastructure Architecture
    Ensure that components like networking, storage, and connectivity are designed with redundancy and resilience in mind. AWS services enforce service quotas (also called service limits) to guard against accidental overuse. Know the quotas that apply to your workload, for both AWS and third-party services, so that scaling or failover does not run into them unexpectedly.

    Resilient architectures commonly feature redundant communication paths and careful IP address management, as outlined in the Well-Architected Framework. AWS storage services also often maintain multiple copies of your data. Aurora, for example, keeps six copies of your data spread across three Availability Zones.
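
As a rough illustration of IP address management for a redundant layout, the sketch below carves one non-overlapping subnet per Availability Zone out of a single VPC range using Python's standard `ipaddress` module. The CIDR block, the zone names, and the `plan_subnets` helper are assumptions made up for this example, not values from the lesson.

```python
import ipaddress

# Hypothetical VPC range and Availability Zone names, chosen for illustration.
VPC_CIDR = "10.0.0.0/16"
AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]


def plan_subnets(vpc_cidr, azs, new_prefix):
    """Assign one non-overlapping subnet of size /new_prefix to each AZ."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)  # generator of equal-sized blocks
    plan = {}
    for az in azs:
        try:
            plan[az] = next(subnets)
        except StopIteration:
            raise ValueError(
                f"{vpc_cidr} cannot hold {len(azs)} /{new_prefix} subnets"
            ) from None
    return plan


plan = plan_subnets(VPC_CIDR, AZS, new_prefix=20)
for az, net in plan.items():
    # AWS reserves 5 addresses in every subnet, so usable hosts = total - 5.
    print(az, net, "usable:", net.num_addresses - 5)
```

Planning the ranges up front like this leaves room to add subnets later and avoids overlapping blocks, which are hard to fix once workloads are running in them.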

  2. Application (Service) Architecture
    Design your applications to be distributed and resilient. A microservices architecture can ensure that a failure in one component does not bring down the entire system. Define clear service contracts through APIs, SLAs, and SLOs to formalize how components interact.

    Strategies to improve reliability include:

    • Implementing graceful degradation to maintain service during partial failures.
    • Using idempotent operations and rate limiting to prevent cascading issues.
    • Applying automatic retry strategies with exponential backoff to handle temporary service endpoint failures.

    Additionally, incorporating stateless design principles—where session and configuration data are managed externally—allows services to recover independently.
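
The retry strategy mentioned above can be sketched in a few lines. This is a minimal example, assuming a hypothetical `TransientError` exception that marks failures worth retrying, and a `call_with_backoff` helper invented for illustration. It uses capped exponential backoff with random jitter so that many clients do not retry in lockstep. Retrying is only safe when the operation is idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure to the caller
            # Exponential growth (base * 2^(attempt-1)), capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random wait in [0, ceiling] spreads out retries.
            sleep(random.uniform(0, ceiling))


# Usage: an operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("temporary endpoint failure")
    return "ok"


# The demo passes a no-op sleep so it finishes instantly.
result = call_with_backoff(flaky, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

Limiting the number of attempts matters as much as the backoff itself: unbounded retries against a struggling service add load at the moment it can least absorb it.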

    The image is a slide titled "Foundations of Reliability – Application Architecture," outlining three steps: Segment Your Works, Build Applications, and Provide Service Contracts, each with an icon.

    The image outlines six principles of application architecture for reliability: fail gracefully, throttle requests, limit retry calls, limit queues, make services stateless, and make emergency levers.
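
One common way to throttle requests is a token bucket. The sketch below is an illustrative implementation, not a description of how any AWS service throttles internally. The `TokenBucket` class and its parameters are assumptions for this example, and the clock is injectable so the demo is deterministic.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject the request, e.g. with HTTP 429


# Demo with a fake clock: 2 requests/second, bursts of up to 3.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=3, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(4)]  # fourth request is throttled
fake_now[0] += 1.0                          # one second later, two tokens are back
later = [bucket.allow() for _ in range(3)]
print(burst, later)
```

Rejecting excess requests early keeps a surge from one caller from exhausting the resources that other callers depend on.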

  3. Change Management
    Reliable systems rely on robust change management practices. This includes continuous monitoring of system metrics, automatically scaling resources based on demand, and setting up proactive notifications to flag unexpected changes. Regular load testing, isolating changes to individual components, and automating deployment, testing, and rollback processes are key.

    The image outlines the "Foundations of Reliability – Change Management" with seven steps: monitoring components, defining metrics, sending notifications, automated responses, analytics and data display, reviewing data and metrics, and monitoring sessions.

    For example, if your service experiences an unexpected surge, automated systems should detect the shift, scale resources accordingly, and notify the administrators. Periodic reviews and enhancements to your change process will further stabilize your system.
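
The detect, scale, and notify loop described above can be sketched as follows. This is a simplified proportional rule written for illustration, not the algorithm AWS Auto Scaling uses. The function names, the limits, and the CPU figures are all assumptions.

```python
import math


def desired_capacity(current, observed, target, minimum, maximum):
    """Size the fleet so the per-instance metric returns to roughly `target`."""
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(current * observed / target)
    # Clamp to configured bounds so a bad metric cannot scale without limit.
    return max(minimum, min(maximum, wanted))


def react(current, observed, target, minimum=2, maximum=20, notify=print):
    """Compute the new fleet size and notify operators if it changed."""
    new = desired_capacity(current, observed, target, minimum, maximum)
    if new != current:
        notify(f"capacity {current} -> {new} "
               f"(metric {observed} vs target {target})")
    return new


# A surge pushes average CPU from the 50% target to 90% on 4 instances.
new_size = react(current=4, observed=90, target=50)
```

The upper and lower bounds are part of change management too: they keep an automated response from turning a faulty metric into a runaway change.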

    The image outlines four steps in change management for reliability: enabling/disabling changes, limiting changes to isolated zones, reviewing and improving change capability, and automating change.

  4. Failure Management
    Even with robust AWS infrastructure, you must have a plan to handle application-level failures. This involves regular backups, automated restoration processes, and continuous testing of your disaster recovery plans. Techniques such as Chaos Engineering or "game days" can be effective for simulating failures and verifying recovery strategies.

    The image shows four computer monitor icons in different colors, with the first and last monitors displaying a gear symbol. The text above reads "Foundations of Reliability – Failure Management."

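    Proactive testing, autoscaling, load testing, and chaos engineering help on two fronts: they lengthen the mean time between failures (MTBF) by exposing weaknesses before they cause outages, and they shorten recovery time by keeping recovery procedures rehearsed and automated.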
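
To see why both failure frequency and recovery speed matter, the sketch below uses the standard steady-state relation availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to recovery. The figures are invented for illustration and do not describe any real workload.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up, given mean time between failures and to recovery."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_hours(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year under the same assumptions."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical workload: one failure every 30 days on average.
baseline = availability(mtbf_hours=720, mttr_hours=1.0)
faster_recovery = availability(mtbf_hours=720, mttr_hours=0.1)  # automated restore
fewer_failures = availability(mtbf_hours=1440, mttr_hours=1.0)  # failures half as often

for label, value in [("baseline", baseline),
                     ("faster recovery", faster_recovery),
                     ("fewer failures", fewer_failures)]:
    print(f"{label}: {value:.5f}")
```

With these made-up numbers, cutting recovery time tenfold improves availability more than halving the failure rate does, which is one reason automated restoration and rehearsed recovery plans receive so much attention.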

Remember:

Designing resilient systems means balancing what AWS manages for you against your application's own configuration and failure-response strategies.

Summary

The AWS Shared Responsibility Model clearly splits roles: AWS manages the underlying infrastructure, while you are accountable for your application's reliability through strategic design in infrastructure, application architecture, change management, and failure management. As the level of managed services increases, direct responsibility for reliability decreases, but understanding these components remains crucial for building resilient systems.

The image illustrates the AWS Shared Responsibility Model, dividing security responsibilities between the customer and AWS. It shows customer responsibilities for security "in" the cloud and AWS responsibilities for security "of" the cloud, including data, applications, infrastructure, and more.

The image is a summary slide explaining the AWS Shared Responsibility Model, highlighting customer and AWS responsibilities, and noting that more managed services reduce customer responsibility.

I'm Michael Forrester. If you have any questions, feel free to reach out at [email protected]. I look forward to connecting with you in the forums, and I hope you found this lesson insightful.
