Welcome back, students. This article, presented by Michael Forrester from KodeKloud, covers designing for reliability in disaster recovery, a critical topic for both the AWS Solutions Architect exam and real-world implementations. Disaster recovery, a core part of business continuity, is essential because the ability to recover from service disruptions is at the heart of system reliability. Think of reliability and disaster recovery as peanut butter and chocolate: each enhances the other. Without a strong resiliency and recovery strategy, your workloads are vulnerable when disasters strike.

- Recovery Point Objective (RPO): The maximum amount of data, measured in time, that you are willing to lose. It dictates how often you must back up. Essentially, it answers the question, “At what point in time was your last backup?”
- Recovery Time Objective (RTO): The maximum acceptable time to restore service, incorporating the time needed to restore data and reactivate systems.

Understanding RPO and RTO
- RPO: Answers the question, “How much data can you afford to lose?” For instance, if backups run hourly, a failure just before the next backup loses up to one hour of data, so the achievable RPO is one hour.
- RTO: Answers, “How long does it take to fully restore the service?” If restoring that database takes four hours, the service is down for at least four hours no matter how little data was lost, and that period determines your RTO.
RTO focuses on service availability, while RPO is primarily concerned with data loss.
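
The hourly-backup, four-hour-restore scenario above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a prescribed method: the function names and the two-hour RTO target are assumptions made up for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure hits just before the next backup would have run."""
    return backup_interval


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Compare what the current setup can achieve against the stated targets."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": restore_time <= rto,
    }


result = meets_objectives(
    backup_interval=timedelta(hours=1),  # backups run hourly
    restore_time=timedelta(hours=4),     # database restore takes four hours
    rpo=timedelta(hours=1),              # assumed target: lose at most one hour of data
    rto=timedelta(hours=2),              # assumed target: be back within two hours
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```

The setup satisfies the data-loss target but misses the downtime target, which shows why the two objectives have to be evaluated separately.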


Disaster Recovery Models
Below is an overview of the disaster recovery models arranged from the simplest to the most sophisticated:

- Backup and Restore
Your on-premises data is backed up and, during a disaster, restored in the cloud. While straightforward and cost-effective, this method can lead to longer downtimes.

- Pilot Light
This model maintains a minimal version of your environment (typically just the database) in the cloud. In the event of a disaster, you quickly scale up the remaining components (e.g., application and frontend servers). Although recovery is faster than backup and restore, the process might take tens of minutes to fully restore service.

- Warm Standby
In this model, a scaled-down but fully functional version of your production environment runs in the cloud, handling a small portion of production traffic. In a disaster, this standby system scales up quickly to manage full production load.

- Active-Active (Multi-site)
In this strategy, traffic is distributed across two or more active sites (such as multiple regions or a hybrid of on-premises and cloud environments). If one site fails, another takes over instantly with virtually no downtime. This approach offers near-zero RPO (aside from replication delays) and minimal RTO, but it comes with increased complexity and higher costs.
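
To make the active-active behavior concrete, here is a small simulation of traffic being spread across sites and shifting when one fails. It is only a sketch under stated assumptions: `Site`, `route_requests`, and the site names are hypothetical, and in a real deployment a load balancer or DNS health checks would do this routing rather than application code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool = True


def route_requests(sites, request_count):
    """Distribute requests round-robin across the sites that are currently healthy."""
    healthy = [site for site in sites if site.healthy]
    if not healthy:
        raise RuntimeError("no healthy site available")
    counts = Counter()
    for i in range(request_count):
        counts[healthy[i % len(healthy)].name] += 1
    return dict(counts)


sites = [Site("region-a"), Site("region-b")]  # hypothetical site names
before = route_requests(sites, 10)
print(before)  # {'region-a': 5, 'region-b': 5}

sites[0].healthy = False  # simulate a disaster in region-a
after = route_requests(sites, 10)
print(after)  # {'region-b': 10}
```

Because the surviving site is already serving traffic, nothing has to be started or scaled from zero during failover, which is where the minimal RTO comes from.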




Summary
Disaster recovery planning is an essential component of your overall business continuity strategy. Balancing acceptable downtime (RTO) with acceptable data loss (RPO) while remaining within budget constraints is key. The models discussed—from backup and restore to active-active—define not only the technical approach to disaster recovery but also the associated costs and business impacts.
- RTO determines the allowable downtime.
- RPO determines the maximum acceptable data loss.
- As you progress from backup and restore to active-active, both recovery time and potential data loss shrink, at the price of higher cost and increased system complexity.
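
One way to reason about this trade-off is to pick the cheapest model whose typical recovery characteristics still meet your targets. The sketch below assumes rough, illustrative RPO and RTO figures for each model; they are placeholders chosen to match the ordering described in this article, not measured or guaranteed values, and `cheapest_model` is a hypothetical helper.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive.
# (model name, assumed typical RPO, assumed typical RTO): illustrative values only.
MODELS = [
    ("Backup and Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", timedelta(minutes=30), timedelta(minutes=30)),
    ("Warm Standby", timedelta(minutes=5), timedelta(minutes=5)),
    ("Active-Active (Multi-site)", timedelta(seconds=5), timedelta(seconds=30)),
]


def cheapest_model(rpo_target, rto_target):
    """Return the first (cheapest) model that satisfies both targets, or None."""
    for name, model_rpo, model_rto in MODELS:
        if model_rpo <= rpo_target and model_rto <= rto_target:
            return name
    return None


print(cheapest_model(timedelta(days=2), timedelta(days=2)))    # Backup and Restore
print(cheapest_model(timedelta(hours=1), timedelta(hours=1)))  # Pilot Light
print(cheapest_model(timedelta(seconds=1), timedelta(seconds=1)))  # None
```

A result of `None` means no model in the table meets the targets as stated, which is usually a prompt to revisit the targets or the budget.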