In this final section of our security architecture guide, we dive into resilience and disaster recovery. It covers high availability, disaster recovery planning, and the use of cloud-based tools to ensure business continuity.

High Availability
High availability is achieved by constructing an architecture that minimizes downtime through fault tolerance and redundancy across servers, routers, switches, and data centers.

- 99.999% uptime (“five nines”) permits approximately 5 minutes and 15 seconds of downtime per year.
- 99.99% uptime (“four nines”) allows roughly 52 minutes and 34 seconds of downtime annually.
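The downtime figures above follow directly from the availability percentage. The minimal Python sketch below reproduces them, assuming a 365-day year, and adds a helper that estimates how redundancy raises availability under the simplifying assumption that replicas fail independently. The function names are illustrative, not taken from any particular tool.

```python
# Assumes a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year permitted by an availability target in percent."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies when any single copy can serve.

    The service is down only if every replica is down at the same time.
    This assumes failures are independent; shared power, networks, or
    software bugs violate that assumption and lower the real figure.
    """
    return 1 - (1 - replica_availability) ** replicas


def format_minutes(minutes: float) -> str:
    """Render a duration in minutes as 'M min S s', rounded to the nearest second."""
    whole_minutes, seconds = divmod(round(minutes * 60), 60)
    return f"{whole_minutes} min {seconds} s"


# Prints 5 min 15 s for five nines, then 52 min 34 s for four nines.
for target in (99.999, 99.99):
    print(f"{target}% -> {format_minutes(downtime_minutes_per_year(target))}")

# Two independent replicas at 99.9% each give roughly 99.9999% combined.
print(f"{parallel_availability(0.999, 2):.6%}")
```

The redundancy helper is an upper bound in practice: correlated failures are exactly what geographically separate data centers and independent network paths are meant to reduce.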

When designing for high availability, consider the impact on the attack surface. Every redundant server, link, or site is another asset an attacker can target, so apply the same security controls that protect your primary systems to every redundant resource.
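One way to act on this advice is to compare the security settings of every redundant resource against the primary and flag any drift. The sketch below assumes each system's configuration has already been collected into a plain dictionary; the control names (`disk_encryption`, `mfa_required`, `tls_min_version`) and host names are hypothetical placeholders, not settings from a specific product.

```python
from typing import Dict, List

# Hypothetical representation: control name -> configured value.
Config = Dict[str, object]


def find_control_gaps(primary: Config, replicas: Dict[str, Config]) -> List[str]:
    """List every control on a redundant resource that differs from the primary."""
    gaps = []
    for name, config in sorted(replicas.items()):
        for control, expected in sorted(primary.items()):
            actual = config.get(control)  # None if the control is missing entirely
            if actual != expected:
                gaps.append(f"{name}: {control} is {actual!r}, expected {expected!r}")
    return gaps


# Illustrative data only.
primary = {"disk_encryption": True, "mfa_required": True, "tls_min_version": "1.2"}
replicas = {
    "dr-site-db": {"disk_encryption": True, "mfa_required": True, "tls_min_version": "1.2"},
    "standby-web": {"disk_encryption": True, "tls_min_version": "1.0"},
}

for gap in find_control_gaps(primary, replicas):
    print(gap)
```

Only controls defined on the primary are checked, so a replica cannot pass by simply omitting a setting: a missing control is reported as `None`.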


Scalability, Fault Tolerance, and Redundancy
Scalability denotes a system's capability to adjust its resources to match fluctuating demand. For instance, during peak times when customer requests surge, the system should seamlessly allocate additional resources, and it should release them again when demand falls. This ability to expand and contract automatically is commonly referred to as elasticity.
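Elasticity is usually implemented as a feedback rule that sizes the fleet from a utilization metric. The Python sketch below uses a proportional rule similar to the one documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current * current metric / target metric)). The percentage inputs, the floor of two instances (kept so redundancy survives scale-in), and the ceiling of twenty are assumptions chosen for illustration.

```python
import math


def desired_instances(current_instances: int, current_utilization_pct: float,
                      target_utilization_pct: float, min_instances: int = 2,
                      max_instances: int = 20) -> int:
    """Scale the fleet so average utilization moves toward the target.

    The result is clamped to [min_instances, max_instances]; the floor keeps
    at least two instances running so a single failure is survivable.
    """
    if current_instances < 1 or target_utilization_pct <= 0:
        raise ValueError("need at least one instance and a positive target")
    desired = math.ceil(current_instances * current_utilization_pct / target_utilization_pct)
    return max(min_instances, min(max_instances, desired))


print(desired_instances(4, 90, 60))   # peak load: scale out from 4 to 6
print(desired_instances(6, 10, 60))   # quiet period: scale in, but not below 2
print(desired_instances(15, 95, 60))  # extreme surge: capped at 20
```

Production autoscalers typically add a tolerance band and a stabilization delay so that small metric fluctuations do not make the fleet flap between sizes.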

Disaster Recovery Sites
Disaster recovery planning involves identifying and preparing backup sites that can take over operations in the event of a major incident. There are three primary types of disaster recovery sites:

- Cold Site: A location with no pre-installed equipment. All necessary components must be transported and configured during an emergency, so this option typically has the longest recovery time.
- Warm Site: A site with some pre-installed equipment that can be activated when a disaster strikes, offering faster recovery than a cold site.
- Hot Site: A fully operational site that runs concurrently with the primary site. In a crisis, either site can take over the full workload with little or no delay.
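The trade-off between the three site types can be expressed as a simple selection rule: use the least pre-provisioned site whose expected recovery time still fits the outage the business can tolerate. In the sketch below, the recovery-time estimates are hypothetical inputs the organization would supply, and treating less pre-provisioned equipment as the cheaper option is an assumption; actual costs depend on the provider and contract.

```python
from enum import Enum
from typing import Dict, Optional


class SiteType(Enum):
    # Ordered from least to most pre-provisioned equipment.
    COLD = 1
    WARM = 2
    HOT = 3


def choose_site(max_recovery_hours: float,
                estimated_recovery_hours: Dict[SiteType, float]) -> Optional[SiteType]:
    """Return the least-provisioned site type whose estimated recovery time
    fits within the tolerable outage, or None if no option qualifies."""
    for site in sorted(estimated_recovery_hours, key=lambda s: s.value):
        if estimated_recovery_hours[site] <= max_recovery_hours:
            return site
    return None


# Hypothetical estimates; replace them with values measured for your environment.
estimates = {SiteType.COLD: 72.0, SiteType.WARM: 12.0, SiteType.HOT: 0.25}

print(choose_site(24, estimates))   # a one-day tolerance is met by a warm site
print(choose_site(1, estimates))    # a one-hour tolerance requires a hot site
print(choose_site(0.1, estimates))  # no site type meets a six-minute tolerance
```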
