KodeKloud Notes

Welcome to this lesson on chaos engineering with AWS Fault Injection Simulator (FIS). I’m Nasia Ullas, and I’ll guide you through designing, executing, and analyzing fault-injection experiments to strengthen your system’s resilience.

As modern architectures grow in complexity, unexpected failures can lead to significant downtime costs:

44% of organizations report that 1 hour of downtime costs between $1 million and $5 million.
In 2021, Facebook lost more than $80 million during seven hours of downtime.
A recent “blue screen of death” outage impacted airlines, banks, healthcare providers, and countless other businesses worldwide.

The image shows a Windows blue screen error message indicating that the PC ran into a problem and needs to restart, with a progress indicator at 5% complete.

Chaos engineering is the practice of intentionally injecting faults into a system to uncover weaknesses and validate its ability to withstand real-world disruptions. In this course, we’ll leverage AWS Fault Injection Simulator (FIS) to conduct controlled experiments in your AWS environment.

Course Outline

We’ll cover seven high-level modules, each focusing on different AWS services and fault types:

Module 1: Basic FIS Experiments
Configure IAM, create experiment templates, execute tests, and monitor results with dashboards.
Module 2: Sample Application & Steady-State Metrics
Deploy a reference application and define baseline performance metrics.
Module 3: Disk Fill Scenario on EC2
Simulate disk saturation on EC2 instances and analyze its impact on application behavior.
Module 4: Aurora Reader Reboot
Inject a reboot fault into an Aurora reader node and observe recovery processes.
Module 5: Fargate Load Stress Test
Apply CPU and memory stress to a serverless Fargate task and evaluate performance under high load.
Module 6: EKS Memory Stress & Pod Deletion
Perform memory saturation tests and pod-deletion experiments in your EKS cluster to validate self-healing.
Module 7: Availability Zone Power Interruption
Simulate a power outage in an entire availability zone to assess multi-AZ resilience.
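To make the outline more concrete, here is a minimal sketch of what an experiment template for the Module 4 scenario (Aurora reader reboot) might look like, expressed as a plain Python dictionary. The function name, the target and action labels (readerInstance, rebootReader), and all ARNs are hypothetical placeholders, not values from the course. The field layout follows the general shape of an FIS experiment template (targets, actions, stop conditions, and an IAM role); check the FIS documentation before relying on any specific field.

```python
import json


def build_reader_reboot_template(role_arn: str, db_instance_arn: str, alarm_arn: str) -> dict:
    """Return a dict shaped like an FIS experiment template (illustrative only)."""
    return {
        "description": "Reboot one Aurora reader and observe recovery",
        # IAM role that FIS assumes to run the experiment (placeholder).
        "roleArn": role_arn,
        "targets": {
            # "readerInstance" is a label chosen for this sketch.
            "readerInstance": {
                "resourceType": "aws:rds:db",
                "resourceArns": [db_instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            # "rebootReader" is a label chosen for this sketch.
            "rebootReader": {
                "actionId": "aws:rds:reboot-db-instances",
                # Maps the action's target slot to the target label defined above.
                "targets": {"DBInstances": "readerInstance"},
            }
        },
        # Stop the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


# Placeholder ARNs; replace with real resources in your own account.
template = build_reader_reboot_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",
    db_instance_arn="arn:aws:rds:us-east-1:111122223333:db:example-reader",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-error-rate",
)
print(json.dumps(template, indent=2))
```

The sketch only builds and prints the template; it makes no AWS calls. In practice, a dictionary like this could be passed as keyword arguments to the boto3 FIS client's create_experiment_template call, or the same structure could be entered in the console. The stop condition is the safety guardrail: tying it to an alarm on a steady-state metric lets the experiment halt when impact exceeds what you planned for.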

Conclusion

By the end of this course, you’ll have a solid understanding of how to:

Design robust failure scenarios for cloud applications.
Execute controlled experiments safely.
Analyze results to strengthen your system’s resilience.

Let’s get started and build more reliable, fault-tolerant architectures with AWS FIS!
