What is Chaos Engineering
Chaos engineering is the discipline of running controlled experiments to understand how systems behave under failure conditions. By intentionally injecting faults, teams can uncover hidden weaknesses and improve resilience. This lesson walks through the five fundamental steps of chaos engineering, illustrated with diagrams and real-world examples.
The Five Key Steps
Collect Metrics
Establish baseline measurements that represent your system’s normal (steady state) behavior.
Form a Hypothesis
Predict how the system will react when a specific fault is introduced, based on your steady state.
Design the Experiment
Define the smallest, most targeted test that can validate or refute your hypothesis.
Inject Failure
Execute the experiment by introducing the planned disruption.
Warning
Always run chaos experiments in a safe, isolated environment and ensure you have monitoring and rollback plans in place.
Measure Impact
Compare post-failure metrics against your baseline to determine whether the hypothesis holds. Use findings to enhance system robustness.
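As a rough illustration of how the first and last steps fit together, the sketch below summarizes baseline samples into a steady-state value and then checks post-failure measurements against a stated hypothesis. The `Hypothesis` class, the function names, the 10% tolerance, and the latency figures are all hypothetical and chosen only for this example; a real experiment would pull metrics from your monitoring system.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Hypothesis:
    """A prediction about one metric, relative to the steady-state baseline."""
    metric: str
    max_relative_change: float  # 0.10 means "at most 10% worse than baseline"


def steady_state(samples: list[float]) -> float:
    """Collect Metrics: summarize baseline samples as one steady-state value."""
    return mean(samples)


def hypothesis_holds(h: Hypothesis, baseline: float, observed: list[float]) -> bool:
    """Measure Impact: compare post-failure measurements against the baseline."""
    allowed = baseline * (1 + h.max_relative_change)
    return mean(observed) <= allowed


# Hypothetical latency samples in milliseconds (illustrative values only).
baseline_ms = steady_state([100.0, 104.0, 96.0, 100.0])
h = Hypothesis(metric="latency_ms", max_relative_change=0.10)

print(hypothesis_holds(h, baseline_ms, [105.0, 108.0, 107.0]))  # True
print(hypothesis_holds(h, baseline_ms, [150.0, 170.0, 160.0]))  # False
```

Writing the tolerance down before injecting the fault keeps the comparison honest: the experiment either stays within the predicted bound or it does not.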
Analogy: States of Water
To make these concepts concrete, consider how water changes state with temperature:
- Given: Water exists as vapor, liquid, or solid depending on temperature.
- Hypothesis: Placing liquid water in a freezer for 10 minutes will cause it to freeze.
- Experiment: We put a container of water in the freezer for 10 minutes.
- Result: After 10 minutes, the water remains liquid because the freezer’s temperature was higher than expected.
- Refinement: We lower the freezer temperature and repeat the test. The water freezes within 10 minutes, validating our updated hypothesis.
Note
Refining your experiment parameters is key to isolating root causes and achieving reliable results.
Technical Example: Auto Scaling Group
Next, let’s apply the five steps to a cloud infrastructure scenario:
- Given: An application runs on a single EC2 instance within an Auto Scaling group (ASG) that maintains a minimum of one instance.
- Hypothesis: Terminating the instance won’t affect availability because the ASG will launch a replacement immediately.
- Inject Failure: We terminate the running instance.
- Observation: The ASG launches a replacement, but the new instance takes 15 minutes to boot, resulting in unexpected downtime.
- Refinement: Increase the ASG’s desired capacity to two instances so that one remains available while a replacement boots.
Rerunning the experiment confirms zero downtime, validating our updated hypothesis and architecture.
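The reasoning behind this refinement can be checked with a small simulation. This is only a toy model, not a call to any AWS API: the `downtime_minutes` function and its parameters are hypothetical, and it assumes the group launches a replacement the moment an instance is terminated and that the replacement serves traffic only once its boot time has elapsed.

```python
def downtime_minutes(instance_count: int, boot_minutes: int, window_minutes: int = 60) -> int:
    """Terminate one instance at minute 0 and count minutes with no healthy instance.

    Assumes a replacement launches immediately and becomes healthy
    only after boot_minutes have passed.
    """
    if instance_count < 1:
        raise ValueError("the group needs at least one instance")
    survivors = instance_count - 1  # instances still running after the termination
    downtime = 0
    for minute in range(window_minutes):
        replacement_ready = minute >= boot_minutes
        healthy = survivors + (1 if replacement_ready else 0)
        if healthy == 0:
            downtime += 1
    return downtime


# One instance: the application is down for the whole boot period.
print(downtime_minutes(instance_count=1, boot_minutes=15))  # 15
# Two instances: the survivor keeps serving while the replacement boots.
print(downtime_minutes(instance_count=2, boot_minutes=15))  # 0
```

The model ignores load: it treats a single surviving instance as sufficient, which a real experiment would also need to verify.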
Next Steps
You’ve now seen how chaos engineering uncovers hidden weaknesses and drives iterative improvements. In the upcoming sections, we will explore how to implement these experiments using AWS Fault Injection Simulator (FIS) to automate fault injection and monitoring in your cloud environment.