How to Plan Your Experiment Part 2

In Part 1, you defined objectives, selected workloads, and established a performance baseline. Part 2 covers the remaining steps (hypothesis creation, experiment design, execution, and analysis) so you can confidently run your game day or AWS Fault Injection Service (FIS) experiment.

6. Create Your Hypothesis

A well-defined hypothesis clarifies what you expect to happen when a fault is injected. To formulate it:

Identify the affected components
Pinpoint services, instances, or containers targeted by your fault injection.
Describe the expected behavior
Determine how your application should respond under fault conditions.
Define success metrics
Choose key indicators—latency, error rate, throughput—to validate resilience.

Note

A precise hypothesis narrows your experiment’s scope and sets clear success criteria.
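One way to keep a hypothesis precise is to write it down as data with explicit thresholds. The sketch below is only an illustration: the `Hypothesis` class, its field names, the `checkout-service` target, and every threshold value are hypothetical and should be replaced with figures from your own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A fault-injection hypothesis with measurable success criteria."""

    targets: tuple[str, ...]      # affected components
    expected_behavior: str        # how the application should respond
    max_p95_latency_ms: float     # success metric: latency ceiling
    max_error_rate: float         # success metric: fraction of failed requests
    min_throughput_rps: float     # success metric: requests-per-second floor

    def holds(self, p95_latency_ms: float, error_rate: float,
              throughput_rps: float) -> bool:
        """Return True when every observed metric stays within its threshold."""
        return (
            p95_latency_ms <= self.max_p95_latency_ms
            and error_rate <= self.max_error_rate
            and throughput_rps >= self.min_throughput_rps
        )


# Illustrative values only (assumptions, not recommendations).
hypothesis = Hypothesis(
    targets=("checkout-service",),
    expected_behavior="Traffic shifts to healthy instances when one instance stops.",
    max_p95_latency_ms=500.0,
    max_error_rate=0.01,
    min_throughput_rps=100.0,
)
```

After the experiment, calling `hypothesis.holds(...)` with the observed metrics gives a yes/no answer that can be recorded alongside the run.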

7. Design the Experiment

Use AWS FIS to control scope, duration, and safety checks. Configure the following:

Configuration	Description	Example
Target Resources	Apply tags to focus your fault injection on specific AWS resources.	Tag EC2 instances with `env=staging`.
Duration	Specify how long the fault remains active before auto rollback.	`PT5M` (5 minutes)
Stop Conditions	Define CloudWatch alarms that abort the experiment when they enter the alarm state.	Alarm on CPU > 80% for 2 minutes

These settings help you limit blast radius and maintain control throughout your test.
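The three settings in the table can be expressed together as an experiment template. The following is a minimal sketch, assuming the fault is an EC2 instance stop (`aws:ec2:stop-instances`); the function name, the target and action labels (`stagingInstances`, `stopInstances`), the `COUNT(1)` selection mode, and both ARNs are placeholders chosen for illustration.

```python
def build_experiment_template(role_arn: str, alarm_arn: str) -> dict:
    """Build an FIS experiment template as a plain dictionary.

    role_arn:  IAM role that FIS assumes to run the experiment (placeholder).
    alarm_arn: CloudWatch alarm used as the stop condition (placeholder).
    """
    return {
        "description": "Stop one staging EC2 instance for 5 minutes",
        "roleArn": role_arn,
        # Target Resources: select instances by tag to limit blast radius.
        "targets": {
            "stagingInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"env": "staging"},
                "selectionMode": "COUNT(1)",  # assumption: only one instance
            }
        },
        # Duration: instances are restarted after the ISO 8601 duration.
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "stagingInstances"},
            }
        },
        # Stop Conditions: abort if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


# Placeholder ARNs for illustration only.
template = build_experiment_template(
    role_arn="arn:aws:iam::123456789012:role/fis-experiment-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:staging-high-cpu",
)
```

With boto3 installed and credentials configured, a dictionary of this shape could be passed to the `fis` client as `create_experiment_template(**template)`. The "CPU > 80% for 2 minutes" threshold itself would live in the CloudWatch alarm definition, not in the template.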

8. Run the Experiment

Start in lower environments
Validate your hypothesis in development or staging before touching production.
Note
Always begin in a non-production account or VPC to avoid unintended impact.
Validate resilience
Monitor your application as the fault is injected. Check dashboards and alerts to ensure behavior aligns with your hypothesis.
Promote to production
Once confirmed, rerun the experiment against production workloads with the same configuration.
Mark success
A successful run demonstrates that your architecture can withstand the injected fault without violating SLAs.
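While the fault is active, it helps to watch the experiment's own status alongside your dashboards. The sketch below polls until the experiment reaches a terminal status. It assumes a client object exposing `get_experiment(id=...)` with the response shape used by the boto3 `fis` client; the helper name, the polling interval, the timeout, and the set of terminal statuses are assumptions for this example.

```python
import time

# Statuses treated as terminal in this sketch (an assumption to verify
# against the FIS documentation for your SDK version).
TERMINAL_STATUSES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(fis_client, experiment_id: str,
                        poll_seconds: float = 15.0,
                        timeout_seconds: float = 900.0,
                        sleep=time.sleep) -> str:
    """Poll an experiment until it finishes and return its final status.

    Raises TimeoutError if no terminal status is seen within timeout_seconds.
    """
    waited = 0.0
    while True:
        response = fis_client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"experiment {experiment_id} still {status} after {waited:.0f}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

A final status of `completed` means the actions ran to the end; `stopped` typically indicates that a stop condition or a manual stop ended the run early, which is itself a finding worth reviewing before promoting the experiment to production.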

9. Conduct a Post-Mortem

A structured post-mortem transforms insights into improvements:

Step	Action
Analyze Impact	Review logs, metrics, traces, and user experience during the experiment.
Blameless Review	Host a session focused on learning, not finger-pointing.
Document Findings	Update runbooks, architecture diagrams, and automation scripts based on lessons learned.
CI/CD Integration	Automate FIS experiments in your CI/CD pipeline to continuously validate resilience.
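For the Analyze Impact step, a simple comparison of request metrics from the baseline window and the experiment window can anchor the review in numbers. This is a minimal sketch under stated assumptions: requests are represented as hypothetical `(latency_ms, succeeded)` tuples exported from your own logs, and the 95th percentile uses the nearest-rank method.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(requests):
    """Summarize a list of (latency_ms, succeeded) tuples."""
    latencies = [latency for latency, _ in requests]
    failures = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "p95_latency_ms": p95(latencies),
        "error_rate": failures / len(requests),
    }


def impact(baseline_requests, experiment_requests):
    """Return experiment-minus-baseline deltas for each summary metric."""
    baseline = summarize(baseline_requests)
    experiment = summarize(experiment_requests)
    return {name: experiment[name] - baseline[name] for name in baseline}
```

Positive deltas show how much latency and error rate rose while the fault was active. The same function could run as a pipeline step after an automated experiment, failing the build when a delta exceeds the threshold set in your hypothesis.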

Warning

Maintain a blameless culture in your post-mortems to encourage transparent learning and innovation.

The image outlines an eight-step process for planning an experiment, including defining objectives, choosing workloads, and conducting a postmortem. It also highlights the importance of analyzing impact and addressing issues.
