Guide to running an AWS Fault Injection Service (FIS) experiment that triggers an RDS failover while monitoring CloudWatch and X-Ray to validate Multi-AZ resilience and recovery
Before injecting faults, first establish the steady state so you can compare behavior during and after the experiment. Start a browser-based load test that issues requests to the pet site and let it run for about 10 minutes. The test writes logs to the console; a representative excerpt looks like this:
time="2024-08-18T08:24:56Z" level=error msg="Uncaught (in promise) navigating frame to \"http:///\": Cannot navigate to invalid URL (-3)" executor=constant-vus scenario=browser
running (00m13.0s), 4/4 VUs, 93 complete and 0 interrupted iterations
browser   [   1% ] 4 VUs  00m13.0s/16m39s
time="2024-08-18T08:24:58Z" level=error msg="Uncaught (in promise) navigating frame to \"http:///\": Cannot navigate to invalid URL (-3)" executor=constant-vus scenario=browser
...
After roughly 10 minutes, the run summary shows the test making steady progress:
time="2024-08-18T08:35:36Z" level=error msg="Uncaught (in promise) navigating frame to \"http:///\": Cannot navigate to invalid URL (-3)" executor=constant-vus scenario=browser
running (10m52.0s), 4/4 VUs, 4643 complete and 0 interrupted iterations
browser   [  65% ] 4 VUs  10m52.0s/16m39s
...
running (10m53.0s), 4/4 VUs, 4651 complete and 0 interrupted iterations
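To quantify the steady state, the `running` progress lines can be parsed into an iteration rate. A minimal sketch (the log format is inferred from the excerpt above; `iteration_rate` is a hypothetical helper, not part of k6):

```python
import re

# Matches k6 progress lines such as:
#   running (10m52.0s), 4/4 VUs, 4643 complete and 0 interrupted iterations
RUNNING_RE = re.compile(
    r"running \((?:(\d+)h)?(\d+)m([\d.]+)s\), \d+/\d+ VUs, (\d+) complete"
)

def iteration_rate(log_lines):
    """Completed iterations per minute, based on the last progress line seen."""
    last = None
    for line in log_lines:
        match = RUNNING_RE.search(line)
        if match:
            last = match
    if last is None:
        return None
    hours, minutes, seconds, done = last.groups()
    elapsed_min = int(hours or 0) * 60 + int(minutes) + float(seconds) / 60
    return int(done) / elapsed_min

rate = iteration_rate([
    "running (10m52.0s), 4/4 VUs, 4643 complete and 0 interrupted iterations",
    "running (10m53.0s), 4/4 VUs, 4651 complete and 0 interrupted iterations",
])
print(round(rate))  # 427 iterations per minute at the 10m53s mark
```

Recording this number alongside the X-Ray and CloudWatch baselines gives a single load-side figure to compare against during and after the failover.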
With the load running and steady, inspect application telemetry and capture steady-state values for the key observability signals:
X-Ray trace map (PGSQL query): latency under 1 ms and ~139 requests/min (10-minute window).
Custom CloudWatch dashboard (AZ impairment dashboard): the 15-minute view shows zero UnHealthyHostCount and all DB connections pointing to the writer node.
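These baseline values can also be captured programmatically with CloudWatch's `GetMetricStatistics` API. A sketch using boto3 (assumes configured credentials; the target-group dimension value is a placeholder for your own stack):

```python
from datetime import datetime, timedelta, timezone

def steady_state_sample(cloudwatch, namespace, metric, dimensions, minutes=15):
    """Fetch per-minute averages of a CloudWatch metric over the last
    `minutes` minutes. `cloudwatch` is a boto3 CloudWatch client (or any
    object exposing the same get_metric_statistics interface)."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to be ordered; sort by timestamp.
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return [p["Average"] for p in points]

# Example (placeholder dimension value):
# import boto3
# cw = boto3.client("cloudwatch")
# baseline = steady_state_sample(
#     cw, "AWS/ApplicationELB", "UnHealthyHostCount",
#     [{"Name": "TargetGroup", "Value": "targetgroup/my-tg/0123456789abcdef"}],
# )
```

Saving the returned list before the experiment gives a concrete baseline to diff against the same query run during and after the failover.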
With steady-state metrics captured, run the AZ power-interruption experiment using AWS Fault Injection Service (FIS). From the FIS experiment template, generate a preview of targets. The preview lists the actual resources the experiment may affect; use it to confirm the scope and to ensure you are prepared for the potential impact before starting the experiment.
The preview includes target details such as the custom role, auto scaling groups, EC2 instances, RDS cluster, and subnets:
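The same target information can be inspected from code via the template's target definitions, returned by boto3's `fis.get_experiment_template`. A sketch; `summarize_targets` is a hypothetical helper operating on the `experimentTemplate` dict:

```python
def summarize_targets(template):
    """Summarize the targets section of an FIS experiment template.

    `template` is the 'experimentTemplate' dict as returned by
    boto3's fis.get_experiment_template.
    """
    rows = []
    for name, target in template.get("targets", {}).items():
        # Targets are selected either by explicit ARNs or by resource tags.
        selector = target.get("resourceArns") or target.get("resourceTags") or "unspecified"
        rows.append((name, target["resourceType"], target.get("selectionMode", ""), selector))
    return rows

# Example (template ID is a placeholder):
# import boto3
# fis = boto3.client("fis")
# tpl = fis.get_experiment_template(id="EXTabc123example")["experimentTemplate"]
# for name, rtype, mode, selector in summarize_targets(tpl):
#     print(f"{name}: {rtype} ({mode}) -> {selector}")
```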
Start the experiment. The FIS action summary lists each planned step and its status (for example: Failover-RDS, Pause-ASG-Scaling, and Pause-ElastiCache):
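The same status view is available programmatically by polling `fis.get_experiment`. A sketch (assumes boto3 credentials; `wait_for_experiment` is a hypothetical helper, and the state names are FIS's documented experiment states):

```python
import time

RUNNING_STATES = ("pending", "initiating", "running", "stopping")

def wait_for_experiment(fis, experiment_id, poll_seconds=15):
    """Poll an FIS experiment until it reaches a terminal state
    (completed, stopped, or failed), printing each action's status."""
    while True:
        exp = fis.get_experiment(id=experiment_id)["experiment"]
        for name, action in exp["actions"].items():
            print(f"{name}: {action['state']['status']}")
        if exp["state"]["status"] not in RUNNING_STATES:
            return exp["state"]["status"]
        time.sleep(poll_seconds)

# Example (template ID is a placeholder):
# import boto3
# fis = boto3.client("fis")
# exp_id = fis.start_experiment(experimentTemplateId="EXTabc123example")["experiment"]["id"]
# print(wait_for_experiment(fis, exp_id))
```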
Some templates include an ElastiCache pause step. If your environment does not use ElastiCache, FIS skips that action and continues. Verify the preview to confirm resource availability before execution.
As the experiment runs, RDS failover is triggered: writer and reader roles begin to swap. Re-check the same metrics you used for steady state during the disruption window:
X-Ray trace map (10-minute window): you should observe a drop in requests during the failover window.
CloudWatch AZ impairment dashboard (10-minute view): you will likely see RDS connection shifts and a transient increase in UnHealthyHostCount.
Monitor CloudWatch widgets while the experiment runs:
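The transient UnHealthyHostCount spike can also be bounded from the raw datapoints rather than read off a widget. A small sketch; `disruption_windows` is a hypothetical helper over (timestamp, value) samples:

```python
def disruption_windows(timestamps, values, threshold=0):
    """Return (start, end) pairs where a metric exceeds `threshold`,
    e.g. the interval where UnHealthyHostCount rises above zero."""
    windows, start = [], None
    for ts, value in zip(timestamps, values):
        if value > threshold and start is None:
            start = ts                    # disruption begins
        elif value <= threshold and start is not None:
            windows.append((start, ts))   # disruption ends
            start = None
    if start is not None:                 # still unhealthy at end of series
        windows.append((start, timestamps[-1]))
    return windows

# Minute-resolution samples around the failover (illustrative values):
print(disruption_windows([0, 1, 2, 3, 4, 5], [0, 0, 2, 3, 0, 0]))  # [(2, 4)]
```

Feeding this the UnHealthyHostCount datapoints from CloudWatch turns "a brief spike" into a measured disruption duration you can track across experiment runs.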
Despite the telemetry changes, the website can remain functional for end users (for example, completing UI flows like adopting pets) thanks to a resilient Multi-AZ architecture and correct failover behavior. Controlled fault-injection tests are a repeatable alternative to an ad-hoc disaster recovery drill and can reveal configuration issues such as AZ-specific dependencies. After the experiment finishes, verify recovery across the observability tools:
X-Ray trace map returns to normal request volume and latency.
RDS cluster shows that roles have been swapped and the cluster is healthy.
CloudWatch metrics return to steady-state levels.
Check trace-level details to confirm request patterns recovered:
RDS console confirms instance roles and health:
The Database Connections chart clearly shows the role swap: the writer node’s connections drop while the reader node’s connections increase briefly, then both normalize:
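The crossover visible in the chart can also be detected from the two per-instance connection series. A sketch with illustrative numbers (`role_swap_index` is a hypothetical helper):

```python
def role_swap_index(writer_conns, reader_conns):
    """Return the first sample index where the old reader carries more
    connections than the old writer -- the visible signature of a failover."""
    for i, (w, r) in enumerate(zip(writer_conns, reader_conns)):
        if r > w:
            return i
    return None  # no swap observed in the window

# Connections per minute per instance (made-up values around the failover):
writer = [20, 21, 19, 3, 1, 0, 0]
reader = [0, 0, 1, 15, 22, 21, 20]
print(role_swap_index(writer, reader))  # 3 -- the minute the roles swapped
```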
Key metrics and checks to capture (before, during, after)

| Metric / Signal | Why it matters | Where to check |
| --- | --- | --- |
| Request rate & error rate | Detect user impact and recovery | Load test logs, CloudWatch ALB metrics |
| DB request latency & counts | Detect DB performance and role changes | X-Ray trace map, CloudWatch RDS metrics |
| RDS connections per instance | Confirm failover and connection redistribution | CloudWatch RDS Database Connections chart |
| UnHealthyHostCount & ALB 5XX | Identify transient backend failures | CloudWatch AZ impairment dashboard |
| Resource targets | Ensure correct scope and mitigate impact | FIS target preview (ARNs and resource list) |
Summary
Captured a steady-state baseline (low latency, stable request rates).
Previewed and validated FIS experiment targets before running the fault-injection template.
The experiment caused a short RDS failover and temporary telemetry perturbations (reduced requests, brief UnHealthyHostCount spike), but the site remained functional thanks to multi-AZ resilience.
All metrics returned to normal after the experiment, confirming automatic failover and recovery.