Demo General Experiment Setup AZ

In this tutorial, you’ll prepare an AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) experiment that simulates an Availability Zone (AZ) power outage. We’ll walk through reviewing the trust policy, creating the IAM role, reviewing the task policy, and attaching it to the role so FIS can perform the experiment.

Prerequisites:

  • AWS CLI installed and configured
  • Permissions to create IAM roles and policies
  • A Linux or macOS terminal

1. Change to the AZ experiment directory

Navigate to the folder containing the experiment files:

cd ~/environment/workshopfiles/fis-workshop/az-experiment
ls -l

You should see:

File                                 Description
az-impairment-dashboard.yaml         CloudWatch dashboard for impairment metrics
disable-enable-cross-zone-alb.sh     Script to toggle cross-zone ALB settings
fis-az-experiment-policy.json        Trust policy for FIS role
fis-az-experiment-template.json      FIS experiment template
fis-az-task-policy.json              Task policy defining FIS permissions
restart-user-traffic.sh              Script to restart user traffic
shift-traffic-out-of-az-all-albs.sh  Shift traffic out of AZ for all ALBs
shift-traffic-out-of-az-one-alb.sh   Shift traffic out of AZ for one ALB
withzonalshift.sh                    Script to run full zonal-shift
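
If you would rather check the directory contents with a script than by eye, the following is a minimal sketch. The function name check_experiment_files is hypothetical and not part of the workshop; the file names are the ones listed above.

```shell
# Sketch: report which expected workshop files are missing from a directory.
# check_experiment_files is a hypothetical helper, not part of the workshop.
check_experiment_files() {
  local dir="${1:-.}"
  local missing=0
  local f
  for f in \
    az-impairment-dashboard.yaml \
    disable-enable-cross-zone-alb.sh \
    fis-az-experiment-policy.json \
    fis-az-experiment-template.json \
    fis-az-task-policy.json \
    restart-user-traffic.sh \
    shift-traffic-out-of-az-all-albs.sh \
    shift-traffic-out-of-az-one-alb.sh \
    withzonalshift.sh
  do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # Returns 0 when every file is present, 1 otherwise.
  return "$missing"
}
```

Run it as check_experiment_files ~/environment/workshopfiles/fis-workshop/az-experiment; it prints nothing and returns 0 when all nine files are present.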

2. Define the trust policy

Open fis-az-experiment-policy.json and verify it grants fis.amazonaws.com permission to assume the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": ["fis.amazonaws.com"] },
      "Action": "sts:AssumeRole"
    }
  ]
}

This trust policy ensures only AWS FIS can assume the IAM role.
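
Before creating the role, a quick local check can confirm the file contains the two strings this step depends on. This is only a sketch: check_trust_policy is a hypothetical helper, and it searches for text rather than parsing JSON, so it will not catch syntax errors.

```shell
# Sketch: sanity-check a trust policy file before passing it to create-role.
# check_trust_policy is a hypothetical helper; it only looks for the FIS
# service principal and the sts:AssumeRole action as literal strings.
check_trust_policy() {
  local file="$1"
  grep -qF '"fis.amazonaws.com"' "$file" || { echo "FIS principal not found"; return 1; }
  grep -qF '"sts:AssumeRole"' "$file" || { echo "sts:AssumeRole not found"; return 1; }
  echo "trust policy looks OK"
}
```

For example, check_trust_policy fis-az-experiment-policy.json should report that the policy looks OK for the file shown above.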

3. Create the IAM role

Run the following command to create the fis-az-role with the above trust policy:

aws iam create-role \
  --role-name fis-az-role \
  --assume-role-policy-document file://fis-az-experiment-policy.json

Expected response:

{
  "Role": {
    "Path": "/",
    "RoleName": "fis-az-role",
    "Arn": "arn:aws:iam::123456789012:role/fis-az-role",
    "CreateDate": "2024-08-05T05:24:20+00:00",
    "AssumeRolePolicyDocument": { /* truncated */ }
  }
}
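
Running create-role a second time fails because the role already exists. If you expect to repeat this step, one option is to check for the role first. The sketch below assumes the role and file names used in this tutorial; ensure_fis_role is a hypothetical wrapper around the aws iam get-role and aws iam create-role commands.

```shell
# Sketch: create the role only if it does not already exist.
# ensure_fis_role is a hypothetical wrapper, not part of the workshop.
# Note: get-role also fails on permission errors; in that case the
# create-role call below is attempted and reports its own error.
ensure_fis_role() {
  local role="${1:-fis-az-role}"
  local trust="${2:-fis-az-experiment-policy.json}"
  if aws iam get-role --role-name "$role" >/dev/null 2>&1; then
    echo "role $role already exists"
  else
    # Print only the ARN of the newly created role.
    aws iam create-role \
      --role-name "$role" \
      --assume-role-policy-document "file://$trust" \
      --query 'Role.Arn' --output text
  fi
}
```

Calling ensure_fis_role with no arguments uses fis-az-role and fis-az-experiment-policy.json from the current directory.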

4. Review the FIS task policy

Next, review fis-az-task-policy.json, which grants FIS permissions to perform AZ failover operations. Key sections include:

  1. Auto Scaling group

    {
      "Sid": "DescribeAsg",
      "Effect": "Allow",
      "Action": ["autoscaling:DescribeAutoScalingGroups"],
      "Resource": ["*"]
    }
    
  2. Network ACL management

    [
      {
        "Effect": "Allow",
        "Action": "ec2:CreateNetworkAcl",
        "Resource": "arn:aws:ec2:*:*:network-acl/*",
        "Condition": { "StringEquals": { "aws:RequestTag/managedByFIS": "true" } }
      },
      {
        "Effect": "Allow",
        "Action": ["ec2:CreateNetworkAclEntry","ec2:DeleteNetworkAcl"],
        "Resource": ["arn:aws:ec2:*:*:network-acl/*","arn:aws:ec2:*:*:vpc/*"],
        "Condition": { "StringEquals": { "ec2:ResourceTag/managedByFIS": "true" } }
      }
    ]
    
  3. RDS failover and reboot

    [
      {
        "Effect": "Allow",
        "Action": ["rds:FailoverDBCluster"],
        "Resource": ["arn:aws:rds:*:*:cluster:*"]
      },
      {
        "Effect": "Allow",
        "Action": ["rds:RebootDBInstance"],
        "Resource": ["arn:aws:rds:*:*:db:*"]
      }
    ]
    
  4. ElastiCache AZ power interruption

    {
      "Effect": "Allow",
      "Action": ["elasticache:DescribeReplicationGroups","elasticache:InterruptClusterAzPower"],
      "Resource": ["arn:aws:elasticache:*:*:replicationgroup:*"]
    }
    

Why these permissions?

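The excerpts shown here let FIS describe Auto Scaling groups, create network ACLs and add entries to or delete ACLs tagged managedByFIS, fail over RDS clusters, reboot RDS DB instances, and interrupt AZ power for ElastiCache replication groups. The tag conditions on the network ACL statements limit FIS to ACLs that it creates and tags itself, which keeps the experiment from touching your existing ACLs.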
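
To review the full set of actions in the policy file rather than only the excerpts, a pattern-based listing can help. This is a rough sketch: list_policy_actions is a hypothetical helper that matches any quoted string shaped like service:ActionName, so it does not parse JSON and does not distinguish Allow from Deny statements.

```shell
# Sketch: list the distinct IAM actions mentioned in a policy file.
# list_policy_actions is a hypothetical helper. It matches quoted strings of
# the form "service:ActionName"; ARNs and condition keys such as
# "aws:RequestTag/managedByFIS" do not fit that shape and are skipped.
list_policy_actions() {
  LC_ALL=C grep -oE '"[a-z0-9-]+:[A-Za-z0-9*]+"' "$1" | tr -d '"' | LC_ALL=C sort -u
}
```

For example, list_policy_actions fis-az-task-policy.json prints one action per line, which makes it easier to compare the policy against the actions your experiment template uses.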

5. Attach the task policy to the IAM role

Add fis-az-task-policy.json to fis-az-role as an inline policy so the role can execute the experiment actions:

aws iam put-role-policy \
  --role-name fis-az-role \
  --policy-name fis-az-task-policy \
  --policy-document file://fis-az-task-policy.json
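
put-role-policy prints nothing on success, and because it creates an inline policy, the policy appears under aws iam list-role-policies rather than aws iam list-attached-role-policies. A minimal sketch for confirming the result follows; role_has_inline_policy is a hypothetical helper name.

```shell
# Sketch: succeed only if the named inline policy exists on the role.
# role_has_inline_policy is a hypothetical helper around list-role-policies.
role_has_inline_policy() {
  local role="$1"
  local policy="$2"
  # With --output text, the policy names come back tab-separated on one line,
  # so split them onto separate lines and look for an exact match.
  aws iam list-role-policies --role-name "$role" \
    --query 'PolicyNames' --output text | tr '\t' '\n' | grep -qxF "$policy"
}
```

For example: role_has_inline_policy fis-az-role fis-az-task-policy && echo "policy is in place".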

With the IAM role and policies in place, you’re ready to launch the FIS experiment that simulates an AZ power interruption. Next, proceed to Implement the FIS Experiment.
