Explains Infrastructure as Code for SREs, covering tools, best practices, CI/CD, policy as code, state management, drift detection, and automating AWS IAM using Terraform.
Welcome back. This lesson covers Infrastructure as Code (IaC): the practice of defining and managing infrastructure using the same engineering workflows as application code, namely version control, code review, automated testing, and repeatable deployments. IaC transforms provisioning from a manual, error-prone activity into a reliable, auditable, and testable process.
In the past, provisioning meant SSHing into machines, editing configs by hand, restarting services, and hoping nothing broke. That produced infrastructure drift, inconsistent environments, and no reliable history of changes. IaC flips that model: you declare the desired state in files, keep them in Git, review changes via pull requests, preview their effects, and apply them automatically. The result: consistent environments, auditable change history, and the ability to test updates before they reach production.
Why SRE teams adopt IaC
No more snowflake servers — the same code builds identical systems.
Predictable disaster recovery: rebuild full environments from code.
Full, auditable change tracking — know who changed what and when.
Test infrastructure updates before they reach production to avoid surprises and late-night incidents.
Core IaC tools
| Tool | Purpose | Notes |
| --- | --- | --- |
| Terraform | Multi-cloud provisioning via declarative HCL | Widely used, supports modules and backends |
| AWS CloudFormation | AWS-native infrastructure as code | Deep AWS integration; YAML/JSON templates |
| Pulumi | Code-first IaC using familiar languages | Use TypeScript, Python, Go, etc. |
IaC best practices
Version control everything — if it’s not in Git, it’s not managed.
Use variables; avoid hard-coded values to make configs reusable.
Embrace modules and DRY principles to prevent copy-paste errors.
Treat infrastructure changes like app changes: require pull requests, preview plans, and run automated checks.
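As a minimal sketch of the "use variables" practice above, the following replaces hard-coded values with typed variables. The variable names, the log group path, and the choice of a CloudWatch log group are illustrative assumptions, not taken from the lesson's repository.

```hcl
# variables.tf (hypothetical names, for illustration only)
variable "environment" {
  description = "Deployment environment, used in names and tags"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}

variable "log_retention_days" {
  description = "How long to keep application logs"
  type        = number
  default     = 30
}

# main.tf: the resource reads its settings from variables instead of literals
resource "aws_cloudwatch_log_group" "app" {
  name              = "/app/${var.environment}"
  retention_in_days = var.log_retention_days

  tags = {
    Environment = var.environment
  }
}
```

The same configuration can then be reused per environment by supplying different variable values rather than copying and editing the resource.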
Always keep your infrastructure code as the single source of truth. Use pull requests and plan previews so reviewers can see intended changes before apply.
Example: keep all infrastructure files in Git (good) and avoid making manual changes directly in the cloud console (bad).
```bash
# Good: All infrastructure code in Git
git add main.tf variables.tf
git commit -m "Add production database with backup retention"

# Bad: Making manual changes through the AWS console
# (Those are not captured in version control)
```
Policy as code
Policy-as-code tools such as Open Policy Agent (OPA) and HashiCorp Sentinel let you encode guardrails (for example: “no public S3 buckets,” “EC2 instances must have a backup tag,” or “production changes require explicit approval”) and evaluate them automatically as part of CI.
Real-world example: automating AWS IAM user creation
Manually creating many IAM users is slow and error-prone — people forget MFA, skip tags, and introduce inconsistencies. Instead, define users in code and let Terraform create them consistently.
Define a simple list of usernames in terraform.tfvars:
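A minimal sketch of what that file and a matching module call might look like. The usernames are placeholders, and the module label `iam_users`, the `./modules/iam-user` source path, and the root-level variable declaration are assumptions based on the module shown later in this lesson.

```hcl
# --- terraform.tfvars (placeholder usernames) ---
iam_usernames = [
  "alice",
  "bob",
]

# --- main.tf in the root module (assumed layout) ---
variable "iam_usernames" {
  description = "IAM users to create"
  type        = list(string)
}

module "iam_users" {
  source = "./modules/iam-user" # assumed path to the module

  iam_usernames       = var.iam_usernames
  managed_policy_arns = ["arn:aws:iam::aws:policy/IAMUserChangePassword"]
}
```

Adding or removing a user then becomes a one-line change to `terraform.tfvars` that shows up in code review.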
Note: the managed policy IAMUserChangePassword permits users to change their own passwords. Enforcing MFA is a separate concern: you can require MFA via IAM conditions (aws:MultiFactorAuthPresent), AWS Organizations SCPs, or other account-level controls. Choose an MFA strategy that matches your organization’s security posture.
Module internals: example implementation
This module loops over usernames, creates users and access keys, attaches managed policies, and creates an inline policy if provided:
```hcl
# modules/iam-user/main.tf
resource "aws_iam_user" "new_users" {
  for_each = toset(var.iam_usernames)

  name = each.value
  path = "/"

  # If these users might be pre-existing and you don't want Terraform to modify them,
  # ignore lifecycle changes that would attempt updates.
  lifecycle {
    ignore_changes = all
  }
}

resource "aws_iam_access_key" "user_keys" {
  for_each = aws_iam_user.new_users

  user = each.value.name
}

resource "aws_iam_user_policy_attachment" "user_policy_attachments" {
  for_each = {
    for pair in setproduct(keys(aws_iam_user.new_users), var.managed_policy_arns) :
    "${pair[0]}-${pair[1]}" => {
      user       = pair[0]
      policy_arn = pair[1]
    }
  }

  user       = aws_iam_user.new_users[each.value.user].name
  policy_arn = each.value.policy_arn
}

resource "aws_iam_user_policy" "inline_policy" {
  for_each = var.inline_policy_document != null ? aws_iam_user.new_users : {}

  name   = "${each.value.name}-inline-policy"
  user   = each.value.name
  policy = var.inline_policy_document
}
```
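The module above reads three input variables. Their declarations are not shown in this lesson, so the following is an assumed sketch: the types and defaults are inferred from how the variables are used, and the `user_arns` output is a hypothetical addition for illustration.

```hcl
# modules/iam-user/variables.tf (assumed declarations)
variable "iam_usernames" {
  description = "Names of the IAM users to create"
  type        = list(string)
}

variable "managed_policy_arns" {
  description = "Managed policy ARNs attached to every user"
  type        = list(string)
  default     = []
}

variable "inline_policy_document" {
  description = "Optional JSON policy document applied inline to every user"
  type        = string
  default     = null # null skips creation of the inline policy
}

# modules/iam-user/outputs.tf (hypothetical output)
output "user_arns" {
  description = "Map of username to IAM user ARN"
  value       = { for name, user in aws_iam_user.new_users : name => user.arn }
}
```

Because `aws_iam_access_key` writes the generated secret into Terraform state, the state file should be treated as sensitive wherever it is stored.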
Working with a real repo and CI/CD
Typical workflow with a sample repo (kodekloud-records-terraform-infrastructure):
Fork and clone the repo.
Create a feature branch (for example: dev-test).
Edit variables (like terraform.tfvars) to add the IAM users you want.
Run terraform plan locally or rely on CI to preview changes.
Commit and push to trigger GitHub Actions which run Terraform.
Project layout example — you’ll typically see directories for environments, modules, and workflows.
Create and switch to a branch locally:
```bash
# From repository root
git checkout -b dev-test
```
Backend and state handling
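Remote state storage (for example, in S3) is common for Terraform. If you add or change the backend after your first apply, Terraform must be re-initialized. Note that -reconfigure discards the previously saved backend settings without migrating existing state; to copy existing state into the new backend, run terraform init -migrate-state instead. To reconfigure: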
```bash
terraform init -reconfigure
```
Example S3 backend and state bucket (abbreviated):
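The original example is not reproduced here, so this is a minimal sketch under stated assumptions: the bucket name, state key, region, and file names are all placeholders.

```hcl
# backend.tf: tell Terraform to keep its state in S3
terraform {
  backend "s3" {
    bucket = "example-terraform-state-bucket" # placeholder bucket name
    key    = "iam-users/terraform.tfstate"    # placeholder state path
    region = "us-east-1"                      # placeholder region
  }
}

# state-bucket.tf: the bucket itself, often managed in a separate bootstrap configuration
resource "aws_s3_bucket" "terraform_state" {
  bucket = "example-terraform-state-bucket"
}

# Versioning keeps earlier state files recoverable
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

Backend blocks cannot reference variables, which is why the values in `backend "s3"` are written as literals.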
If you pre-create the S3 bucket in the console for demos, the repo backend can point to that bucket. For production deployments, prefer creating and configuring state storage via IaC when possible.
CI/CD and secrets
When running Terraform from GitHub Actions, supply AWS credentials via repository secrets or use OIDC-based workflows. Example apply workflow that runs on pushes to selected branches:
Set repository secrets in GitHub: Settings → Secrets and variables → Actions.
Never store secrets in plaintext within the repository. Prefer repository secrets, environment-level secrets, or OIDC for short-lived credentials.
Triggering workflows
Push your branch to trigger apply workflows. Monitor runs in the Actions tab and inspect logs if anything fails.
Successful apply example
Cleanup workflows
Provide a safe destroy workflow that requires explicit confirmation (for example, typing the word “destroy”) to avoid accidental destruction of resources.
Infrastructure drift
Drift happens when live resources diverge from IaC configurations (for example, someone makes a one-off console change). Detect drift by running:
```bash
# Preview changes
terraform plan

# Detect configuration drift (exit codes indicate differences)
terraform plan -detailed-exitcode
echo "Exit code: $?"

# Exit codes:
# 0 → No changes
# 1 → Error
# 2 → Changes present (e.g., drift or a planned change)
```
When you detect drift, you can:
Accept and codify the manual change into IaC (update the code and commit).
Revert the manual change by applying the IaC.
Import the manual resource into Terraform state so it becomes managed:
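One way to do the import is a declarative `import` block, available in Terraform 1.5 and later. This is a sketch, not the lesson's original example: the module label `iam_users` and the username `alice` are placeholders.

```hcl
# imports.tf (must live in the root module; requires Terraform 1.5 or later)
import {
  # Address of the resource instance inside the module ("alice" is a placeholder)
  to = module.iam_users.aws_iam_user.new_users["alice"]

  # For aws_iam_user, the import ID is the user name
  id = "alice"
}
```

The username must also appear in the module's `iam_usernames` input so that the target instance exists in configuration. Running `terraform plan` then previews the import, and `terraform apply` records the user in state.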
Monitoring, alerts, and policy-as-code can help reduce manual modifications to live resources.
Example IAM policy JSON snippet used for monitoring permissions:
Wrap-up
Infrastructure as Code is a fundamental SRE practice. It reduces manual toil, enforces consistency, and brings infrastructure under the same engineering controls as application code. Keep your Git repository as the source of truth, add automated validation and policy checks, and monitor for drift so code and live infrastructure stay aligned.
Configuration management (managing software and system settings on provisioned infrastructure) is closely related and often complements IaC.