Configuration Management

Hello again and welcome back. In this lesson we’ll explore configuration management — why it matters, common failure modes, and practical ways to manage configuration safely and reliably. Code tends to get the spotlight, but many major outages originate from configuration mistakes. When treated correctly, configuration becomes just another form of code: reviewable, testable, and auditable. Why does configuration management matter? Configuration is often the silent reliability risk. Studies and postmortems show a large percentage of outages are caused by configuration errors. These issues are harder to debug, often skipped in automated tests, and can have immediate, widespread impact.

These examples are not edge cases — they show how a single misconfiguration can cascade into large-scale outages. Configuration changes are deceptively risky because they frequently skip code review and automated testing. Engineers may push config updates directly to production hoping “nothing breaks.” One bad setting can take down a service instantly; a misconfigured access control can even lock you out of the systems needed to restore service. For SREs, misconfigurations are a common and recurring incident source: time spent debugging config issues often exceeds time spent debugging application code. That’s why configuration deserves the same rigor as application code: version control, review, CI, testing, and traceability.

A presentation slide titled "Why Configuration Management Matters" that highlights "Why Configuration Is Dangerous." It shows three numbered points with icons: 01) Config often skips review, 02) Can break everything instantly, and 03) Can lock you out of fixes.

Common configuration problems follow a few repeatable patterns:

Problem	What happens	Typical consequence
Manual config drift	One-off fixes applied directly in production	Divergent server states, hard-to-reproduce bugs
Sensitive data exposure	Secrets committed or stored in plaintext	Leak of credentials or keys
Environment inconsistency	Different settings across dev/staging/prod	Bugs that only appear in production
Uncontrolled changes	Repeated tweaks by many engineers	No clear owner or source of truth for live state

Common pitfalls include:

Manual config drift: a 3 AM production patch that never propagates.
Sensitive data exposure: credentials accidentally committed or left in public files.
Environment inconsistency: dev/staging/prod mismatch causing unexpected behavior.
Uncontrolled changes: multiple engineers iterating until no one knows the canonical state.

Examples of risky config material:

# config.py
# (accidentally public!)
DATABASE_PASSWORD = "super_secret_123"
AWS_ACCESS_KEY = "AKIAIOSFODNN7EXAMPLE"

Environment inconsistency example (three environment documents):

# Dev
connections: 5
cache_gb: 1
debug: true
---
# Staging
connections: 10
cache_gb: 2
debug: false
---
# Production (incorrectly left with debug=true)
connections: 10
cache_gb: 2
debug: true

These pitfalls make systems fragile and often surface during incidents.

A slide titled "Common Pitfalls" that lists four warnings: Manual Config Drift, Sensitive Data Exposure, Environment Inconsistency, and Uncontrolled Changes. Below is a table showing daily config changes (Alice, Bob, Charlie) that overwrite timeouts and lead to a production incident.

Treat configuration as code: store configuration files, dashboards, and environment-specific variables in version control so every change is reviewed, tested, and auditable. Example repository layout (KodeKloud example):

kodekloud-records-store-web-app
  config/monitoring
    grafana-provisioning
      dashboards
        dashboard.yaml
        end-to-end-purchase-journey.json
        engineer-dashboard.json
        executive-dashboard.json
        kodekloud-records-store-slis.json
        kodekloud-records-store-slos.json
        observability-dashboard.json
        performance-metrics-dashboard.json
      datasources
    logging
    alert_rules.yml
    alertmanager.yml
    blackbox.yml
    prometheus.yml
    README.md
    sli_rules.yml

.github
config
deploy
  environments
  templates
    env.dev.template
    env.prod.template
    env.staging.template

With this layout, configuration changes travel through the same review and CI pipeline as code changes. Safe change management techniques

Infrastructure as Code (IaC): keep environment and infrastructure definitions in Git to enforce review and reproducibility.
Feature flags: ship code disabled, then toggle behavior at runtime for quick rollback and safe experimentation.
Gradual rollouts: expose new configuration to a small subset of users, monitor metrics, and expand progressively.
Version control for every config change: maintain provenance (who changed what and when) to enable rapid rollback.

# Example feature flag usage
if feature_flag("new_payment_processor"):
    return new_payment_flow()
else:
    return old_payment_flow()

A presentation slide titled "Safe Change Techniques" listing four practices: Infrastructure as Code, Feature Flags, Gradual Rollouts, and Configuration Version Control, with checkmarks beside the first two. A callout shows a rollout strategy (10% → Monitor → 50% → Monitor → 100%) and a note to rollback if metrics degrade.

Use staged environment promotion. Changes should flow from development to staging and then to production with guardrails at each stage: config validation, build and test in dev, container/image scanning, integration tests, and manual approvals when warranted.

Staged promotions ensure that by the time changes hit production, they have been validated in realistic environments and checked at multiple gates.

A simple CI pipeline that enforces environment promotion (GitHub Actions-style example):

jobs:
  validate-configuration:
    runs-on: ubuntu-latest

  build-and-test:
    runs-on: ubuntu-latest
    needs: validate-configuration

  build-container:
    runs-on: ubuntu-latest
    needs: [validate-configuration, build-and-test]

  deploy-staging:
    runs-on: ubuntu-latest
    needs: [build-container]
    environment: staging

  deploy-production:
    runs-on: ubuntu-latest
    needs: [deploy-staging]
    environment: production

Each job depends on earlier gates. Start with configuration validation, then build and test in dev. After successful builds and scans, deploy to staging; only after staging gates pass should the pipeline promote to production. Environment-specific configuration keeps deployments flexible and safe. For example, Docker Compose can load environment variables from files per environment so you can test in realistic conditions without mixing values across environments. docker-compose service example:

version: "3.9"
services:
  db:
    image: postgres:15
    container_name: kodekloud-record-store-db
    restart: always
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}

Store your .env templates under deploy/templates (as shown earlier) and select the appropriate environment file when launching Compose:

# Run with development variables
docker-compose --env-file .env.dev up

# Run with staging variables
docker-compose --env-file .env.staging up

# Run with production variables
docker-compose --env-file .env.prod up

Summary

Treat configuration as code: version, review, test, and audit.
Use feature flags and gradual rollouts to reduce blast radius.
Enforce staged promotions and pipeline gates for any change that affects production.
Keep environment-specific values isolated and never check secrets into source control.

Links and references

That wraps up this lesson on configuration management: why it matters, where it tends to go wrong, and practical patterns to manage configuration safely. A subsequent lesson will cover secure software releases.

Watch Video