Welcome to the Advanced Reliability Engineering module. This lesson expands on reliability fundamentals with a practical introduction to Chaos Engineering and the cost of reliability. It’s not an exhaustive catalog of tools or configurations; instead, it focuses on the concepts and practices that matter when operating at scale—how to find hidden failure modes and build repeatable, safe experiments that increase system confidence.
A presentation slide titled "Advanced Reliability Engineering" featuring two panels: "Chaos Engineering — Relevant at large-scale orgs" and "Cost of Reliability — Links engineering decisions to business impact." A caption below reads, "An expansion of fundamentals to broaden your SRE perspective, not daily details."

What is Chaos Engineering?

Traditionally, teams measure and fix systems after incidents occur. Chaos Engineering flips that process: it treats failures as hypotheses to test. Instead of hoping systems will behave under stress, you run controlled experiments—often in production-like environments—to discover weaknesses before customers notice. Large organizations (for example, Netflix and Amazon) use chaos experiments to gain confidence that systems will recover automatically during real outages.
A presentation slide titled "Chaos Engineering" showing a central blue circle and icons/text comparing "System Weaknesses" (failures impact user experience) on the left with "System Confidence" (systems handle outages effectively) on the right. It notes practices like intentionally introducing system failures to identify vulnerabilities before users are affected.
Chaos Engineering is not “breaking things for fun.” It’s a scientific practice: define a hypothesis about system behavior, inject a controlled failure, measure outcomes against success criteria, and iterate. When teams automate experiments and instrument systems well, incident response improves—teams learn which automated remediations and fallback behaviors actually work.
A presentation slide titled "Chaos Engineering" showing a laptop with a gear on its screen and an overlaid warning triangle. To the right are icons labeled "Worst Moments" and "Unexpected Failures," and the caption reads "Your systems WILL FAIL—the question is when."
The reality is simple: your system will fail. Failures often happen at high-impact moments—product launches, peak shopping periods, or off-hours—and the most damaging ones are unpredictable. Chaos Engineering provides a way to stress the system, reveal hidden weaknesses, and build mitigations before those chaotic moments arrive.
A slide titled "Chaos Engineering" that contrasts traditional testing (shown as a fire drill) with chaos engineering (shown as an actual building fire). It uses colored arrows and short text to say traditional testing verifies ideal conditions while chaos engineering validates behavior under real-world chaos.

Resilient testing vs. Chaos Engineering

Resilient testing and chaos engineering complement each other. Use resilient testing to verify expected behaviors, and use chaos engineering to discover unknown failure modes.
Practice | Goal | Typical Examples
Resilient testing | Verification of known failure modes | Kill a pod to verify failover; disconnect a DB to test fallback; simulate latency to check timeouts
Chaos Engineering | Discovery of unknown failures | Random network partitions under peak load; memory pressure causing DB slowdown; multiple small faults that cascade
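The verification/discovery distinction can be made concrete in code. Below is a minimal, self-contained sketch using a hypothetical in-memory `Service` class (a toy stand-in, not a real orchestration API): the resilience test kills one known pod and checks the designed failover, while the chaos experiment injects random faults to see what actually happens.

```python
import random

class Service:
    """Toy service with replicas; requests succeed while any replica is up."""
    def __init__(self, replicas: int):
        self.replicas = {f"pod-{i}": True for i in range(replicas)}

    def kill(self, pod: str) -> None:
        self.replicas[pod] = False

    def handle_request(self) -> bool:
        return any(self.replicas.values())

def resilience_test_failover(service: Service) -> bool:
    """Verification: kill one known pod, confirm the designed failover holds."""
    victim = next(iter(service.replicas))
    service.kill(victim)
    return service.handle_request()

def chaos_experiment(service: Service, faults: int) -> bool:
    """Discovery: inject several random faults and observe the outcome."""
    for pod in random.sample(list(service.replicas), k=faults):
        service.kill(pod)
    return service.handle_request()

# Resilience testing verifies the known single-failure case:
assert resilience_test_failover(Service(replicas=3))
```

The point of the sketch is the shape of the two functions: the test asserts a designed behavior, while the experiment only observes and records what happened under compound, randomized failure.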
A presentation slide titled "Resilience Testing vs Chaos" showing a "Resilience Testing" panel with three bullet points: "Kill a pod to verify failover works," "Disconnect database to check fallback," and "Simulate latency to test timeouts." The slide is © KodeKloud.
Mindset difference:
  • Resilient testing = verification: confirm designed behaviors under known conditions.
  • Chaos Engineering = discovery: introduce unexpected or compound failures to learn about previously unseen modes.
A presentation slide titled "Resilience Testing vs Chaos" highlighting "Chaos Engineering" with bullets: random network partitions during peak traffic, memory pressure with database slowdown, and multiple small failures cascading into big ones. The slide has a purple header and a © KodeKloud copyright notice.
Both approaches are necessary: verification proves your designs, chaos finds the gaps.
A presentation slide contrasting "Resilience Testing" (proves designs work) with "Chaos Engineering" (reveals hidden weaknesses). It notes that Netflix moved from resilience testing to chaos engineering after unpredictable outages and shows the Netflix logo.

How to run chaos experiments safely

A safe rollout follows a gradual progression from low-risk environments to production canaries:
Phase | Environment | Scope / Risk
Phase 1 — Development | Developer machines / local env | Affects only developers; lowest risk
Phase 2 — Staging | Staging environment that mirrors production | Safe replica for broader tests
Phase 3 — Production canaries | Small subset of production (e.g., 1% of servers/traffic) | Minimal user impact if something goes wrong
Phase 4 — Gradual expansion | Progressive increase after validating safety | Grow scope only after observability and rollback are proven
Start small, limit risk, and expand only after observing expected behavior.
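The phased rollout can be encoded as a simple state machine so that expansion is impossible without a healthy signal from the previous phase. This is a sketch under assumed phase names and traffic fractions (the 1% and 10% values mirror the canary guidance above; they are illustrative, not prescribed):

```python
# Phase name -> approximate fraction of real user traffic exposed.
PHASES = [
    ("development", 0.0),   # developer machines only, no user traffic
    ("staging", 0.0),       # production replica, still no real users
    ("canary", 0.01),       # roughly 1% of production traffic
    ("expanded", 0.10),     # grow only after canary results look healthy
]

def next_phase(current: str, previous_phase_healthy: bool) -> str:
    """Advance one phase only when the current phase met its success criteria."""
    names = [name for name, _ in PHASES]
    idx = names.index(current)
    if not previous_phase_healthy or idx == len(names) - 1:
        return current  # hold position: never expand on bad signals
    return names[idx + 1]
```

Encoding the progression this way makes the safety rule explicit: there is no code path that jumps from staging to full production, and any unhealthy signal freezes the rollout in place.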
A presentation slide titled "Running Safe Chaos Experiments" showing a four-step rollout (1: Development environment, 2: Staging environment, 3: Production (canary), 4: Production (expanded)) with brief test actions and risk scopes under each step. A banner reads "Start Small, Think Big" and each step notes the target audience or exposure level (only developers to broader but controlled).
A safe chaos experiment follows a clear, repeatable sequence:
  1. Define a hypothesis (what you expect to happen).
  2. Set the blast radius (who or what is affected).
  3. Establish success criteria and metrics.
  4. Plan rollback and communication paths.
  5. Execute the experiment.
  6. Monitor real-time metrics and alerts.
  7. Evaluate results and iterate.
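The sequence above can be sketched as a small experiment runner. The `Experiment` dataclass and the `inject`/`measure`/`rollback` callables are hypothetical names for illustration; a real harness would wire these to actual fault injectors and monitoring queries:

```python
from dataclasses import dataclass, field

@dataclass
class Experiment:
    hypothesis: str              # step 1: what you expect to happen
    blast_radius: float          # step 2: fraction of the fleet affected
    success_criteria: dict       # step 3: metric name -> acceptable ceiling
    log: list = field(default_factory=list)

def run_experiment(exp: Experiment, inject, measure, rollback) -> bool:
    """Steps 4-7: execute, monitor, evaluate, and roll back on failure."""
    exp.log.append(f"hypothesis: {exp.hypothesis}")
    inject()                      # step 5: introduce the controlled fault
    observed = measure()          # step 6: read real-time metrics
    passed = all(observed[m] <= limit
                 for m, limit in exp.success_criteria.items())
    if not passed:
        rollback()                # planned rollback path, not an improvisation
        exp.log.append("rolled back")
    exp.log.append(f"result: {'pass' if passed else 'fail'}")
    return passed

# Usage with stubbed-out fault injection and metrics:
exp = Experiment(
    hypothesis="p99 latency stays under 300 ms when one cache node is lost",
    blast_radius=0.05,
    success_criteria={"error_rate": 0.01, "p99_latency_ms": 300},
)
run_experiment(exp,
               inject=lambda: None,
               measure=lambda: {"error_rate": 0.002, "p99_latency_ms": 180},
               rollback=lambda: None)
```

Note that the hypothesis, blast radius, and success criteria are all declared before anything is injected; the runner refuses to conflate planning with execution, which is the core of the repeatable sequence.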
A slide titled "Running Safe Chaos Experiments" showing a nine-step "Experiment Execution Sequence." It lists steps from defining a hypothesis and setting a blast radius to executing the experiment, monitoring the system, evaluating results, and rolling back if needed.

Guardrails — keep experiments controlled

Common safety measures:
Guardrail | Recommendation
Blast radius | Keep it small; a common rule of thumb is ≤ 10% of capacity
Time limits | Run experiments for minutes, not hours
Kill switch | Always have an immediate rollback or automated cutoff
Monitoring | Instrument with clear metrics, dashboards, and alerts
Communication | Notify stakeholders before, during, and after experiments
A slide titled "Running Safe Chaos Experiments" that lists five experiment safety measures. The measures are Blast Radius, Time Limits, Kill Switch, Monitoring, and Communication, each with a brief description.
Design experiments to be reversible, observable, and time-bounded. If you can't observe the impact or roll back quickly, pause the experiment until those controls exist.
Never run an experiment without a tested kill switch and real-time monitoring. Unobserved or irreversible tests can create major outages.
These guardrails separate disciplined chaos from reckless disruption. Instrumentation, observability, and communication are prerequisites—not optional extras.

Real-world origin: Netflix and Chaos Monkey

Netflix’s migration to AWS exposed production failures that weren’t reproducible in traditional tests. In response, they created Chaos Monkey, a tool that randomly terminated instances during business hours to force teams to build services that recover automatically.
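The essence of Chaos Monkey's selection logic can be sketched in a few lines. This is not Netflix's implementation, just an illustration of the two rules described here: termination is random, and it only happens during business hours when engineers are present to respond. All names and the 9:00-17:00 window are assumptions for the example:

```python
import random
from datetime import datetime
from typing import List, Optional

def pick_victim(instances: List[str], now: datetime,
                probability: float = 0.1,
                seed: Optional[int] = None) -> Optional[str]:
    """Randomly select one instance to terminate, but only during
    business hours (9:00-17:00, Mon-Fri)."""
    if now.weekday() >= 5 or not 9 <= now.hour < 17:
        return None  # off-hours: never add chaos nobody can watch
    rng = random.Random(seed)  # seedable for repeatable dry runs
    if rng.random() > probability:
        return None  # most runs terminate nothing at all
    return rng.choice(instances)
```

The business-hours gate is the culturally important part: the tool forces failures precisely when teams can observe and fix them, which is what drove services to recover automatically.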
A slide titled "Real-World Example - Netflix" showing the Chaos Monkey logo in the center with an alert labeled "Production failures." It also shows a "Data centers" icon on the left and an "aws" icon on the right with the caption "Pull the plug."
The cultural shift was significant: engineers began designing services explicitly to tolerate failures rather than being surprised by them. Netflix expanded Chaos Monkey into a broader “Simian Army” that tests instances, networking, security, and data reliability. Over time, chaos experiments helped prevent more outages than they caused.
A horizontal timeline infographic titled "Real-World Example – Netflix: The Results" with colored chevron arrows marking milestones from 2010 to 2020 (e.g., 2010 Chaos Monkey finds failures; 2012 Simian Army expands; 2014 Chaos prevents outages; 2016 survives AWS outage; 2020 99.97% uptime).
Key outcomes from Netflix’s experience:
  1. Chaos reveals problems that traditional testing misses.
  2. Small, controlled failures reduce the risk of large uncontrolled outages.
  3. Regular experiments build engineering confidence—teams stop fear-driven avoidance of production.
  4. The best experiments are invisible to users: customers don’t notice, but systems become more robust.
A presentation slide titled "Real-World Example – Netflix: Key Insights" with a lightbulb graphic. It lists four numbered chaos-engineering takeaways: chaos reveals real problems, small controlled failures prevent big ones, chaos builds engineer confidence, and the best experiments are invisible to users.

Operational principles and maturity

Netflix codified several operational principles to make chaos practical:
  • Run chaos during business hours (failures happen then).
  • Integrate chaos into deployment pipelines.
  • Treat chaos as a continuous practice, not a one-off stunt.
  • Measure everything—without metrics you cannot build confidence.
A slide titled "Real-World Example – Netflix Chaos Principles" showing four numbered cards with guidelines: run chaos during business hours; make chaos part of every deployment; treat chaos as a practice, not an event; and measure everything. The slide includes a © Copyright KodeKloud notice.
Netflix’s chaos maturity model gives teams a path from manual experiments to a resilience-driven culture:
Level | Name | Description
1 | Manual chaos | Engineers trigger failures by hand
2 | Automated chaos | Tools inject failures automatically
3 | Continuous chaos | Experiments run as part of daily operations
4 | Chaos as code | Tests are versioned and integrated into CI/CD
5 | Chaos culture | Resilience thinking is embedded across the org
A colorful five-level pyramid titled "Real-World Example – Netflix: Chaos Maturity Model" with tiers labeled from Level 1: Manual chaos (top) down to Level 5: Chaos culture (base). Each layer is a different color and represents increasing chaos-maturity stages.
Advancing maturity means shifting from ad hoc experiments to measurable, repeatable practices that are part of standard engineering workflows.

Summary

Chaos Engineering is a disciplined, scientific practice for discovering failure modes and improving resilience. When done safely—using small blast radii, defined hypotheses, strong observability, and tested rollback plans—chaos experiments become a powerful tool to reduce outage risk and raise operational confidence. A related advanced topic is the trade-off between cost efficiency and reliability, which ties engineering decisions to business impact and requires its own set of practices and measurements.