Advocating simplicity in system design to reduce operational complexity, risk, and toil by measuring, justifying, and minimizing unnecessary dependencies and abstractions for more reliable, operable systems.
Earlier we focused on service level objectives: how to define them, measure them, and track system health over time. Meeting your SLOs consistently is not only about monitoring and alerts; it’s also about managing the complexity, risk, and operational effort needed to keep systems running. As systems grow, they become harder to operate: features ship faster, dependencies accumulate, and infrastructure becomes more distributed. That increases risk, operational toil, and fragility.
This lesson explains why simplicity in system design matters, how complexity accumulates, and pragmatic strategies to keep systems understandable, operable, and reliable.
Simplicity is not optional. Treat it as a measurable design objective: prefer solutions that are easier to understand, operate, and repair.
A short story: imagine your team is debugging a production outage in a system with five layers, Kubernetes, a service mesh, and multiple asynchronous job pipelines. Someone asks, “Why did we build it this way?” Often the only answer is silence. In SRE, system complexity is one of the greatest threats to reliability. The first rule: verify that the complexity is necessary.
Why complexity matters
Complexity increases attack surface and failure modes, raises cognitive load during incidents, and slows delivery. Each unnecessary abstraction or dependency is a lasting “complexity tax” that teams pay in slower troubleshooting, more manual work, and increased on-call stress.
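One way to make the "complexity tax" concrete is to look at how availability compounds along a request path. The sketch below assumes every dependency is on the critical path and that failures are independent, which real systems only approximate. The 99.9% figure and the function names are illustrative assumptions, not values from this lesson.

```python
def composite_availability(dependency_availabilities):
    """Availability of a request path that needs every dependency to be up.

    Assumes independent failures and that each dependency is required
    (a simplifying assumption, not a measured model).
    """
    result = 1.0
    for availability in dependency_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= availability
    return result


def expected_downtime_minutes(availability, days=30):
    """Expected downtime over a window of `days`, in minutes."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    # Hypothetical: every dependency individually meets a 99.9% target.
    for count in (1, 5, 20):
        a = composite_availability([0.999] * count)
        print(f"{count:>2} dependencies: {a:.4%} available, "
              f"~{expected_downtime_minutes(a):.0f} min down per 30 days")
```

Under these assumptions, twenty required dependencies at 99.9% each leave the overall path at roughly 98%, so each additional hard dependency spends error budget before any feature work does.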
What drives complexity?
Many common engineering patterns unintentionally increase complexity. The table below maps common drivers to their practical effects so teams can spot them early.
| Driver | Typical effect |
| --- | --- |
| Feature accumulation without cleanup | Old, unmaintained paths increase failure modes and cognitive load |
| Premature optimization | Adds architectural complexity before requirements are clear |
| Distributed architectures with poor boundaries | More services to coordinate, harder debugging and ownership |
| Stacking many technologies unnecessarily | Harder to operate, maintain, and onboard new engineers |
| Resume-driven development | Selecting tech to impress rather than to solve the problem |
Real-world consequences
Complex systems don’t just complicate engineering life—they can cause catastrophic business outcomes:
2012: Knight Capital’s legacy, tangled code paths caused an algorithm to trigger old trading logic. Uncontrolled trades produced a $440 million loss in 45 minutes and nearly bankrupted the company—an extreme example of brittle complexity causing huge financial damage.
2011: Target’s Black Friday outage was driven by tightly coupled, complex systems. During peak traffic they couldn’t quickly identify or fix the root cause, costing an estimated $61 million in lost sales and damaging customer trust.
2021: A complex networking dependency at a major social platform led to a global outage that impacted thousands of microservices. The incident slowed deployments and complicated incident response. In response, the company clarified service ownership, rolled some microservices back into larger units, invested in dependency-visualization tooling, and re-evaluated trends such as applying the same microservice patterns to AI/ML workloads.
Resume-driven development: a common trap
Warning signs you’re over-engineering include choosing Kubernetes for a tiny app, adopting microservices because a large-scale company uses them, or rebuilding established components instead of using proven tools. Avoid selecting technology to pad a resume; pick the simplest tool that meets the requirements.
Make simplification a measurable objective
Senior engineers are often most respected for simplifying systems. Removing 1,000 lines of brittle code can be more valuable than adding new features. Use a practical simplicity hierarchy when deciding how to approach problems:
| Step | Action |
| --- | --- |
| 1 | Eliminate the problem entirely when possible |
| 2 | Use an existing, proven solution (don’t reinvent wheels) |
| 3 | Simplify the solution significantly (fewer moving parts) |
| 4 | If complexity remains, require clear business justification |
Always start by asking whether the problem itself can be removed before designing a new solution.
Red flags that demand simplification
If any of the following are true, prioritize reducing complexity:
Engineers can’t explain the system with a few clear sentences.
Small changes require coordinating many teams and components.
Regular changes produce unexpected side effects.
Dependency chains are deep and hard to trace.
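The last red flag can be turned into a rough measurement. The sketch below computes the longest downstream call chain for each service in a small dependency map and flags those above a threshold. The service names, the `SERVICES` map, and the `max_depth` of 3 are all hypothetical. A real inventory would come from tracing data or a service catalog, and a large graph would want memoization.

```python
def max_dependency_depth(graph, service, _visiting=None):
    """Longest chain of downstream calls starting at `service`.

    `graph` maps a service name to the services it calls directly.
    A service with no dependencies has depth 0. Raises ValueError on cycles.
    """
    visiting = _visiting if _visiting is not None else set()
    if service in visiting:
        raise ValueError(f"dependency cycle involving {service!r}")
    visiting.add(service)
    try:
        children = graph.get(service, [])
        return max(
            (1 + max_dependency_depth(graph, child, visiting) for child in children),
            default=0,
        )
    finally:
        visiting.discard(service)


def flag_deep_chains(graph, max_depth):
    """Return {service: depth} for services whose chain exceeds `max_depth`."""
    depths = {name: max_dependency_depth(graph, name) for name in graph}
    return {name: depth for name, depth in depths.items() if depth > max_depth}


# Hypothetical service map, for illustration only.
SERVICES = {
    "web": ["api"],
    "api": ["auth", "orders"],
    "orders": ["inventory", "payments"],
    "payments": ["fraud-check"],
    "fraud-check": ["ledger"],
}

if __name__ == "__main__":
    for name, depth in flag_deep_chains(SERVICES, max_depth=3).items():
        print(f"{name}: dependency chain depth {depth}")
```

Tracking a number like this over time is one way to notice chains getting deeper before they become hard to trace during an incident. The right threshold depends on the system.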
When is complexity justified?
Some domains are inherently complex. Keep complexity only when it is essential, well-contained, and justified by clear business value.
| Condition | What to verify |
| --- | --- |
| Inherent domain complexity | The problem genuinely requires complex logic (e.g., distributed consensus) |
| Contained boundaries | Complexity is isolated behind clear interfaces and ownership |
| High business value | Complexity directly enables revenue or critical features |
| Proven & operational | The approach is stable, monitored, and operationally manageable |
Prefer essential complexity (inherent to the problem) over accidental complexity (introduced by over-engineering).
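The four conditions above could be recorded as an explicit checklist during an architecture review. This is a minimal sketch: the `ComplexityReview` class and its field names are hypothetical, and treating all four conditions as required is an assumption about how a team might apply the table.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ComplexityReview:
    """One reviewer's answers for a proposed complex component (hypothetical)."""
    inherent_domain_complexity: bool
    contained_boundaries: bool
    high_business_value: bool
    proven_and_operational: bool

    def unmet_conditions(self):
        """Names of the conditions that were answered False."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_justified(self):
        """Assumption: complexity is kept only if every condition holds."""
        return not self.unmet_conditions()


if __name__ == "__main__":
    review = ComplexityReview(
        inherent_domain_complexity=True,
        contained_boundaries=False,
        high_business_value=True,
        proven_and_operational=True,
    )
    if not review.is_justified():
        print("Simplify first; unmet:", ", ".join(review.unmet_conditions()))
```

Writing the answers down keeps the justification visible later, when someone asks why the system was built this way.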
Summary and next steps
Simplicity in system design reduces fragility, accelerates incident response, and lowers operational cost. Make simplicity an explicit design goal: measure it, incentivize it, and enforce it during architecture reviews. The next lesson will dive into dependency management: how to spot risky dependencies, visualize them, and manage them to reduce fragility and operational overhead.