
- Resilience: the system’s ability to recover from failures—restarting processes, requeuing work, degrading gracefully, or automatically reconciling state.
- Fault tolerance: the system’s ability to continue operating correctly even when parts fail—through redundancy, graceful degradation, and isolation.
Common failure modes these properties guard against include:
- Services lock up or become unresponsive (stuck threads, deadlocks).
- Downstream components become unavailable (network partitions, crashed services).
- Data corruption or race conditions under concurrent load.
- Cascading failures when one overloaded service causes others to fail.
Resilience and fault tolerance are complementary: resilience focuses on recovery and adaptation after failures; fault tolerance focuses on continuing correct operation despite failures. Designing both into your system reduces downtime and limits user impact when things go wrong.
| Strategy | Purpose | Typical implementation / example |
|---|---|---|
| Timeouts | Fail fast when operations take too long | Configure client/server timeouts (HTTP, DB) to avoid blocking threads |
| Retries with exponential backoff | Recover from transient failures without overwhelming services | Retry on 5xx or network errors with jittered exponential backoff |
| Circuit breakers | Prevent repeated calls to failing downstream services | Track error rates, open circuit after threshold, probe for recovery |
| Bulkheads | Isolate failures to a subset of resources | Separate thread pools/queues per downstream dependency or tenant |
| Rate limiting / Throttling | Protect services from overload | API gateway limits requests per client/IP or globally |
| Idempotency & safe retries | Ensure repeated attempts don’t cause incorrect side effects | Use idempotency keys, transactional semantics, or record deduplication |
| Monitoring & alerting | Detect failure modes early and enable response | Instrument metrics, traces, logs; create alerts for SLO/SLA breaches |
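The first two rows of the table can be combined into one small sketch: bound each attempt with a timeout, and retry transient failures with jittered exponential backoff. The function name, parameters, and exception set below are illustrative assumptions, not the API of any particular library.

```python
import random
import time

def retry_with_backoff(operation, *, max_attempts=4, base_delay=0.1,
                       max_delay=2.0,
                       retryable=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call `operation`, retrying transient errors with jittered
    exponential backoff. Re-raises the last error once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the capped
            # exponential delay, so synchronized clients don't retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

In practice each attempt is additionally bounded by a client-side timeout (for example an HTTP or DB driver timeout), and `retryable` should list only errors known to be transient, so persistent failures surface immediately.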
- Combine patterns: use timeouts + retries + circuit breakers for robust remote calls.
- Make retries safe: design APIs and operations to be idempotent or use deduplication keys.
- Use bulkheads to prevent a failing dependency from exhausting shared resources (threads, connections).
- Add observability: metrics, distributed tracing, and structured logs are essential for diagnosing intermittent issues.
- Test with chaos experiments (chaos engineering) and load testing to reveal behavior under failure scenarios.
- Design for graceful degradation: return reduced functionality rather than complete failure when components degrade.
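The bulkhead point above can be made concrete with a semaphore instead of separate thread pools: cap how many calls may be in flight to one dependency and fail fast when the cap is hit. This is a minimal sketch; the class name and rejection behavior are illustrative assumptions.

```python
import threading

class Bulkhead:
    """Limit concurrent calls to a single dependency so a slow or
    failing dependency cannot exhaust shared resources. Rejects
    immediately when all slots are taken rather than queueing."""
    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, operation):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return operation()
        finally:
            self._slots.release()
```

Creating one `Bulkhead` per downstream dependency (or per tenant) means a stalled dependency can saturate only its own slots while the rest of the service keeps working.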
Be careful with naive retries: aggressive retries can amplify load and cause cascading failures. Always use backoff, jitter, and circuit breakers—and ensure operations are idempotent when possible.
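One way to make retries safe, as a sketch: record results under an idempotency key so a duplicate attempt replays the stored result instead of repeating the side effect. The in-memory dict here is a stand-in assumption; production code would persist keys transactionally alongside the side effect.

```python
class IdempotentProcessor:
    """Apply each operation at most once per idempotency key."""
    def __init__(self):
        self._results = {}

    def process(self, key, operation):
        if key in self._results:        # duplicate, e.g. a client retry
            return self._results[key]   # replay the recorded result
        result = operation()
        self._results[key] = result
        return result
```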
When to reach for each pattern:
- Timeouts: always for external IO (HTTP, DB, RPC).
- Retries: for transient network or service errors, not for persistent failures.
- Circuit breakers: when a downstream service has intermittent high latency or error spikes.
- Bulkheads: in multi-tenant systems or when a single dependency can monopolize resources.
- Rate limiting: at service boundaries (API gateways) and for abusive clients.
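To illustrate the circuit-breaker entry above, here is a minimal closed/open/half-open state machine. It is a sketch under simplifying assumptions (consecutive-failure counting, a single probe call); libraries such as Resilience4j track rolling error rates instead.

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, fails fast
    while open, and allows one probe call after `reset_timeout` seconds
    (the half-open state). A successful call closes the circuit."""
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let this probe call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

The injectable `clock` keeps the sketch testable; real deployments pair the breaker with metrics so operators can see open/close transitions.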
Further reading:
- Kubernetes Documentation — reliability patterns for distributed systems.
- Circuit Breaker pattern (Martin Fowler) — conceptual guide.
- Designing Data-Intensive Applications — patterns for resilience and fault tolerance.
- Resilience4j — modern Java library implementing circuit breakers, bulkheads, retries.