Okay. We’ve completed some initial project analysis.
[Slide: "Resilience & Fault Tolerance" (Demo), © KodeKloud]
Key concepts covered in this lesson include design patterns, SOLID principles for composing systems, and error/exception flow analysis. These help you predict and design for what happens when individual components fail, become slow, or behave incorrectly.
Definitions
  • Resilience: the system’s ability to recover from failures—restarting processes, requeuing work, degrading gracefully, or automatically reconciling state.
  • Fault tolerance: the system’s ability to continue operating correctly even when parts fail—through redundancy, graceful degradation, and isolation.
Common failure modes to consider
  • Services lock up or become unresponsive (stuck threads, deadlocks).
  • Downstream components become unavailable (network partitions, crashed services).
  • Data corruption or race conditions under concurrent load.
  • Cascading failures when one overloaded service causes others to fail.
A simple sign of a hung interactive process is a shell prompt that stalls:
jeremy@MACSTUDIO Express-login-demo %
If a prompt like this never advances or never comes back after you run a command, a user-facing process is probably blocked or waiting on I/O. That is one of the symptoms resilience and fault-tolerance techniques aim to catch and mitigate.
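A hang like this is what a timeout is meant to bound. The following is a minimal sketch, assuming the blocking work can run in a worker thread; `call_with_timeout` and `slow_lookup` are hypothetical names used only for illustration.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s, *args):
    """Run func(*args) in a worker thread and stop waiting after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(func, *args)
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Fail fast so the caller is not blocked indefinitely.
        raise TimeoutError(f"operation exceeded {timeout_s}s") from None
    finally:
        # Do not wait for a stuck worker; it keeps running in the background.
        executor.shutdown(wait=False)


def slow_lookup():
    """Hypothetical stand-in for a dependency that has stalled."""
    time.sleep(0.3)
    return "late"
```

Note that this only stops the caller from waiting. The worker thread itself is not cancelled, so real clients usually also set the timeout options that the HTTP or database library provides.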
Resilience and fault tolerance are complementary: resilience focuses on recovery and adaptation after failures; fault tolerance focuses on continuing correct operation despite failures. Designing both into your system reduces downtime and limits user impact when things go wrong.
Core strategies to improve resilience and fault tolerance
| Strategy | Purpose | Typical implementation / example |
| --- | --- | --- |
| Timeouts | Fail fast when operations take too long | Configure client/server timeouts (HTTP, DB) to avoid blocking threads |
| Retries with exponential backoff | Recover from transient failures without overwhelming services | Retry on 5xx or network errors with jittered exponential backoff |
| Circuit breakers | Prevent repeated calls to failing downstream services | Track error rates, open circuit after threshold, probe for recovery |
| Bulkheads | Isolate failures to a subset of resources | Separate thread pools/queues per downstream dependency or tenant |
| Rate limiting / Throttling | Protect services from overload | API gateway limits requests per client/IP or globally |
| Idempotency & safe retries | Ensure repeated attempts don’t cause incorrect side effects | Use idempotency keys, transactional semantics, or record deduplication |
| Monitoring & alerting | Detect failure modes early and enable response | Instrument metrics, traces, logs; create alerts for SLO/SLA breaches |
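To make the circuit breaker row concrete, here is a minimal sketch. It is an assumption-laden simplification: it counts consecutive failures rather than tracking an error rate, it is not thread-safe, and the names `CircuitBreaker` and `CircuitOpenError` are hypothetical rather than taken from any library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: let this call through as a probe (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed probe.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is known to be failing, which is also a natural place to return a fallback response.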
Practical guidance
  • Combine patterns: use timeouts + retries + circuit breakers for robust remote calls.
  • Make retries safe: design APIs and operations to be idempotent or use deduplication keys.
  • Use bulkheads to prevent a failing dependency from exhausting shared resources (threads, connections).
  • Add observability: metrics, distributed tracing, and structured logs are essential for diagnosing intermittent issues.
  • Test with chaos experiments (chaos engineering) and load testing to reveal behavior under failure scenarios.
  • Design for graceful degradation: return reduced functionality rather than complete failure when components degrade.
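The "make retries safe" advice above can be illustrated with an idempotency key. This is a sketch only: `PaymentService` is a hypothetical in-memory service, and a real system would keep the key-to-result mapping in durable storage, typically with an expiry.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> result of the first attempt
        self.charges = []    # side effects actually performed

    def charge(self, idempotency_key, amount_cents):
        # A repeated key means the client is retrying: return the stored
        # result instead of performing the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        result = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```

With this in place a client can resend the same request after a timeout, reusing the same key, without risking a double charge.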
Be careful with naive retries: aggressive retries can amplify load and cause cascading failures. Always use backoff, jitter, and circuit breakers—and ensure operations are idempotent when possible.
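One way to avoid naive retries is capped exponential backoff with jitter. The sketch below assumes that retryable failures are signalled by a hypothetical `TransientError`; any other exception is treated as persistent and is not retried. The `sleep` parameter exists only so the delay can be replaced in tests.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx responses)."""


def retry_with_backoff(func, max_attempts=4, base_delay_s=0.1, max_delay_s=2.0,
                       sleep=time.sleep):
    """Call func(), retrying on TransientError with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # The upper bound doubles each attempt, capped at max_delay_s.
            bound = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Full jitter: a random delay in [0, bound] spreads out retry bursts.
            sleep(random.uniform(0, bound))
```

The jitter matters as much as the backoff: without it, many clients that failed at the same moment would all retry at the same moment too.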
When to apply each pattern
  • Timeouts: always for external IO (HTTP, DB, RPC).
  • Retries: for transient network or service errors, not for persistent failures.
  • Circuit breakers: when a downstream service has intermittent high latency or error spikes.
  • Bulkheads: in multi-tenant systems or when a single dependency can monopolize resources.
  • Rate limiting: at service boundaries (API gateways) and for abusive clients.
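As an illustration of rate limiting at a service boundary, here is a minimal token-bucket sketch. The class name `TokenBucket` and its parameters are assumptions for this example, and the code is not thread-safe.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_s` requests."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock              # injectable so tests can control time
        self.tokens = float(capacity)   # start with a full bucket
        self.updated_at = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For per-client limits, a gateway could keep one bucket per client ID or IP address in a dictionary and answer rejected requests with HTTP 429.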
Use these principles and patterns to design systems that recover quickly, limit blast radius, and keep users’ critical flows available even under stress.
