How to identify, classify, and mitigate service dependencies using resilience patterns like circuit breakers, fallbacks, and bulkheads to reduce blast radius and improve system reliability.
Welcome back. Now that we’ve explored why simplicity matters, this lesson focuses on managing dependencies — everything your system relies on: third‑party libraries, external APIs, internal services, and infrastructure components. Left unmanaged, dependencies increase complexity and the risk of outages, regressions, and operational toil.

Think about shipping a feature only to discover a transitive library introduced a breaking change deep in your stack. You aren’t just running your own code: every dependency expands your risk surface. Common failure modes include:
Availability coupling — if a dependency is down, your service might be down too.
Latency coupling — slow dependencies can determine your response time.
Cascading failures — one failure can trigger a domino effect across components.
Capacity coupling — a dependency under pressure can overwhelm your system (or vice versa).
Well-managed dependencies reduce these risks. Common resilience patterns include circuit breakers, fallbacks and graceful degradation, bulkheads, and other isolation strategies. These patterns help contain failures, preserve core functionality, and make recovery predictable. We’ll expand on each pattern and show how to apply them to a sample system.
Dependency types

Use the following classification to reason about impact, operational requirements, and mitigation costs.

| Dependency type | What it is | Reliability/ops concerns |
| --- | --- | --- |
| Direct dependencies | Your component calls another component directly | Immediate availability and latency coupling |
| Indirect dependencies | Dependency via a chain of calls | Harder to observe and reason about; transitive failures |
| Runtime dependencies | Services required when the app runs (APIs, DBs, caches) | Live availability, connection pooling, timeouts |
| Build‑time dependencies | Libraries, frameworks, CI/CD tooling used to build/deploy | Supply-chain, reproducibility, and build-time failures |
Blast radius and prioritization

Blast radius measures how many services, users, or business capabilities are affected when a dependency fails. Estimating blast radius helps prioritize resilience work. Consider:
Dependent services — how many services rely on this dependency?
Criticality of dependent paths — are core user journeys impacted?
Traffic volume — how much user activity traverses the dependency?
Recovery time — how quickly can the system be restored?
Use these factors to decide which dependencies deserve investment (e.g., highly critical + high traffic = top priority).
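The prioritization above can be sketched as a simple weighted score. The weights and the per‑dependency ratings below are illustrative assumptions, not figures from this lesson:

```python
# Illustrative blast-radius scoring. Each factor is rated 1-5;
# the weights are an assumption you would tune for your own system.
WEIGHTS = {"dependents": 0.3, "criticality": 0.4, "traffic": 0.2, "recovery": 0.1}

# Hypothetical ratings for two dependencies from the demo stack.
deps = {
    "postgres": {"dependents": 5, "criticality": 5, "traffic": 5, "recovery": 4},
    "jaeger":   {"dependents": 2, "criticality": 1, "traffic": 3, "recovery": 1},
}

def blast_radius_score(ratings):
    """Weighted sum of the four prioritization factors."""
    return sum(WEIGHTS[factor] * ratings[factor] for factor in WEIGHTS)

# Rank dependencies so resilience work goes to the highest-impact ones first.
ranked = sorted(deps, key=lambda name: blast_radius_score(deps[name]), reverse=True)
```

Here the database ranks above tracing, matching the intuition that highly critical, high-traffic dependencies deserve investment first.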
Applying this to the KodeKloud Record Store

Key runtime dependencies for the KodeKloud Record Store include the API service, PostgreSQL database, RabbitMQ for messaging, Celery workers for background tasks, and observability components like Prometheus and Jaeger. Mapping these dependencies reveals which components are most critical and the potential blast radius.
The diagram shows example ports for the demo environment. Common defaults are: PostgreSQL 5432, Grafana 3000, RabbitMQ 5672, and Loki 3100. Always verify and use the ports configured for your environment.
Example mitigations for such dependencies include queueing for later delivery, retry logic, and batched or manual fallbacks.
Dependency management strategies

Circuit breakers

Circuit breakers stop repeated calls to a failing dependency and allow your system to fail fast rather than hanging while waiting. They generally have three states:
Closed — calls proceed normally.
Open — calls are blocked because failures exceeded a threshold.
Half‑open — a limited number of test calls are allowed to see if the dependency recovered.
Use mature libraries when possible, such as Resilience4j for Java or PyBreaker for Python.
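To make the three states concrete, here is a minimal hand-rolled circuit breaker in Python. It is an illustrative sketch, not the PyBreaker API; in production prefer a maintained library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after enough failures,
    half-open after a cool-down, closed again once a trial call succeeds."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds before half-open
        self.failure_count = 0
        self.opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"  # allow a trial call through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            # Any success (including a half-open trial) closes the circuit.
            self.failure_count = 0
            self.state = "closed"
            return result
```

While the circuit is open, callers get an immediate error instead of waiting on a dependency that is known to be failing, which keeps threads and connections free for healthy work.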
Fallbacks and graceful degradation

Fallbacks offer alternative behaviors when dependencies fail so your service can preserve core value. Key concepts:
Graceful degradation — keep core functionality even if enhancements fail.
Partial availability — serve essential data or simplified UI.
Functional core vs enhancement shell — separate must-haves from optional features.
Progressive enhancement — enable extras only when resources permit.
Common fallbacks:
Return cached data when a backend is unavailable.
Provide simplified functionality or default values.
Queue work for later processing (retry/queueing).
Route to manual processes or notifications when automation fails.
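The first two fallbacks above can be combined in a small helper. This is a sketch; `fetch` stands in for a hypothetical backend call, and the cache and default values are assumptions for illustration:

```python
import time

_cache = {}  # last known good responses, keyed by product id

def get_product(product_id, fetch, ttl=300):
    """Try the live backend first; fall back to recently cached data,
    and finally to a safe default so the page still renders."""
    try:
        data = fetch(product_id)
        _cache[product_id] = (time.monotonic(), data)  # refresh cache on success
        return data
    except Exception:
        cached = _cache.get(product_id)
        if cached is not None and time.monotonic() - cached[0] < ttl:
            return cached[1]  # serve stale-but-recent data
        # Default value: a degraded but well-formed response.
        return {"id": product_id, "name": "unavailable", "price": None}
```

The caller never sees the backend failure directly: it gets fresh data, recent data, or a degraded default, in that order of preference.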
Bulkheads

Bulkheads isolate resources and failure domains so problems in one area don’t sink the whole system. Implementations include:
Separate thread pools for different dependencies.
Deploying critical functions as independent services.
Dedicated databases or connection pools by domain.
Partitioning requests by user type or criticality.
These approaches reduce resource contention and limit blast radius.
Benefits of bulkheads:
Prevents resource exhaustion from spreading.
Allows partial system functionality during outages.
Creates clear isolation boundaries.
Simplifies testing and deployment and improves overall resilience.
Bulkhead examples:
Separate task queues for different workloads.
Independent connection pools per domain or service.
Per‑dependency resource limits (CPU, threads).
Isolated failure domains for critical vs non‑critical services.
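The separate-thread-pool bulkhead can be sketched in a few lines of Python. The service names and pool sizes are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# One executor per dependency: a slow payments API cannot exhaust
# the threads that serve catalog reads, and vice versa.
catalog_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="catalog")
payments_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="payments")

def query_catalog(query):
    # Placeholder for a call to a hypothetical catalog service.
    return f"results for {query}"

def charge_card(order_id):
    # Placeholder for a call to a hypothetical payments provider.
    return f"charged order {order_id}"

# Work submitted to one pool is isolated from the other:
# saturating payments_pool leaves catalog_pool free to serve reads.
catalog_future = catalog_pool.submit(query_catalog, "vinyl")
payment_future = payments_pool.submit(charge_card, 42)
```

Sizing each pool caps how much of the process a single slow dependency can consume, which is exactly the blast-radius limit bulkheads are meant to provide.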
Real‑world example: Amazon S3 outage (Feb 28, 2017)

What happened: an engineer debugging a billing issue executed a command intended to remove a small set of servers from an S3 subsystem. Due to incorrect input, a much larger set was removed. The removal forced two critical subsystems (index and placement) to restart, making S3 unavailable in the US‑East‑1 region for about 3.5 hours. The outage’s blast radius extended to services that relied on S3 metadata and storage.
Lessons learned and mitigations

Even mature systems are vulnerable to simple human error. Amazon implemented several mitigations after the incident:
Throttle capacity-removal operations to avoid large accidental removals.
Add safeguards to block removals that would violate minimum capacity rules.
Improve operational tooling and guardrails to reduce human error.
Strengthen dashboards and health checks for faster, safer recovery.
Harden dependency and failover management to reduce blast radius.
Wrap up

Managing dependencies requires deliberate mapping, prioritization, and design for failure. Apply these steps:
Map dependencies and classify them by criticality and blast radius.
Prioritize investments where risk and impact are highest.
Use patterns—circuit breakers, fallbacks, bulkheads, graceful degradation—to contain failures.
Automate guards and deploy operational tooling and observability to detect and recover quickly.
In the next lesson, we’ll explore change management and safe deployment practices.