Welcome back. Now that we’ve explored why simplicity matters, this lesson focuses on managing dependencies — everything your system relies on: third‑party libraries, external APIs, internal services, and infrastructure components. Left unmanaged, dependencies increase complexity and the risk of outages, regressions, and operational toil. Think about shipping a feature only to discover a transitive library introduced a breaking change deep in your stack. You aren’t just running your own code: every dependency expands your risk surface. Common failure modes include:
  • Availability coupling — if a dependency is down, your service might be down too.
  • Latency coupling — slow dependencies can determine your response time.
  • Cascading failures — one failure can trigger a domino effect across components.
  • Capacity coupling — a dependency under pressure can overwhelm your system (or vice versa).
Slide: "The Dependency Challenge", listing the four coupling risks above, each with an icon and a one‑line explanation.
Well-managed dependencies reduce these risks. Common resilience patterns include circuit breakers, fallbacks and graceful degradation, bulkheads, and other isolation strategies. These patterns help contain failures, preserve core functionality, and make recovery predictable. We’ll expand on each pattern and show how to apply them to a sample system.
Slide: "The Dependency Challenge", showing four cards for the resilience strategies: Circuit Breakers, Fallbacks, Bulkheads, and Graceful Degradation.
Dependency types
Use the following classification to reason about impact, operational requirements, and mitigation costs.
  • Direct dependencies — your component calls another component directly. Concerns: immediate availability and latency coupling.
  • Indirect dependencies — reached through a chain of calls. Concerns: harder to observe and reason about; transitive failures.
  • Runtime dependencies — services required while the app runs (APIs, databases, caches). Concerns: live availability, connection pooling, timeouts.
  • Build‑time dependencies — libraries, frameworks, and CI/CD tooling used to build and deploy. Concerns: supply chain, reproducibility, build‑time failures.
Slide: "The Dependency Challenge", listing the four dependency types: Direct, Indirect, Runtime, and Build‑Time.
Blast radius and prioritization
Blast radius measures how many services, users, or business capabilities are affected when a dependency fails. Estimating blast radius helps prioritize resilience work. Consider:
  • Dependent services — how many services rely on this dependency?
  • Criticality of dependent paths — are core user journeys impacted?
  • Traffic volume — how much user activity traverses the dependency?
  • Recovery time — how quickly can the system be restored?
Use these factors to decide which dependencies deserve investment (e.g., highly critical + high traffic = top priority).
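One lightweight way to act on these factors is to fold them into a rough priority score. A minimal sketch in Python (the factor weights and the 0–100 scale are illustrative assumptions, not a standard formula):

```python
# Illustrative blast-radius scoring; weights and scale are assumptions.
def blast_radius_score(dependent_services, is_critical_path, traffic_share, recovery_minutes):
    """Return a rough 0-100 priority score for resilience investment."""
    score = 0
    score += min(dependent_services, 10) * 3      # up to 30: fan-out across services
    score += 30 if is_critical_path else 0        # 30: core user journeys impacted
    score += traffic_share * 20                   # up to 20: share of traffic (0.0-1.0)
    score += min(recovery_minutes, 60) / 60 * 20  # up to 20: slow recovery
    return round(score)

# Example: a database used by 8 services on the checkout path,
# carrying 90% of traffic, with a 45-minute recovery time.
print(blast_radius_score(8, True, 0.9, 45))  # prints 87
```

A dependency that scores high on every axis, like a checkout database, matches the intuition of "highly critical + high traffic = top priority".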
Slide: "Blast Radius Analysis", a table of the four factors above with short descriptions.
Applying this to the KodeKloud Records Store
Key runtime dependencies for the KodeKloud Records Store include the API service, PostgreSQL database, RabbitMQ for messaging, Celery workers for background tasks, and observability components like Prometheus and Jaeger. Mapping these dependencies reveals which components are most critical and the potential blast radius.
Diagram: KodeKloud Records Store dependencies, showing the web UI and microservices connected to messaging, database, and observability components: Web API (port 8000), RabbitMQ (5672), PostgreSQL (9432), Celery workers, Grafana (5000), Loki (3100), and Alertmanager (9093).
The diagram shows example ports for the demo environment. Common defaults are: PostgreSQL 5432, Grafana 3000, RabbitMQ 5672, and Loki 3100. Always verify and use the ports configured for your environment.
High‑level component map (conceptual)
KodeKloud Records Store
├── API Service (FastAPI)
│   ├── Routes/Endpoints
│   │   ├── /products - Product management
│   │   ├── /orders - Order management
│   │   ├── /checkout - Order processing
│   │   ├── /health - Health checks
│   │   ├── /trace-test - Diagnostic tracing
│   │   ├── /slow-operation - Latency simulation
│   │   └── /error-test - Error generation
│   ├── Database Connection
│   │   └── PostgreSQL Database
│   │       ├── Products Table
│   │       └── Orders Table
│   ├── Background Processing
│   │   ├── Celery Worker
│   │   │   ├── Process Order Task
│   │   │   └── Send Order Confirmation Task
│   │   └── RabbitMQ Message Queue
│   ├── Observability Stack
│   │   ├── Metrics Collection
│   │   │   ├── Prometheus
│   │   │   └── Pushgateway (for batch metrics)
│   │   ├── Logs Management
│   │   │   ├── Fluent Bit (collection)
│   │   │   └── Loki (storage)
│   │   ├── Tracing
│   │   │   └── Jaeger
│   │   ├── Monitoring
│   │   │   ├── Grafana (dashboards)
│   │   │   ├── Alertmanager (alerts)
│   │   │   └── Blackbox Exporter (synthetic testing)
│   │   └── Telemetry Instrumentation
│   │       ├── FastAPI Instrumentation
│   │       ├── SQLAlchemy Instrumentation
│   │       └── Celery Instrumentation
└── Infrastructure
    └── Docker Compose Environment
Classifying dependencies
Not all dependencies need the same level of investment. Classify them to focus mitigation efforts on what matters most:
Slide: "Dependency Classification", showing four tiers: Critical, Important, Non‑Critical, and External.
Example classification for the KodeKloud Records Store:
  • PostgreSQL — Critical. Mitigations: connection pooling, timeouts, read replicas, backups.
  • API service (FastAPI) — Critical. Mitigations: autoscaling, load balancing, liveness/readiness probes.
  • RabbitMQ — Important. Mitigations: replicated brokers, local queueing fallback, synchronous fallback.
  • Celery workers — Important. Mitigations: task timeouts, dead‑letter queues, isolated worker pools.
  • Monitoring stack (Prometheus, Grafana, Loki) — Non‑critical for core function. Mitigations: local buffering, reduced sampling, rate limiting.
  • Email notifications — Non‑critical. Mitigations: queue for later delivery, retry logic, batched/manual fallback.
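A classification like this can also live in code, where it can drive runtime policy such as per‑dependency call timeouts. A minimal sketch (the dependency names come from the lesson; the tier‑to‑timeout mapping and its values are illustrative assumptions):

```python
# Illustrative classification map; tiers and timeout values are assumptions.
DEPENDENCY_CLASSES = {
    "postgresql": "critical",
    "api-service": "critical",
    "rabbitmq": "important",
    "celery-workers": "important",
    "monitoring-stack": "non-critical",
    "email-notifications": "non-critical",
}

# Policy example: less critical dependencies get tighter timeouts,
# so optional features fail fast instead of holding up core requests.
TIMEOUT_SECONDS = {"critical": 5.0, "important": 2.0, "non-critical": 0.5}

def timeout_for(dependency):
    """Look up the call timeout; unknown dependencies are treated as non-critical."""
    return TIMEOUT_SECONDS[DEPENDENCY_CLASSES.get(dependency, "non-critical")]

print(timeout_for("rabbitmq"))  # prints 2.0
```

Keeping the classification in one place makes it easy to review and to apply consistently across services.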
Dependency management strategies
Circuit breakers
Circuit breakers stop repeated calls to a failing dependency and allow your system to fail fast rather than hang while waiting. They generally have three states:
  • Closed — calls proceed normally.
  • Open — calls are blocked because failures exceeded a threshold.
  • Half‑open — a limited number of test calls are allowed to see if the dependency recovered.
Use mature libraries when possible: Resilience4j for Java, PyBreaker for Python. Example pseudocode (illustrative):
# Illustrative circuit breaker (simplified; prefer a mature library in production)
import time

class CircuitBreaker:
    def __init__(self, failure_threshold, timeout_seconds):
        self.failures = 0
        self.state = "CLOSED"
        self.failure_threshold = failure_threshold
        self.open_since = None
        self.timeout_seconds = timeout_seconds

    def record_failure(self):
        self.failures += 1
        # A failure while HALF_OPEN, or too many while CLOSED, (re)opens the breaker.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.open_since = time.time()

    def reset(self):
        self.failures = 0
        self.state = "CLOSED"
        self.open_since = None

    def is_allowed(self):
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if time.time() - self.open_since >= self.timeout_seconds:
                self.state = "HALF_OPEN"  # cool-down elapsed: probe the dependency
                return True
            return False
        return True  # HALF_OPEN: allow test requests

# Usage:
def call_with_breaker(cb, call_dependency, fallback_response):
    if not cb.is_allowed():
        return fallback_response()  # fail fast while the breaker is open
    try:
        response = call_dependency()
        cb.reset()
        return response
    except Exception:
        cb.record_failure()
        return fallback_response()

cb = CircuitBreaker(failure_threshold=5, timeout_seconds=30)
Fallbacks and graceful degradation
Fallbacks offer alternative behaviors when dependencies fail so your service can preserve core value. Key concepts:
  • Graceful degradation — keep core functionality even if enhancements fail.
  • Partial availability — serve essential data or simplified UI.
  • Functional core vs enhancement shell — separate must-haves from optional features.
  • Progressive enhancement — enable extras only when resources permit.
Common fallbacks:
  • Return cached data when a backend is unavailable.
  • Provide simplified functionality or default values.
  • Queue work for later processing (retry/queueing).
  • Route to manual processes or notifications when automation fails.
Slide: "Fallbacks and Graceful Degradation", summarizing the four strategies above.
Slide: a service that cannot reach a dependency routes to an "Alternative behavior" box, which branches to the fallback options listed above.
Bulkheads
Bulkheads isolate resources and failure domains so problems in one area don’t sink the whole system. Implementations include:
  • Separate thread pools for different dependencies.
  • Deploying critical functions as independent services.
  • Dedicated databases or connection pools by domain.
  • Partitioning requests by user type or criticality.
These approaches reduce resource contention and limit blast radius.
Slide: "Dependency Management Strategies", showing the four bulkhead implementations above.
Benefits of bulkheads:
  • Prevents resource exhaustion from spreading.
  • Allows partial system functionality during outages.
  • Creates clear isolation boundaries.
  • Simplifies testing and deployment and improves overall resilience.
Slide: "Benefits of Bulkheads", showing the four benefits above.
Bulkhead examples:
  • Separate task queues for different workloads.
  • Independent connection pools per domain or service.
  • Per‑dependency resource limits (CPU, threads).
  • Isolated failure domains for critical vs non‑critical services.
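A simple in‑process bulkhead can be built from per‑dependency semaphores that cap concurrent calls; a sketch (the pool names and sizes are illustrative assumptions):

```python
import threading

# Bulkhead sketch: separate concurrency limits per dependency.
# Pool names and sizes are illustrative assumptions.
POOLS = {
    "database": threading.BoundedSemaphore(20),
    "email": threading.BoundedSemaphore(2),  # small pool: email can't starve the DB
}

class BulkheadFull(Exception):
    pass

def with_bulkhead(pool_name, fn):
    """Run fn only if the named pool has a free slot; otherwise fail fast."""
    sem = POOLS[pool_name]
    if not sem.acquire(blocking=False):
        raise BulkheadFull(pool_name)  # reject immediately instead of queueing
    try:
        return fn()
    finally:
        sem.release()
```

With this in place, a flood of slow email calls exhausts only the email pool and raises BulkheadFull, while the database pool keeps serving core requests.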
Real‑world example: Amazon S3 outage (Feb 28, 2017)
What happened: an engineer debugging a billing issue executed a command intended to remove a small set of servers from an S3 subsystem. Due to incorrect input, a much larger set was removed. The removal forced two critical subsystems (index and placement) to restart, making S3 unavailable in the US‑East‑1 region for about 3.5 hours. The outage’s blast radius extended to services that relied on S3 metadata and storage.
Slide: "Real-World Example: Amazon's 2017 S3 Outage", introducing the billing‑debugging scenario described above.
Slide: the two affected subsystems: the Index Subsystem (object metadata and locations serving GET, LIST, PUT, and DELETE) and the Placement Subsystem (allocates storage for new objects and relies on the index).
Removing significant capacity forced a full restart and caused S3 to be unavailable for approximately 3.5 hours. The outage affected multiple AWS services that depended on S3.
Slide: the outage's far‑reaching blast radius, with affected services including Amazon EC2, Amazon EBS, and AWS Lambda.
Lessons learned and mitigations
Even mature systems are vulnerable to simple human error. Amazon implemented several mitigations after the incident:
  • Throttle capacity-removal operations to avoid large accidental removals.
  • Add safeguards to block removals that would violate minimum capacity rules.
  • Improve operational tooling and guardrails to reduce human error.
  • Strengthen dashboards and health checks for faster, safer recovery.
  • Harden dependency and failover management to reduce blast radius.
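The first two mitigations above can be illustrated as a pre‑flight check that validates a capacity‑removal request before executing it (the function names, batch limit, and capacity rule are illustrative assumptions, not Amazon's actual tooling):

```python
# Illustrative guardrail for capacity removal; names and limits are assumptions.
class RemovalBlocked(Exception):
    pass

def plan_capacity_removal(active_servers, to_remove, min_capacity, max_batch=2):
    """Validate a removal request; raise RemovalBlocked rather than proceed unsafely."""
    to_remove = set(to_remove)
    # Throttle: never remove more than a small batch in one operation.
    if len(to_remove) > max_batch:
        raise RemovalBlocked(f"batch of {len(to_remove)} exceeds max of {max_batch}")
    # Safeguard: never drop below the minimum capacity the subsystem needs.
    remaining = len(set(active_servers) - to_remove)
    if remaining < min_capacity:
        raise RemovalBlocked(f"{remaining} servers would remain; minimum is {min_capacity}")
    return sorted(to_remove)
```

The point is that the check runs before any destructive action, so a mistyped argument produces a refusal instead of an outage.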
Slide: a reminder that human error can cause large‑scale outages, listing the throttling and minimum‑capacity safeguards above.
Wrap up
Managing dependencies requires deliberate mapping, prioritization, and design for failure. Apply these steps:
  • Map dependencies and classify them by criticality and blast radius.
  • Prioritize investments where risk and impact are highest.
  • Use patterns—circuit breakers, fallbacks, bulkheads, graceful degradation—to contain failures.
  • Automate guards and deploy operational tooling and observability to detect and recover quickly.
Slide: the sequence of post‑incident improvements: improved operational safety, faster and safer recovery, a more resilient service health dashboard, and stronger dependency/failover management.
In the next lesson, we’ll explore change management and safe deployment practices.