Fault Injection - KodeKloud

Circuit breaking and fault injection are essential resilience techniques for microservices. Fault injection lets you safely simulate downstream failures—such as slow databases or unresponsive services—so you can validate how your system behaves and recovers before real incidents occur. What happens if the product service is slow to respond? What if the homepage service starts returning errors? These are important scenarios to test when hardening your architecture for production.

A slide titled "Resilience" showing a puzzled cartoon person on the left and three numbered questions on the right about testing application behavior and what happens when services (Products, Homepage) or the database are slow. The slide is copyrighted to KodeKloud.

Just as organizations run periodic fire drills to prepare people for emergencies, fault injection in an Istio service mesh is a fire drill for your microservices: you intentionally inject delays or errors between services to observe real application behavior and validate recovery strategies.

A schematic diagram titled "Fault Injection" showing a Kubernetes cluster with an Istio control plane, two nodes with services and Envoy sidecars, and a large red X marking a failed component. Dashed arrows indicate traffic flow and icons at the bottom show alerts and degraded performance.

Why use fault injection?

Validate how applications respond to slow or failing dependencies.
Test fallback logic and ensure graceful degradation (e.g., return cached results, use a backup service, or return a friendly error).
Ensure timeouts and retry policies prevent cascading failures.
Detect bugs, misconfigurations, and missing error handling early in the CI/CD pipeline.
Build confidence in resilience through repeatable experiments (chaos engineering). A well-known example is Netflix’s Chaos Monkey.

Fault injection provides a safe, repeatable way to validate reliability, fallbacks, and recovery strategies before a real outage occurs.
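
Timeouts and retries are themselves set on a VirtualService route. The following is a minimal sketch, assuming a hypothetical caller-facing service named web-svc; the resource name, the 3-second timeout, and the retry values are illustrative assumptions rather than part of the lesson. Istio's API reference notes that timeouts and retries are not enabled on a route that also carries a fault, so a policy like this belongs on a separate route, such as that of the service calling the faulted dependency.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: web-vs                 # hypothetical name
  namespace: frontend
spec:
  hosts:
  - web-svc                    # hypothetical service that calls a slower dependency
  http:
  - route:
    - destination:
        host: web-svc.frontend.svc.cluster.local
    timeout: 3s                # overall deadline for the request, retries included
    retries:
      attempts: 2              # allow up to two retries
      perTryTimeout: 1s        # deadline for each individual attempt
      retryOn: 5xx,connect-failure
```

With a delay injected on the dependency, callers of web-svc should see a timeout error after roughly 3 seconds instead of waiting for the full delay.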

How Istio configures fault injection

Fault injection is not a separate Istio resource. Instead, you configure faults inside a VirtualService using the fault section on http routes. The two primary types are:

Delay: injects artificial latency (e.g., fixedDelay: 5s) for a percentage of requests.
Abort: returns an HTTP error code (e.g., httpStatus: 400) or a gRPC status for a percentage of requests.

Table: quick comparison

Fault Type	Purpose	Key fields	Typical use
Delay	Simulate increased latency	`fault.delay.fixedDelay`, `fault.delay.percentage.value`	Test timeouts, circuit breakers, and slow downstream services
Abort	Simulate error responses	`fault.abort.httpStatus` or `fault.abort.grpcStatus`, `fault.abort.percentage.value`	Test error handling, fallbacks, and retry logic

Example: inject a 5-second delay for 100% of requests to app-svc in the frontend namespace:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: app-vs
  namespace: frontend
spec:
  hosts:
  - app-svc
  http:
  - fault:
      delay:
        percentage:
          value: 100.0
        fixedDelay: 5s
    route:
    - destination:
        host: app-svc.frontend.svc.cluster.local
        port:
          number: 80
        subset: v1
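
The route above sends traffic to subset v1, which only resolves if a DestinationRule defines that subset for the host. A minimal sketch, assuming the v1 pods carry a version: v1 label; the label and the resource name app-dr are assumptions, not part of the lesson:

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: app-dr                 # hypothetical name
  namespace: frontend
spec:
  host: app-svc.frontend.svc.cluster.local
  subsets:
  - name: v1                   # must match the subset named in the VirtualService route
    labels:
      version: v1              # assumed pod label; adjust to your deployment
```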

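Example: abort 50% of requests with HTTP 400 to test client error handling. This VirtualService targets the same host as app-vs, so apply one or the other rather than both; Istio only merges VirtualServices for the same host when they are bound to a gateway: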

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: app-vs-abort
  namespace: frontend
spec:
  hosts:
  - app-svc
  http:
  - fault:
      abort:
        percentage:
          value: 50.0
        httpStatus: 400
    route:
    - destination:
        host: app-svc.frontend.svc.cluster.local
        port:
          number: 80
        subset: v1
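
Delay and abort can also live in a single fault block, which avoids defining two VirtualServices for the same host. The sketch below is an illustration under assumptions: the name app-vs-combined, the 2-second delay, the 503 status, and the percentages are not from the lesson. Istio treats the two faults as independent of one another even when both are specified.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: app-vs-combined        # hypothetical name
  namespace: frontend
spec:
  hosts:
  - app-svc
  http:
  - fault:
      delay:
        percentage:
          value: 10.0          # delay 10% of requests
        fixedDelay: 2s
      abort:
        percentage:
          value: 5.0           # abort 5% of requests
        httpStatus: 503
        # For a gRPC service, grpcStatus: "UNAVAILABLE" could replace httpStatus.
    route:
    - destination:
        host: app-svc.frontend.svc.cluster.local
        port:
          number: 80
        subset: v1
```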

Additional common scenarios

Abort a small percentage to simulate intermittent client errors.
Inject delays only for traffic from specific sources (e.g., workloads labeled env: prod) to limit blast radius.

Examples:

ratings-route: abort 10% of requests with HTTP 400.
reviews-route: inject a 5-second delay for 10% of requests coming from sources labeled env: prod.

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings-route
spec:
  hosts:
  - ratings.prod.svc.cluster.local
  http:
  - route:
    - destination:
        host: ratings.prod.svc.cluster.local
        subset: v1
    fault:
      abort:
        percentage:
          value: 10.0
        httpStatus: 400

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews-route
spec:
  hosts:
  - reviews.prod.svc.cluster.local
  http:
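  # Delay 10% of requests that come from workloads labeled env: prod.
  - match:
    - sourceLabels:
        env: prod
    route:
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1
    fault:
      delay:
        percentage:
          value: 10.0
        fixedDelay: 5s
  # Catch-all route so requests from other sources still reach reviews, without a fault.
  - route:
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1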

Important notes

Use percentage.value to control the blast radius. Start small (e.g., 1–5%) when testing in production-like environments.
Combine fault injection with observability (metrics, logs, tracing) so you can measure impact and validate fallback behavior.
Prefer targeted matches (source labels, headers, or subsets) to limit scope.

Be careful when running fault injection in production. Always limit the impact with percentage-based targeting, source filters, and monitoring. Do not enable broad, 100% faults against critical services without coordinated rollback plans.
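
One way to apply the targeting advice above is a header match, so only requests that opt in see the fault. A minimal sketch, assuming a hypothetical x-fault-test header and resource name; neither comes from the lesson. Requests carrying the header get the delay, and everything else falls through to the unmodified default route.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews-header-fault   # hypothetical name
spec:
  hosts:
  - reviews.prod.svc.cluster.local
  http:
  - match:
    - headers:
        x-fault-test:          # hypothetical opt-in header
          exact: "true"
    fault:
      delay:
        percentage:
          value: 100.0         # every opted-in request is delayed
        fixedDelay: 5s
    route:
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1
  - route:                     # default route, no fault
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1
```

For the header to reach reviews through intermediate services, those services need to forward it on their outbound calls.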

For more details, examples, and edge-case behaviors, see the following resources:

Istio Fault Injection: https://istio.io/latest/docs/tasks/traffic-management/fault-injection/
Chaos engineering inspiration: Netflix Chaos Monkey — https://netflix.github.io/chaosmonkey/

Let’s head over to a demo.
