Circuit breaking and fault injection are essential resilience techniques for microservices. Fault injection lets you safely simulate downstream failures—such as slow databases or unresponsive services—so you can validate how your system behaves and recovers before real incidents occur. What happens if the product service is slow to respond? What if the homepage service starts returning errors? These are important scenarios to test when hardening your architecture for production.
A slide titled "Resilience" showing a puzzled cartoon person on the left and three numbered questions on the right about testing application behavior and what happens when services (Products, Homepage) or the database are slow. The slide is copyrighted to KodeKloud.
If your organization runs periodic fire drills to prepare people for emergencies, fault injection in an Istio service mesh is the equivalent for your microservices: intentionally inject delays or errors between services to observe real application behavior and validate recovery strategies.
A schematic diagram titled "Fault Injection" showing a Kubernetes cluster with an Istio control plane, two nodes with services and Envoy sidecars, and a large red X marking a failed component. Dashed arrows indicate traffic flow and icons at the bottom show alerts and degraded performance.
Why use fault injection?
  • Validate how applications respond to slow or failing dependencies.
  • Test fallback logic and ensure graceful degradation (e.g., return cached results, use a backup service, or return a friendly error).
  • Ensure timeouts and retry policies prevent cascading failures.
  • Detect bugs, misconfigurations, and missing error handling early in the CI/CD pipeline.
  • Build confidence in resilience through repeatable experiments (chaos engineering). A well-known example is Netflix’s Chaos Monkey.
A presentation slide titled "Why Use Fault Injection?" showing five numbered reasons: testing how the app handles failure, testing fallback logic, checking timeouts and retries, detecting weak spots, and practicing chaos engineering.
Fault injection provides a safe, repeatable way to validate reliability, fallbacks, and recovery strategies before a real outage occurs.
How Istio configures fault injection
Fault injection is not a separate Istio resource. Instead, you configure faults inside a VirtualService using the fault section on http routes. The two primary types are:
  • Delay: injects artificial latency (e.g., fixedDelay: 5s) for a percentage of requests.
  • Abort: returns an HTTP error code (e.g., httpStatus: 400) or a gRPC status for a percentage of requests.
Table: quick comparison
| Fault Type | Purpose | Key fields | Typical use |
|---|---|---|---|
| Delay | Simulate increased latency | fault.delay.fixedDelay, percentage.value | Test timeouts, circuit breakers, and slow downstream services |
| Abort | Simulate error responses | fault.abort.httpStatus or fault.abort.grpcStatus, percentage.value | Test error handling, fallbacks, and retry logic |
Example: inject a 5-second delay for 100% of requests to app-svc in the frontend namespace:
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: app-vs
  namespace: frontend
spec:
  hosts:
  - app-svc
  http:
  - fault:
      delay:
        percentage:
          value: 100.0
        fixedDelay: 5s
    route:
    - destination:
        host: app-svc.frontend.svc.cluster.local
        port:
          number: 80
        subset: v1
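The route above sends traffic to subset v1, which only resolves if a DestinationRule defines that subset for the same host. A minimal sketch follows, assuming the v1 pods carry a `version: v1` label; the resource name `app-svc-dr` and the label are illustrative assumptions, not taken from the original example.

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: app-svc-dr            # hypothetical name
  namespace: frontend
spec:
  host: app-svc.frontend.svc.cluster.local
  subsets:
  - name: v1                  # referenced by "subset: v1" in the VirtualService
    labels:
      version: v1             # assumed pod label; adjust to match your Deployment
```

If no DestinationRule defines the subset, requests routed to it are likely to fail with 503 errors, so it helps to apply this before the VirtualService.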
Example: abort 50% of requests with HTTP 400 to test client error handling:
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: app-vs-abort
  namespace: frontend
spec:
  hosts:
  - app-svc
  http:
  - fault:
      abort:
        percentage:
          value: 50.0
        httpStatus: 400
    route:
    - destination:
        host: app-svc.frontend.svc.cluster.local
        port:
          number: 80
        subset: v1
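The abort above uses an HTTP status. For gRPC services, the abort can return a gRPC status through the grpcStatus field instead. The sketch below assumes app-svc serves gRPC on its only port and that a 10% failure rate is acceptable for the test; the resource name `app-vs-grpc-abort` is hypothetical.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: app-vs-grpc-abort     # hypothetical name
  namespace: frontend
spec:
  hosts:
  - app-svc
  http:
  - fault:
      abort:
        percentage:
          value: 10.0
        # Status name in upper case (e.g., UNAVAILABLE), not the numeric code.
        grpcStatus: UNAVAILABLE
    route:
    - destination:
        host: app-svc.frontend.svc.cluster.local
```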
Additional common scenarios
  • Abort a small percentage to simulate intermittent client errors.
  • Inject delays only for traffic from specific sources (e.g., workloads labeled env: prod) to limit blast radius.
Examples:
  • ratings-route: abort 10% of requests with HTTP 400.
  • reviews-route: inject a 5-second delay for 10% of requests coming from sources labeled env: prod.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings-route
spec:
  hosts:
  - ratings.prod.svc.cluster.local
  http:
  - route:
    - destination:
        host: ratings.prod.svc.cluster.local
        subset: v1
    fault:
      abort:
        percentage:
          value: 10.0
        httpStatus: 400
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews-route
spec:
  hosts:
  - reviews.prod.svc.cluster.local
  http:
  - match:
    - sourceLabels:
        env: prod
    route:
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1
    fault:
      delay:
        percentage:
          value: 10.0
        fixedDelay: 5s
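Faults like these are most useful when a caller has a timeout or retry policy to exercise. The sketch below assumes a hypothetical service `orders.prod.svc.cluster.local` that calls ratings and reviews; its name and the specific timeout and retry values are illustrative assumptions. The policy sits in its own VirtualService because Istio's troubleshooting documentation notes that fault injection and retry or timeout policies are not supported together on the same VirtualService.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-route          # hypothetical caller-side service
spec:
  hosts:
  - orders.prod.svc.cluster.local
  http:
  - route:
    - destination:
        host: orders.prod.svc.cluster.local
    # Overall deadline for a request to orders, including all retries.
    timeout: 3s
    retries:
      attempts: 2             # retry a failed request up to two more times
      perTryTimeout: 1s       # deadline for each individual attempt
      retryOn: 5xx            # retry only on 5xx responses
```

With the 5-second delay injected on reviews, a request to orders that depends on a delayed reviews call should hit these deadlines and fail fast rather than waiting out the full delay, which is the behavior the experiment is meant to confirm.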
Important notes
  • Use percentage.value to control the blast radius. Start small (e.g., 1–5%) when testing in production-like environments.
  • Combine fault injection with observability (metrics, logs, tracing) so you can measure impact and validate fallback behavior.
  • Prefer targeted matches (source labels, headers, or subsets) to limit scope.
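As a sketch of a header-targeted match, the VirtualService below aborts only requests that carry a test header and leaves all other traffic untouched through a default route. The header name `x-fault-test`, the resource name, and the 503 status are assumptions chosen for illustration.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings-fault-scoped  # hypothetical name
spec:
  hosts:
  - ratings.prod.svc.cluster.local
  http:
  # Rule 1: only requests with the (hypothetical) test header see the fault.
  - match:
    - headers:
        x-fault-test:
          exact: "true"
    fault:
      abort:
        percentage:
          value: 100.0
        httpStatus: 503
    route:
    - destination:
        host: ratings.prod.svc.cluster.local
        subset: v1
  # Rule 2: default route, evaluated when rule 1 does not match.
  - route:
    - destination:
        host: ratings.prod.svc.cluster.local
        subset: v1
```

Rules are evaluated in order, so the default route must come last. The header only reaches ratings if each intermediate service forwards it on its outgoing calls.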
Be careful when running fault injection in production. Always limit the impact with percentage-based targeting, source filters, and monitoring. Do not enable broad, 100% faults against critical services without coordinated rollback plans.
For more details, examples, and edge-case behaviors, see the official Istio documentation. Let’s head over to a demo.
