Guide to selecting, defining, and collecting service level indicators—availability, latency, errors, throughput, saturation—with Prometheus examples, instrumentation, and synthetic tests for user-focused reliability.
Service Level Indicators (SLIs) are the vital signs of your system: measurable signals that tell you whether users are getting the experience you promise. Choosing the right SLIs helps you detect problems quickly, prioritize fixes, and set meaningful reliability targets.

Think of picking SLIs like being a detective. Monitoring is your network of cameras and alarms that alerts you when a specific condition occurs. Observability is the forensic toolkit that lets you reconstruct what happened from traces, logs, and metrics. Even if an incident unfolds in an unexpected way, good observability lets you recreate the story and act on it.

Not every service is measured the same way. There are five primary SLI categories to consider:
Availability — percentage of requests successfully handled. Typical for APIs and web UIs.
Latency — response time distribution; measure with percentiles, not averages.
Errors — failure rate; be explicit about what counts as an error.
Throughput — amount of work processed (requests/sec, items/min).
Saturation — how close a resource is to its capacity (CPU, memory, queues).
Select SLIs that map directly to user experience — if you measure the wrong thing, you’ll optimize in the wrong direction.
SLIs at a glance
| SLI Type | What it measures | Common metric examples |
| --- | --- | --- |
| Availability | Fraction of valid requests that succeed | 2xx/total requests (%) |
| Latency | Speed of responses (tail behavior) | P95, P99 latency (s) |
| Errors | Fraction of failing requests | 5xx rate (%) |
| Throughput | Work completed per time window | requests/sec |
| Saturation | Resource usage relative to capacity | CPU%, queue depth |
Choosing the right SLI definition is critical. Small query mismatches (e.g., a wrong label, a wrong endpoint, or counting retries) can produce misleading SLIs and wrong decisions.

Prometheus / PromQL: common SLI query patterns

```promql
# Availability: successful requests divided by total requests (percentage)
sum(rate(http_requests_total{service="api", handler="/catalog", status_code=~"2.."}[5m]))
  /
sum(rate(http_requests_total{service="api", handler="/catalog"}[5m])) * 100

# Latency: 95th percentile from a histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate: errors divided by total requests
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Throughput: rate of requests processed
rate(http_requests_total[5m])

# Saturation example: CPU usage as fraction of total (non-idle)
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
sum(rate(node_cpu_seconds_total[5m]))
```
Choose queries that reflect your precise SLI definition: include the correct labels (service, handler, endpoint) and ensure the windows (e.g., 5m) match your use case. Tools like PromLens and the Grafana query builder make constructing and validating PromQL queries easier.

White-box monitoring gives the most precise SLI signals: instrument your code and infrastructure so metrics, logs, and traces flow to a centralized observability stack. White-box instrumentation answers questions like: how many requests succeed, how long they take, what resources they consume, and how a single request flows through services.

Availability
Success ratio (percentage of successful requests).
Prometheus examples for availability:
```promql
# Total requests rate
sum(rate(http_requests_total[5m]))

# Server-side errors (5xx) rate
sum(rate(http_requests_total{status_code=~"5.."}[5m]))

# Success ratio (2xx / total)
sum(rate(http_requests_total{status_code=~"2.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100
```
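The queries above assume the service exports an `http_requests_total` counter with `service`, `handler`, and `status_code` labels. A minimal white-box instrumentation sketch using the Python `prometheus_client` library; the `record_request` helper and the label values are illustrative, not part of any real Record Store codebase:

```python
from prometheus_client import Counter, CollectorRegistry, generate_latest

registry = CollectorRegistry()

# Counter carrying the labels the PromQL examples filter on.
# prometheus_client exposes it as http_requests_total.
HTTP_REQUESTS = Counter(
    "http_requests",
    "Total HTTP requests handled",
    ["service", "handler", "status_code"],
    registry=registry,
)

def record_request(handler: str, status_code: int) -> None:
    """Increment the request counter once per handled request."""
    HTTP_REQUESTS.labels(
        service="api", handler=handler, status_code=str(status_code)
    ).inc()

# Simulate a few handled requests.
record_request("/catalog", 200)
record_request("/catalog", 200)
record_request("/catalog", 500)

# This is the text format Prometheus scrapes from /metrics.
print(generate_latest(registry).decode())
```

In a real service the increment would live in request middleware, so every handler is counted consistently and the availability query never silently excludes an endpoint.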
Latency
Latency SLIs focus on user-perceived speed. Use histograms and percentiles (P95, P99) rather than averages to capture tail latency. A common latency SLI: the percentage of requests faster than X ms (e.g., 300 ms).
Common latency signals:
Histogram buckets for request durations (http_request_duration_seconds_bucket).
95th/99th percentile values (histogram_quantile).
Counts of requests exceeding an unacceptable threshold.
Prometheus percentile example:
```promql
# 99th percentile latency for the /search endpoint (5m window)
histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket{service="api", endpoint="/search"}[5m])) by (le)
)
```
Use PromLens or Grafana Query Builder to validate percentile queries and ensure the bucket set aligns with your SLI threshold.
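Bucket alignment matters because `histogram_quantile` interpolates linearly within a bucket: a "requests faster than 300 ms" SLI is only exact when 0.3 s is itself a bucket boundary. A sketch using the Python `prometheus_client` library with an explicit bucket set; the bucket values and observed durations are invented for illustration:

```python
from prometheus_client import Histogram, CollectorRegistry

registry = CollectorRegistry()

# Explicit buckets: 0.3 s is a boundary, so the 300 ms SLI is exact.
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["endpoint"],
    buckets=(0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.5),  # +Inf is added automatically
    registry=registry,
)

# Simulate observed request durations for /search.
for duration in (0.12, 0.25, 0.28, 0.45, 0.9):
    REQUEST_DURATION.labels(endpoint="/search").observe(duration)

# Fraction of requests faster than 300 ms, read straight from the buckets.
under_300ms = registry.get_sample_value(
    "http_request_duration_seconds_bucket", {"endpoint": "/search", "le": "0.3"}
)
total = registry.get_sample_value(
    "http_request_duration_seconds_count", {"endpoint": "/search"}
)
print(f"{under_300ms / total * 100:.0f}% of requests under 300 ms")
```

The same cumulative-bucket trick is what the `le="5.0"` order-processing query later in this lesson relies on.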
Error, Throughput, and Saturation

Error SLIs measure request failures. Make explicit what counts as an error (HTTP 5xx, application-level failures, retries exhausted) and track both failure rate and success rate for different audiences (engineering vs. SLO reporting).
Throughput SLIs show how much work a system completes in a time window. Drops in throughput can indicate backpressure, queueing, or dropped messages.
Saturation SLIs reveal resource headroom. Rising saturation is an early warning — as CPU, memory, or queue depth approaches limits, latency and error rates often follow. Use saturation metrics to drive autoscaling and preventive alerts.
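The three definitions above all reduce to simple ratios over a time window. A plain-Python illustration of how error rate, throughput, and saturation headroom would be derived from raw counter deltas; the sample numbers are invented:

```python
def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed over the window, as a percentage."""
    return failed / total * 100 if total else 0.0

def throughput(requests: int, window_seconds: float) -> float:
    """Work completed per second over the window."""
    return requests / window_seconds

def saturation(used: float, capacity: float) -> float:
    """Resource usage relative to capacity, as a percentage."""
    return used / capacity * 100

# Example: a 5-minute (300 s) window with invented numbers.
print(error_rate(12, 6000))    # share of requests that failed (%)
print(throughput(6000, 300))   # requests handled per second
print(saturation(75, 100))     # CPU usage relative to capacity (%)
```

In practice Prometheus's `rate()` performs the per-second delta for you; the point here is only that each SLI is a ratio whose numerator and denominator must be defined explicitly.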
Applying SLIs to a real app: KodeKloud Record Store

A practical example helps anchor SLI choices. The KodeKloud Record Store API exposes endpoints like product catalog, search, order creation, order status, and background processing. Each user journey is composed of multiple steps; if one step is slow or failing, the whole journey suffers.
For the Record Store API, high-value SLIs are availability and latency. Example target: X% of search queries finish within 300 ms.
Prometheus examples for the Record Store (availability and latency):
```promql
# Availability SLI for the catalog endpoint (percentage of 2xx responses)
sum(rate(http_requests_total{service="api", handler="/catalog", status_code=~"2.."}[5m]))
  /
sum(rate(http_requests_total{service="api", handler="/catalog"}[5m])) * 100

# Latency SLI (99th percentile) for the search endpoint
histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket{service="api", endpoint="/search"}[5m])) by (le)
)
```
User journeys span endpoints (products, orders, background processing). Define SLIs for each important step to avoid blind spots: one slow endpoint can degrade the entire journey.
Ordering journey example

When a user places an order, typical steps include POST /orders (create), background fulfillment (e.g., Celery), and status updates. Define SLIs for each step:
Order creation availability: percentage of POST /orders requests that succeed.
Order creation latency: how quickly the order is accepted/confirmed.
Order processing success rate: percentage of background tasks that complete successfully.
End-to-end processing time: percentage of orders processed within a target timeframe.
Prometheus examples for ordering SLIs:
```promql
# Order creation availability (percentage)
sum(rate(http_requests_total{handler="/orders", method="POST", status_code=~"2.."}[5m]))
  /
sum(rate(http_requests_total{handler="/orders", method="POST"}[5m])) * 100

# Order processing success rate (Celery)
sum(rate(celery_tasks_total{task_name="process_order", status="success"}[5m]))
  /
sum(rate(celery_tasks_total{task_name="process_order"}[5m])) * 100

# Percentage of orders processed within 5 seconds (example)
sum(rate(order_processing_time_seconds_bucket{job="order_processor", le="5.0"}[5m]))
  /
sum(rate(order_processing_time_seconds_count{job="order_processor"}[5m])) * 100
```
Collecting SLI data

Combine collection methods for robust coverage:
Application instrumentation — instrument code to expose metrics.
Expose /metrics so Prometheus can scrape application metrics. This provides precise, near real-time SLI signals from inside the service.

Synthetic monitoring (example)

Synthetic checks simulate user behavior and provide continuous health and latency measurements even when real traffic is low. An example bash loop:
```bash
#!/bin/bash
while true; do
  echo "Performing health check..."

  # Check API health endpoint
  response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/health)
  if [ "$response" -eq 200 ]; then
    echo "API is healthy (HTTP $response)"
  else
    echo "API is unhealthy (HTTP $response)"
    # In a real environment, this would trigger an alert
  fi

  # Check response time (simulating user experience)
  start_time=$(date +%s.%N)
  curl -s http://localhost:8000/ > /dev/null
  end_time=$(date +%s.%N)
  duration=$(echo "$end_time - $start_time" | bc -l)
  printf "Response time: %.3fs\n" "$duration"
  if (( $(echo "$duration > 1.0" | bc -l) )); then
    echo "Warning: Response time exceeds 1 second"
    # In a real environment, this would trigger an alert
  fi

  echo "-------------------------------"
  sleep 30
done
```
Use synthetic tests to validate SLOs during off-peak times and to catch regressions introduced by deployments.
Do not rely on a single data source. Combine application metrics, proxy metrics, client telemetry, and synthetic checks to avoid blind spots.
Summary and next steps

You now have the foundations to select, define, and collect SLIs:
Pick SLIs that reflect user experience (availability, latency, errors, throughput, saturation).
Implement precise queries and validate them with tools like PromLens and Grafana.
Instrument the application and combine collection methods to ensure complete coverage.
Define SLIs for each step in important user journeys to avoid blind spots.
Next: translate SLIs into Service Level Objectives (SLOs), actionable targets that balance user expectations with operational realities. We'll cover SLO strategy, error budgets, and how to integrate reliability into development workflows.