Guide to selecting, defining, and collecting service level indicators—availability, latency, errors, throughput, saturation—with Prometheus examples, instrumentation, and synthetic tests for user-focused reliability.
Service Level Indicators (SLIs) are the vital signs of your system: measurable signals that tell you whether users are getting the experience you promise. Choosing the right SLIs helps you detect problems quickly, prioritize fixes, and set meaningful reliability targets.

Think of picking SLIs like being a detective. Monitoring is your network of cameras and alarms that alerts you when a specific condition occurs. Observability is the forensic toolkit that lets you reconstruct what happened from traces, logs, and metrics. Even if an incident unfolds in an unexpected way, good observability lets you recreate the story and act on it.

Not every service is measured the same way. There are five primary SLI categories to consider:
Availability — percentage of requests successfully handled. Typical for APIs and web UIs.
Latency — response time distribution; measure with percentiles, not averages.
Errors — failure rate; be explicit about what counts as an error.
Throughput — amount of work processed (requests/sec, items/min).
Saturation — how close a resource is to its capacity (CPU, memory, queues).
Select SLIs that map directly to user experience — if you measure the wrong thing, you’ll optimize in the wrong direction.
SLIs at a glance
| SLI Type | What it measures | Common metric examples |
| --- | --- | --- |
| Availability | Fraction of valid requests that succeed | 2xx/total requests (%) |
| Latency | Speed of responses (tail behavior) | P95, P99 latency (s) |
| Errors | Fraction of failing requests | 5xx rate (%) |
| Throughput | Work completed per time window | requests/sec |
| Saturation | Resource usage relative to capacity | CPU%, queue depth |
Choosing the right SLI definition is critical. Small query mismatches (e.g., a wrong label, a wrong endpoint, or counting retries) can produce misleading SLIs and wrong decisions.

Prometheus / PromQL: common SLI query patterns

```promql
# Availability: successful requests divided by total requests (percentage)
sum(rate(http_requests_total{service="api", handler="/catalog", status_code=~"2.."}[5m]))
  /
sum(rate(http_requests_total{service="api", handler="/catalog"}[5m])) * 100

# Latency: 95th percentile from a histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate: errors divided by total requests
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Throughput: rate of requests processed
rate(http_requests_total[5m])

# Saturation example: CPU usage as fraction of total (non-idle)
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
sum(rate(node_cpu_seconds_total[5m]))
```
Choose queries that reflect your precise SLI definition: include the correct labels (service, handler, endpoint) and ensure the windows (e.g., 5m) match your use case. Tools like PromLens and the Grafana query builder make constructing and validating PromQL queries easier.

White-box monitoring gives the most precise SLI signals: instrument your code and infrastructure so metrics, logs, and traces flow to a centralized observability stack. White-box instrumentation answers questions like: how many requests succeed, how long they take, what resources they consume, and how a single request flows through services.

Availability
Success ratio (percentage of successful requests).
Prometheus examples for availability:
```promql
# Total requests rate
sum(rate(http_requests_total[5m]))

# Server-side errors (5xx) rate
sum(rate(http_requests_total{status_code=~"5.."}[5m]))

# Success ratio (2xx / total)
sum(rate(http_requests_total{status_code=~"2.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100
```
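The queries above assume the service exports an `http_requests_total` counter with `service`, `handler`, and `status_code` labels. A minimal white-box instrumentation sketch using the Python `prometheus_client` library; the `record_request` helper and the label values are illustrative, not part of any real Record Store codebase:

```python
from prometheus_client import Counter, CollectorRegistry, generate_latest

registry = CollectorRegistry()

# Counter carrying the labels the PromQL examples filter on.
# prometheus_client exposes it as http_requests_total.
HTTP_REQUESTS = Counter(
    "http_requests",
    "Total HTTP requests handled",
    ["service", "handler", "status_code"],
    registry=registry,
)

def record_request(handler: str, status_code: int) -> None:
    """Increment the request counter once per handled request."""
    HTTP_REQUESTS.labels(
        service="api", handler=handler, status_code=str(status_code)
    ).inc()

# Simulate a few handled requests.
record_request("/catalog", 200)
record_request("/catalog", 200)
record_request("/catalog", 500)

# This is the text format Prometheus scrapes from /metrics.
print(generate_latest(registry).decode())
```

In a real service the increment would live in request middleware, so every handler is counted consistently and the availability query never silently excludes an endpoint.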
Latency
Latency SLIs focus on user-perceived speed. Use histograms and percentiles (P95, P99) rather than averages to capture tail latency. A common latency SLI: the percentage of requests faster than X ms (e.g., 300 ms).
Common latency signals:
Histogram buckets for request durations (http_request_duration_seconds_bucket).
95th/99th percentile values (histogram_quantile).
Counts of requests exceeding an unacceptable threshold.
Prometheus percentile example:
```promql
# 99th percentile latency for the /search endpoint (5m window)
histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket{service="api", endpoint="/search"}[5m])) by (le)
)
```
Use PromLens or Grafana Query Builder to validate percentile queries and ensure the bucket set aligns with your SLI threshold.
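Bucket alignment matters because `histogram_quantile` interpolates linearly within a bucket: a "requests faster than 300 ms" SLI is only exact when 0.3 s is itself a bucket boundary. A sketch using the Python `prometheus_client` library with an explicit bucket set; the bucket values and observed durations are invented for illustration:

```python
from prometheus_client import Histogram, CollectorRegistry

registry = CollectorRegistry()

# Explicit buckets: 0.3 s is a boundary, so the 300 ms SLI is exact.
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["endpoint"],
    buckets=(0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.5),  # +Inf is added automatically
    registry=registry,
)

# Simulate observed request durations for /search.
for duration in (0.12, 0.25, 0.28, 0.45, 0.9):
    REQUEST_DURATION.labels(endpoint="/search").observe(duration)

# Fraction of requests faster than 300 ms, read straight from the buckets.
under_300ms = registry.get_sample_value(
    "http_request_duration_seconds_bucket", {"endpoint": "/search", "le": "0.3"}
)
total = registry.get_sample_value(
    "http_request_duration_seconds_count", {"endpoint": "/search"}
)
print(f"{under_300ms / total * 100:.0f}% of requests under 300 ms")
```

The same cumulative-bucket trick is what the `le="5.0"` order-processing query later in this lesson relies on.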
Error, Throughput, and Saturation

Error SLIs measure request failures. Make explicit what counts as an error (HTTP 5xx, application-level failures, retries exhausted) and track both failure rate and success rate for different audiences (engineering vs. SLO reporting).
Throughput SLIs show how much work a system completes in a time window. Drops in throughput can indicate backpressure, queueing, or dropped messages.
Saturation SLIs reveal resource headroom. Rising saturation is an early warning — as CPU, memory, or queue depth approaches limits, latency and error rates often follow. Use saturation metrics to drive autoscaling and preventive alerts.
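The three definitions above all reduce to simple ratios over a time window. A plain-Python illustration of how error rate, throughput, and saturation headroom would be derived from raw counter deltas; the sample numbers are invented:

```python
def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed over the window, as a percentage."""
    return failed / total * 100 if total else 0.0

def throughput(requests: int, window_seconds: float) -> float:
    """Work completed per second over the window."""
    return requests / window_seconds

def saturation(used: float, capacity: float) -> float:
    """Resource usage relative to capacity, as a percentage."""
    return used / capacity * 100

# Example: a 5-minute (300 s) window with invented numbers.
print(error_rate(12, 6000))    # share of requests that failed (%)
print(throughput(6000, 300))   # requests handled per second
print(saturation(75, 100))     # CPU usage relative to capacity (%)
```

In practice Prometheus's `rate()` performs the per-second delta for you; the point here is only that each SLI is a ratio whose numerator and denominator must be defined explicitly.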
Applying SLIs to a real app: KodeKloud Record Store

A practical example helps anchor SLI choices. The KodeKloud Record Store API exposes endpoints like product catalog, search, order creation, order status, and background processing. Each user journey is composed of multiple steps; if one step is slow or failing, the whole journey suffers.
For the Record Store API, high-value SLIs are availability and latency. Example target: X% of search queries finish within 300 ms.
Prometheus examples for the Record Store (availability and latency):
```promql
# Availability SLI for the catalog endpoint (percentage of 2xx responses)
sum(rate(http_requests_total{service="api", handler="/catalog", status_code=~"2.."}[5m]))
  /
sum(rate(http_requests_total{service="api", handler="/catalog"}[5m])) * 100

# Latency SLI (99th percentile) for the search endpoint
histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket{service="api", endpoint="/search"}[5m])) by (le)
)
```
User journeys span endpoints (products, orders, background processing). Define SLIs for each important step to avoid blind spots: one slow endpoint can degrade the entire journey.
Ordering journey example

When a user places an order, typical steps include POST /orders (create), background fulfillment (e.g., Celery), and status updates. Define SLIs for each step:
Order creation availability: percentage of POST /orders requests that succeed.
Order creation latency: how quickly the order is accepted/confirmed.
Order processing success rate: percentage of background tasks that complete successfully.
End-to-end processing time: percentage of orders processed within a target timeframe.
Prometheus examples for ordering SLIs:
```promql
# Order creation availability (percentage)
sum(rate(http_requests_total{handler="/orders", method="POST", status_code=~"2.."}[5m]))
  /
sum(rate(http_requests_total{handler="/orders", method="POST"}[5m])) * 100

# Order processing success rate (Celery)
sum(rate(celery_tasks_total{task_name="process_order", status="success"}[5m]))
  /
sum(rate(celery_tasks_total{task_name="process_order"}[5m])) * 100

# Percentage of orders processed within 5 seconds (example)
sum(rate(order_processing_time_seconds_bucket{job="order_processor", le="5.0"}[5m]))
  /
sum(rate(order_processing_time_seconds_count{job="order_processor"}[5m])) * 100
```
Collecting SLI data

Combine collection methods for robust coverage:
Application instrumentation — instrument code to expose metrics.
Expose /metrics so Prometheus can scrape application metrics. This provides precise, near real-time SLI signals from inside the service.

Synthetic monitoring (example)

Synthetic checks simulate user behavior and provide continuous health and latency measurements even when real traffic is low. An example bash loop:
```bash
#!/bin/bash
while true; do
  echo "Performing health check..."

  # Check API health endpoint
  response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/health)
  if [ "$response" -eq 200 ]; then
    echo "API is healthy (HTTP $response)"
  else
    echo "API is unhealthy (HTTP $response)"
    # In a real environment, this would trigger an alert
  fi

  # Check response time (simulating user experience)
  start_time=$(date +%s.%N)
  curl -s http://localhost:8000/ > /dev/null
  end_time=$(date +%s.%N)
  duration=$(echo "$end_time - $start_time" | bc -l)
  printf "Response time: %.3fs\n" "$duration"
  if (( $(echo "$duration > 1.0" | bc -l) )); then
    echo "Warning: Response time exceeds 1 second"
    # In a real environment, this would trigger an alert
  fi

  echo "-------------------------------"
  sleep 30
done
```
Use synthetic tests to validate SLOs during off-peak times and to catch regressions introduced by deployments.
Do not rely on a single data source. Combine application metrics, proxy metrics, client telemetry, and synthetic checks to avoid blind spots.
Summary and next steps

You now have the foundations to select, define, and collect SLIs:
Pick SLIs that reflect user experience (availability, latency, errors, throughput, saturation).
Implement precise queries and validate them with tools like PromLens and Grafana.
Instrument the application and combine collection methods to ensure complete coverage.
Define SLIs for each step in important user journeys to avoid blind spots.
Next: translate SLIs into Service Level Objectives (SLOs), actionable targets that balance user expectations with operational realities. We'll cover SLO strategy, error budgets, and how to integrate reliability into development workflows.