Welcome. This article explains how reliability is measured using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs). It also describes the telemetry and monitoring practices you need to collect meaningful data and act on it. Before you monitor or improve reliability, clarify the business goals that measurements will support. Two complementary disciplines organize those measurements:
  • Monitoring — provides the quantitative measurements used to check service health and SLO compliance.
  • Observability — provides the context (metrics, logs, traces) required to answer why things fail.
An infographic slide titled "Monitoring vs Observability — Foundation of Reliability Measurement" comparing two approaches: Monitoring (left) with an illustration of a person checking dashboards and the caption about measurements serving business goals, and Observability (right) with a person using a magnifying glass over charts and the caption about pulling metrics, logs, and traces to explain system behavior.

Core concepts: SLIs, SLOs, SLAs

These three concepts form the foundation of reliability measurement. Keep them distinct and linked:
  • SLAs (Service Level Agreements) — formal, often contractual promises to customers. SLAs frequently include financial or business consequences if missed. Because they are customer-facing, SLAs are typically less aggressive than internal targets.
    • Example: 99.9% availability guaranteed; credits issued if availability dips below that.
  • SLOs (Service Level Objectives) — internal reliability targets teams set to guide engineering and operations. SLOs answer: how reliable should this service be?
    • Examples: 99.9% successful requests over a 30-day window; 95% of requests complete in under 200 ms.
    • SLOs are time-bound, measurable, and drive operational behavior (alerts, prioritization, error budgets).
  • SLIs (Service Level Indicators) — the measurable signals that reflect user experience. SLIs are the raw metrics you measure to determine SLO compliance.
    • Examples: request success rate, latency percentiles, error counts, throughput.
    • SLIs must be quantitative and user-focused.
SLOs are internal targets; SLAs are external promises. Set SLOs more aggressively than SLAs to maintain a buffer between internal goals and customer-facing guarantees.
| Resource | Purpose | Example |
| --- | --- | --- |
| SLI | Measurement that reflects user experience | p99_latency < 500ms, success_rate = 99.9% |
| SLO | Internal target to drive operations and decisions | 99.9% success over 30 days |
| SLA | External, contractual guarantee | 99.9% availability with financial credits on breach |
A slide showing a service reliability hierarchy pyramid for SLAs, SLOs, and SLIs with short definitions (SLAs: formal commitments with business consequences; SLOs: reliability targets; SLIs: metrics measuring service reliability). A color-coded legend on the right labels them as External Promises, Internal Targets, and Measurements.
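To make the relationship concrete, here is a minimal Python sketch that checks one availability SLI against an internal SLO and a looser external SLA. All numbers and names are illustrative, not from any specific platform:

```python
# Minimal sketch: checking an SLI against internal SLO and external SLA targets.
# The values below are illustrative.

def availability_sli(success_count: int, total_count: int) -> float:
    """SLI: fraction of successful requests over the measurement window."""
    return success_count / total_count

SLO = 0.999   # internal target (more aggressive)
SLA = 0.995   # external contractual promise (looser, leaving a buffer)

sli = availability_sli(success_count=998_700, total_count=1_000_000)

print(f"SLI: {sli:.4%}")
print(f"SLO met: {sli >= SLO}")   # drives alerts and internal prioritization
print(f"SLA met: {sli >= SLA}")   # a breach here has contractual consequences
```

Keeping the SLO stricter than the SLA means an SLO miss triggers internal action well before the contractual guarantee is at risk.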

How monitoring and observability work together

  • Monitoring supplies the raw measurements (metrics and computed SLIs) and answers “what” — are we within thresholds, is the SLO met, is the error budget being consumed?
  • Observability supplies context to answer “why” — traces and logs let you debug unknown failure modes and correlate data across systems.
Monitoring tells you if the system is healthy; observability helps you determine why it is unhealthy. Both are necessary.
A slide titled "Monitoring and Observability Working Together" showing a flow from Goal → SLI → SLO. From SLO three arrows lead to green boxes labeled "Alert," "Use to make decisions," and "Create a buffer relative to the SLA."
A business goal defines direction. From that goal you pick SLIs that map to user value, define SLOs to set reliability boundaries, and then implement tooling and processes that reflect those objectives: alerts, burn-rate thresholds, error budgets, and prioritization rules balancing feature velocity and stability. Revisit SLOs as usage and application behavior evolve. Ideally, observability means watching real users and understanding their goals; where that is not practical, three telemetry types cover most needs: metrics, logs, and traces.
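The error-budget idea mentioned above can be expressed in a few lines. A sketch with illustrative numbers and hypothetical function names:

```python
# Illustrative sketch: error budgets derived from an SLO.
# With a 99.9% SLO, the error budget is the 0.1% of requests allowed
# to fail over the window before the objective is breached.

def error_budget(slo: float, total_requests: int) -> float:
    """Total failures the SLO permits over the window."""
    return (1.0 - slo) * total_requests

def budget_consumed(failed_requests: int, slo: float, total_requests: int) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return failed_requests / error_budget(slo, total_requests)

budget = error_budget(slo=0.999, total_requests=10_000_000)  # 10,000 allowed failures
spent = budget_consumed(failed_requests=2_500, slo=0.999, total_requests=10_000_000)
print(f"Budget: {budget:.0f} failures, consumed: {spent:.0%}")
```

A team that has spent 25% of its budget early in the window might slow feature rollouts; a team with budget to spare can take more release risk.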
A slide titled "The Three Data Types for Reliability Measurements" showing three colorful triangular icons arranged in a triangle labeled Metrics (top), Traces (left), and Logs (right).

Telemetry types and their roles

  • Metrics — numerical measurements sampled over time. Metrics form the backbone of SLIs and SLO checks. Use metrics for trend detection, anomaly detection, and error-budget accounting.
    • Common SLI metrics:
      • Request success rate / availability
      • Latency percentiles (p50, p95, p99)
      • Throughput (requests/sec)
      • Error counts
    • Example Prometheus-style exposition:
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP celery_tasks_total Number of Celery tasks executed
# TYPE celery_tasks_total counter
# HELP celery_task_failures_total Number of Celery task failures
# TYPE celery_task_failures_total counter
# HELP celery_task_duration_seconds Task execution time in seconds
# TYPE celery_task_duration_seconds histogram
# HELP http_requests_total Total HTTP Requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/metrics",method="GET",status_code="200"} 15
http_requests_total{endpoint="/health",method="GET",status_code="200"} 14
http_requests_total{endpoint="/favicon.ico",method="GET",status_code="404"} 1
http_requests_total{endpoint="/docs",method="GET",status_code="200"} 1
http_requests_total{endpoint="/openapi.json",method="GET",status_code="200"} 1
# HELP http_requests_created Total HTTP Requests timestamp
# TYPE http_requests_created gauge
http_requests_created{endpoint="/metrics",method="GET",status_code="200"} 1.7449748268583556e+09
http_requests_created{endpoint="/health",method="GET",status_code="200"} 1.744974834539342e+09
http_requests_created{endpoint="/favicon.ico",method="GET",status_code="404"} 1.7449749339448135e+09
http_requests_created{endpoint="/docs",method="GET",status_code="200"} 1.7449750247689226e+09
http_requests_created{endpoint="/openapi.json",method="GET",status_code="200"} 1.7449750251944983e+09
# HELP http_request_duration_seconds HTTP Request Duration in seconds
# TYPE http_request_duration_seconds histogram
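In practice an SLI like success rate is computed by the monitoring system (for Prometheus, a PromQL query over `http_requests_total`). The following Python sketch shows the same arithmetic on hand-copied samples from the exposition above; treating 404s as successes is a definitional choice, not a rule:

```python
# Sketch: deriving a success-rate SLI from counter samples like the
# http_requests_total series above. In a real setup this would be a PromQL
# query; here it is plain Python for illustration.

samples = {
    # (endpoint, method, status_code) -> cumulative count
    ("/metrics", "GET", "200"): 15,
    ("/health", "GET", "200"): 14,
    ("/favicon.ico", "GET", "404"): 1,
    ("/docs", "GET", "200"): 1,
    ("/openapi.json", "GET", "200"): 1,
}

def success_rate(samples: dict, error_prefix: str = "5") -> float:
    """SLI: share of requests whose status is not a server error (5xx)."""
    total = sum(samples.values())
    errors = sum(v for (_, _, code), v in samples.items()
                 if code.startswith(error_prefix))
    return (total - errors) / total

print(f"success-rate SLI: {success_rate(samples):.2%}")  # 404 counts as success here
```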
  • Logs — timestamped event records that provide rich context: error stacks, request/response payloads, and state transitions. Logs validate metrics and are essential for root-cause analysis. Example log entries:
> 2025-04-18 12:15:33.000 {"container_name":"/kodekloud-record-store-api","source":"stderr","log":"{\"message\": \"http_error\", \"level\": \"ERROR\", \"trace_id\": \"c7bfc8714e3720b74732fa905609705a\", \"span_id\": \"3fc595b17c3a83c8\", \"method\": \"GET\", \"endpoint\": \"/favicon.ico\", \"status_code\": 404, \"duration_ms\": 1.71}", "container_id":"2e81ab28c31116a274347a761369610ebe21e08103f2aa66cd86dd0570ac8d36"}
> 2025-04-18 12:13:37.000 {"source":"stderr","log":"{\"message\": \"Test error log\", \"level\": \"ERROR\", \"trace_id\": \"d5050e46f1e150a21145fe58b15aff89\", \"span_id\": \"cb34a453a247be9b\", \"error_type\": \"SimulatedError\", \"operation\": \"error_test\"}", "container_id":"2e81ab28c31116a274347a761369610ebe21e08103f2aa66cd86dd0570ac8d36","container_name":"/kodekloud-record-store-api"}
A presentation slide titled "The Three Data Types for Reliability Measurements" highlighting "Logs" as a data type. It lists examples relevant for SLIs: error logs, access logs, and service logs.
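Logs like the examples above are typically emitted as structured JSON so they can be parsed and correlated with traces. A sketch using Python's standard `logging` and `json` modules; the field names mirror the sample entries, and the trace/span IDs here are random stand-ins:

```python
# Sketch of structured JSON logging similar to the entries above.
# Field names (trace_id, span_id, duration_ms) mirror the example logs;
# the IDs are randomly generated stand-ins, not real trace context.
import json
import logging
import secrets

logger = logging.getLogger("record-store-api")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_request(endpoint: str, method: str, status_code: int, duration_ms: float) -> str:
    entry = json.dumps({
        "message": "http_error" if status_code >= 400 else "http_request",
        "level": "ERROR" if status_code >= 400 else "INFO",
        "trace_id": secrets.token_hex(16),  # 32 hex chars, as in the samples
        "span_id": secrets.token_hex(8),
        "method": method,
        "endpoint": endpoint,
        "status_code": status_code,
        "duration_ms": duration_ms,
    })
    logger.info(entry)
    return entry  # returned only so the output is easy to inspect

log_request("/favicon.ico", "GET", 404, 1.71)
```

Because every field is machine-parseable, a log pipeline can count ERROR-level entries per endpoint and cross-check the metric-based SLI.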
  • Traces — record the life of a single request across services and network hops. Traces are vital in distributed systems to pinpoint which service added latency or propagated an error.
A presentation slide titled "The Three Data Types for Reliability Measurements" highlighting "Traces" with three bullet examples (end-to-end request paths, service dependency maps, cross-service error propagation). Below is a trace timeline screenshot showing a GET /health request for "kodekloud-record-store-api" with span durations and service operations.
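A toy sketch of how spans compose into a trace. Real systems use a tracing SDK such as OpenTelemetry; this only illustrates the parent/child timing structure behind a timeline like the one in the screenshot:

```python
# Toy illustration of spans: each span records an operation's name, parent,
# and duration, which is how a trace timeline is assembled.
import time
from contextlib import contextmanager

trace = []    # completed spans, appended as each one finishes
_stack = []   # names of currently open spans (innermost last)

@contextmanager
def span(name: str):
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        trace.append({
            "name": name,
            "parent": parent,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

with span("GET /health"):        # root span: the whole request
    with span("db.query"):       # child span: one hop inside the request
        time.sleep(0.01)         # stand-in for real work

for s in trace:
    print(s)
```

The child span's duration tells you which hop added the latency, which is exactly what a trace viewer visualizes.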
Together: metrics describe what happened, logs explain what went wrong, and traces reveal where it happened.

Minimal set of SLIs — Google’s four golden signals

If you can only measure four things, capture Google’s golden signals — essential SLIs for reliability and incident response:
  1. Latency — time to serve a request. Use percentiles (p95, p99) to surface slow requests rather than averages.
    • Example: 99% of requests complete under 200 ms.
  2. Traffic — volume of requests (requests per second). Important for capacity planning and impact assessment.
    • Example: 1,000 requests/s with under 1% errors.
  3. Saturation — how close resources are to limits (CPU, memory, queue depth). Saturation is an early warning sign of trouble.
  4. Errors — rate of failed requests. Define what constitutes an error for your system (HTTP 5xx, timeouts, application exceptions).
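All four signals can be computed from a window of request records. A sketch with illustrative data; the saturation value stands in for a host-level metric that would come from a separate source:

```python
# Sketch: the four golden signals over one 60-second window of
# (latency_ms, succeeded) request records. Data is illustrative.
WINDOW_S = 60
requests = [(12.0, True)] * 950 + [(180.0, True)] * 45 + [(900.0, False)] * 5

latencies = sorted(l for l, _ in requests)
p99 = latencies[int(0.99 * len(latencies)) - 1]       # nearest-rank p99
traffic = len(requests) / WINDOW_S                    # requests per second
errors = sum(1 for _, ok in requests if not ok) / len(requests)
cpu_busy = 0.72                                       # saturation: from a host metric

print(f"latency p99={p99}ms traffic={traffic:.1f} rps "
      f"errors={errors:.2%} saturation={cpu_busy:.0%}")
```

Note that p99 surfaces the slow tail (180 ms) even though most requests finish in 12 ms, which an average would largely hide.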
Read more in the SRE book: Monitoring Distributed Systems (Google SRE).

Black box vs white box monitoring

  • Black box (external) monitoring simulates user interactions and measures availability/latency from the end-user perspective: pings, synthetic transactions, page load timings.
  • White box (internal) monitoring exposes service-internal telemetry (metrics, logs, traces) so you can diagnose why an SLO failed.
Use both: black box shows real user impact; white box enables fast diagnosis.
A slide illustration titled "Monitoring Techniques" that compares Blackbox Monitoring (a closed black cube with arrows labeled load time, ping, API call, response time, SSH) and Whitebox Monitoring (an open box with upward arrows labeled metrics, logs, traces). The graphic visually contrasts external checks versus internal telemetry.
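A black-box probe is just a scripted request with timing. The sketch below probes a throwaway local server so it is self-contained; a real probe would target production endpoints from outside your network, ideally from several regions:

```python
# Black-box probe sketch: measure availability and latency from outside the
# service. The local server here is a stand-in target for illustration.
import http.server
import threading
import time
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):                      # answer 200 on every path
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):          # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/health"

def probe(url: str, timeout: float = 2.0):
    """One synthetic check: did the endpoint answer, and how fast?"""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            up = resp.status == 200
    except OSError:                        # refused, timed out, DNS failure...
        up = False
    return up, (time.perf_counter() - start) * 1000

up, latency_ms = probe(url)
print(f"available={up} latency={latency_ms:.1f}ms")
server.shutdown()
```

Run on a schedule, a probe like this yields exactly the external availability and latency SLIs described above, with no knowledge of the service internals.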

Measurement windows: short, medium, long

Choose measurement windows to match their operational purpose:
  • Short windows (minutes or hours): immediate alerting and action.
  • Medium windows (a few hours): detect gradual degradation and provide operational context.
  • Long windows (weeks to months): SLO compliance tracking and long-term reliability goals (e.g., 30-day or quarterly windows).
You need all three: short for action, medium for context, and long for compliance.
A presentation slide titled "Designing Basic Monitoring for SLIs" that compares three measurement windows for SLO monitoring — short (immediate alerting), medium (operational awareness), and long (SLO compliance tracking) — shown with colored arrows and brief descriptions.
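Multi-window alerting is usually framed in terms of burn rate: how fast the error budget is being consumed relative to the rate that would exhaust it exactly at the end of the SLO window. A sketch with illustrative numbers; the 14.4x threshold is a commonly cited value that exhausts a 30-day budget in about two days:

```python
# Sketch of the burn-rate math behind multi-window alerting (values illustrative).
SLO = 0.999                      # 30-day objective -> 0.1% error budget

def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """1.0x = budget lasts exactly the SLO window; 2.0x = gone in half of it."""
    return error_ratio / (1.0 - slo)

# Common pattern: page only if BOTH a short and a longer window burn fast,
# so a brief spike (short window only) does not page anyone.
short_window_errors = 0.0150     # error ratio over the last 5 minutes
long_window_errors = 0.0148      # error ratio over the last hour
PAGE_THRESHOLD = 14.4            # burns a 30-day budget in ~2 days

page = (burn_rate(short_window_errors) > PAGE_THRESHOLD
        and burn_rate(long_window_errors) > PAGE_THRESHOLD)
print(f"short={burn_rate(short_window_errors):.1f}x "
      f"long={burn_rate(long_window_errors):.1f}x page={page}")
```

The short window makes the alert fast; the long window makes it trustworthy. Slower burn rates can open tickets instead of paging.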
Common pitfalls when building SLI-based monitoring:
  • Using infrastructure metrics (CPU, disk) as SLIs — they rarely reflect user experience.
  • Using averages rather than percentiles — averages can hide tail latency that affects users.
  • Setting thresholds that are too sensitive (alert fatigue) or too lax (missed incidents).
  • Measuring in the wrong place — always measure as close to the user as possible (use synthetic and real-user monitoring).
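The percentile pitfall is easy to demonstrate numerically: two slow requests out of a hundred barely move the mean but dominate p99.

```python
# Demo of the "averages vs percentiles" pitfall with made-up latencies:
# 98 fast requests and 2 very slow ones.
import statistics

latencies_ms = [20.0] * 98 + [2000.0] * 2
mean = statistics.mean(latencies_ms)                            # looks healthy
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms)) - 1]   # nearest-rank p99
print(f"mean={mean}ms p99={p99}ms")
```

The mean comes out around 60 ms while p99 is 2000 ms: an average-based alert would stay quiet while 2% of users wait two seconds.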
Synthetic monitoring (black box) uses scripted, repeatable user actions — logins, API calls, page loads — to validate external availability and measure the true user experience.

This completes the introduction to reliability measurements. Subsequent material dives deeper into defining effective SLIs, designing SLOs, and applying error budgets to guide product and engineering decisions.