Hey there — welcome back. In this lesson we move from theory into practice. Observability is more than collecting logs, metrics, and traces: it is using those signals to answer real operational questions when systems misbehave: what happened, where, when, and why. This guide shows how to connect those signals to real problems and turn dashboards and alerts into actionable investigations, so that observability becomes insight, not just data.
Systems will fail in unexpected ways. Traditional monitoring (is the server up? is CPU below 80%?) often doesn't help when users report that checkout is broken. Observability goes deeper: it lets you ask and answer new diagnostic questions by combining metrics, structured logs, and distributed traces.
A slide titled "Observability – Overview" showing a person at a laptop with an alert "Checkout page broken" alongside an iceberg diagram that contrasts shallow monitoring (server up, CPU, requests) above the water with deeper observability (what/when/where/why) below the surface. The observability section lists details like error rate spike, time after deployment, payment DB connection, and missing index in query.
Internalizing an observable mindset is essential for effective SRE work: assume unknown failures, enable unexpected questions, and focus on system behavior (not just simplistic health checks).
A presentation slide titled "Observability – Overview." It shows an "Observability Mindset" circle linked to three principles: assume unknown failures; enable unexpected questions; and understand behavior, not just health.

Hands-on: spin up the KodeKloud Record Store

We’ll run the Record Store app locally, generate traffic, and inspect the metrics, logs, and traces. Start the Compose stack (from repo root):
# From the repository root
docker-compose --env-file .env.dev up -d
You should see containers start. Example service logs (truncated):
2025-09-24T02:56:46.427879511Z caller=lifecycler.go:576 msg="instance not found in ring, adding with no tokens" ring=ingester
2025-09-24T02:56:46.474503928Z caller=scheduler.go:634 msg="scheduler is ACTIVE in the ring"
2025-09-24T02:56:52.082655597Z logger=plugin.angulardetectorsprovider.dynamic level=info msg="Restored cache from database"
2025-09-24T02:56:52.111130847Z logger=plugin.store level=info msg="Loading plugins..."
Generate test traffic and logs:
# Generate test traffic (products, orders, errors)
./test_traffic.sh

# Generate logs for correlation testing
./scripts/generate_logs.sh

# Run synthetic monitoring
./black_box_monitor.sh
Example output from the log-generation script (trimmed and cleaned):
KodeKloud Records Store - Generating Test Data for Observability
===============================================
Generating logs with trace context...
{"message":"Test spans created","trace_id":"eddcac3a6ecc42d4c8d11afb427633a0","span_id":"65c2a13799c66d7f"}
Generating error logs...
{"error":"Simulated error","trace_id":"fa9effa8010ed5787d7195da925e7efc","span_id":"cd8bd7e6a2391fcd"}
Generating 404 error...
{"detail":"Not Found"}
Creating a product...
{"name":"Vinyl Record","price":19.99,"id":4}
Creating an order...
{"message":"Order received, processing in the background","order_id":6,"task_id":"e96f86f1-6351-4e56-aa0e-03543d9379c5"}
Generating slow operation with nested spans...

The three pillars of observability

Observability generally stands on three pillars:
  • Metrics — tell you “what” is happening (counts, latencies, throughput).
  • Logs — explain “why” (context, errors, enriched fields).
  • Traces — show “where” time is spent across distributed services.
A slide titled "The Three Pillars – A Deep Dive" showing three colored pillar icons labeled Metrics (What happened), Logs (Why it happened), and Traces (Where it happened). The icons are blue/purple, orange, and green respectively.

Metrics: instrument, expose, scrape, visualize

Typical flow:
  • Define counters, histograms, and gauges in the application.
  • Expose them on /metrics.
  • Have Prometheus scrape them.
  • Visualize in Grafana and evaluate rules with Prometheus (Alertmanager).
Minimal example (metrics.py):
from prometheus_client import Counter, Histogram

# Basic HTTP metrics
REQUEST_COUNT = Counter(
    "http_requests_total",           # metric name
    "Total HTTP Requests",           # description
    ["method", "endpoint", "status_code"]  # labels
)

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",  # metric name
    "HTTP Request Duration",         # description
    ["method", "endpoint"],           # labels
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]  # bucket boundaries
)

ERROR_COUNT = Counter(
    "http_request_errors_total",     # metric name
    "Total HTTP Request Errors",     # description
    ["method", "endpoint", "error_type"]  # labels
)
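To make labels concrete: every distinct combination of label values becomes its own time series. The following stdlib mimic (an illustration only, not how prometheus_client is actually implemented) sketches what `REQUEST_COUNT.labels(...).inc()` maintains internally:

```python
from collections import defaultdict

class MiniCounter:
    """Stdlib mimic of a labeled Prometheus counter (illustration only)."""
    def __init__(self, name, labelnames):
        self.name = name
        self.labelnames = labelnames
        # One monotonically increasing value per label combination
        self.samples = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        key = tuple(labels[n] for n in self.labelnames)
        self.samples[key] += amount

requests = MiniCounter("http_requests_total", ["method", "endpoint", "status_code"])
requests.inc(method="GET", endpoint="/products", status_code="200")
requests.inc(method="GET", endpoint="/products", status_code="200")
requests.inc(method="POST", endpoint="/orders", status_code="201")

# Two distinct label combinations -> two time series
print(len(requests.samples))  # 2
```

This is also why label choice matters: the number of series grows with the product of distinct values per label, which is the cardinality concern discussed below.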
Application-specific metrics (excerpt from the Record Store):
from prometheus_client import Counter, Histogram, Gauge, CollectorRegistry

METRICS_REGISTRY = CollectorRegistry()

# Business-specific traffic metric
orders_operations_total = Counter(
    name='kodekloud_records_operations_total',
    documentation='Total number of record operations (CRUD)',
    labelnames=['operation', 'status'],  # operation: create, read, update, delete
    registry=METRICS_REGISTRY
)

# HTTP request duration (default buckets are suitable for most web apps)
http_request_duration_seconds = Histogram(
    name='kodekloud_http_request_duration_seconds',
    documentation='Time spent processing HTTP requests in seconds',
    labelnames=['method', 'route'],
    registry=METRICS_REGISTRY
)

# Business process latency with custom buckets
order_processing_duration_seconds = Histogram(
    name='kodekloud_order_processing_duration_seconds',
    documentation='Time taken to process an order from start to completion',
    labelnames=['order_type'],  # e.g. standard, express
    registry=METRICS_REGISTRY
)
Middleware centralizes recording metrics for each HTTP request instead of scattering metric code through business logic. Example FastAPI middleware (records counts, durations, and annotates OpenTelemetry spans):
# main.py (middleware excerpt)
import time
from fastapi import Request
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from api.telemetry import normalize_route  # helper to normalize dynamic route segments
from api.metrics import http_requests_total, http_request_duration_seconds, http_errors_total

async def metrics_middleware(request: Request, call_next):
    start_time = time.time()
    method = request.method
    route = normalize_route(request)  # e.g., /products/{id} instead of /products/123

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span(f"{method} {route}") as span:
        try:
            response = await call_next(request)
            status_code = response.status_code
            # Add response attributes to span
            span.set_attribute("http.status_code", status_code)
            span.set_attribute("http.response.size", int(response.headers.get("content-length", 0)))
            if status_code >= 400:
                span.set_status(Status(StatusCode.ERROR))

            # Calculate duration and record metrics
            duration = time.time() - start_time
            http_requests_total.labels(method=method, route=route, status_code=str(status_code)).inc()
            http_request_duration_seconds.labels(method=method, route=route).observe(duration)

            return response

        except Exception as exc:
            span.set_status(Status(StatusCode.ERROR))
            # Record an error metric and re-raise
            http_errors_total.labels(method=method, route=route, error_type=type(exc).__name__).inc()
            raise
Normalize dynamic route segments (e.g., /products/{id} instead of /products/123) before using them as metric labels, and avoid labels carrying high-cardinality values such as user IDs or raw UUIDs: every distinct label combination creates a new time series, and unbounded label values can cause large memory and storage usage in Prometheus.
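The repo's actual `normalize_route` helper is not shown here, but a plausible stdlib sketch replaces numeric and UUID path segments with placeholders:

```python
import re

# Matches a canonical 8-4-4-4-12 hex UUID segment
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def normalize_route(path: str) -> str:
    """Collapse dynamic path segments so each route template yields one label value."""
    parts = []
    for seg in path.strip("/").split("/"):
        if seg.isdigit():
            parts.append("{id}")      # /products/123 -> /products/{id}
        elif UUID_RE.match(seg):
            parts.append("{uuid}")    # task IDs, order tokens, etc.
        else:
            parts.append(seg)
    return "/" + "/".join(parts)

print(normalize_route("/products/123"))  # /products/{id}
```

In the real app this normalization typically comes from the framework's matched route template rather than string munging, but the effect on label cardinality is the same.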
PromQL examples:
# Request rate by endpoint (per 5 minutes)
sum(rate(http_requests_total[5m])) by (endpoint)

# 95th percentile response time
histogram_quantile(
  0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
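`histogram_quantile` estimates the quantile by linear interpolation inside the bucket where the target rank falls. A small Python sketch of that math, using hypothetical cumulative bucket counts (simplified; real PromQL handles several more edge cases):

```python
import math

def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile.

    buckets: sorted (upper_bound, cumulative_count) pairs ending with (inf, total),
    mirroring Prometheus's cumulative *_bucket series.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: return the last finite bound
                return lower_bound
            # Linear interpolation within this bucket
            return lower_bound + (upper_bound - lower_bound) * (
                (rank - lower_count) / (count - lower_count)
            )
        lower_bound, lower_count = upper_bound, count
    return float("nan")

# Hypothetical data: 100 requests, 50 under 0.1s, 90 under 0.5s, 98 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # 0.8125
```

The takeaway: the result is an estimate whose accuracy depends on bucket boundaries, which is why choosing buckets around your latency SLO matters.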
You can view these metrics in Grafana dashboards for latency, throughput, error rate, and availability.
A presentation slide titled "The Three Pillars – A Deep Dive" showing icons for Metrics, Logs, and Traces and a flow from Python Code → Prometheus Metrics → Dashboards. The slide includes a Grafana dashboard screenshot displaying user-facing metrics like response time, throughput, error rate, and availability.
If you open the project in your editor you’ll find the API source files, Docker Compose, and telemetry code.
A screenshot of Visual Studio Code showing the Welcome page and Explorer sidebar for a project (kodekloud-records-store-web-app) with files like docker-compose.yaml, Dockerfile, and test_traffic.sh. The right pane displays Start options and Walkthroughs/tutorial cards.
A typical set of imports in main.py (cleaned and corrected):
# main.py (imports excerpt)
from fastapi import FastAPI, Request
from api.routes import router
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response
import logging
import json
import time
from api import models  # ensure models imported so tables exist
from api.database import engine
from api.telemetry import (
    setup_telemetry, get_tracer,
    # Metric names imported from metrics module
    http_requests_total,
    http_request_duration_seconds,
    http_errors_total,
    application_errors_total,
    active_connections,
    custom_registry,
    # helpers
    normalize_route,
    get_error_class,
)
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
Prometheus configuration (prometheus.yaml) — consolidated example:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kodekloud-record-store-api'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api:8000']

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']

  - job_name: 'blackbox-exporter'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['blackbox-exporter:9115']

  - job_name: 'blackbox-health'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://api:8000/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        regex: '.*:\d+/(.*)'
        replacement: /$1
        target_label: endpoint
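The first three relabel rules implement the standard blackbox-exporter pattern: the probe target URL is moved into a request parameter and the scrape address is rewritten to point at the exporter itself. A small Python simulation of those three rules (illustrative only; real relabeling runs inside Prometheus):

```python
def relabel_blackbox(target):
    """Simulate the three standard blackbox-exporter relabel rules."""
    labels = {"__address__": target}
    # 1. source_labels: [__address__] -> target_label: __param_target
    labels["__param_target"] = labels["__address__"]
    # 2. source_labels: [__param_target] -> target_label: instance
    labels["instance"] = labels["__param_target"]
    # 3. replacement: blackbox-exporter:9115 -> target_label: __address__
    labels["__address__"] = "blackbox-exporter:9115"
    return labels

print(relabel_blackbox("http://api:8000/health"))
```

After relabeling, Prometheus scrapes http://blackbox-exporter:9115/probe?module=http_2xx&target=http://api:8000/health, while the instance label still identifies the original URL being probed.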
You can view what Prometheus will scrape at the metrics endpoint, e.g. http://localhost:8000/metrics. Example output (truncated and cleaned):
# HELP kodekloud_http_errors_total Total number of HTTP errors
# TYPE kodekloud_http_errors_total counter
kodekloud_http_errors_total{error_code="5xx",method="GET",route="/error-test"} 1.0
kodekloud_http_errors_total{error_code="4xx",method="GET",route="/products/{id}"} 1.0
# HELP kodekloud_http_request_duration_seconds Time spent processing HTTP requests in seconds
# TYPE kodekloud_http_request_duration_seconds histogram
# HELP kodekloud_active_connections_current Current number of active connections
# TYPE kodekloud_active_connections_current gauge
kodekloud_active_connections_current 1.0
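Each non-comment line in this exposition format is `name{labels} value`. A small stdlib parser sketch (not an official client library; real parsers also handle timestamps, label escaping, and exemplars):

```python
import re

# name, optional {label="value",...} block, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_sample(line):
    """Parse one Prometheus exposition line into (name, labels, value)."""
    m = LINE_RE.match(line)
    labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'kodekloud_http_errors_total{error_code="5xx",method="GET",route="/error-test"} 1.0'
)
print(name, labels["route"], value)
```

This is essentially what Prometheus does on every scrape: each parsed line becomes one sample appended to the time series identified by the name plus label set.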

Logs: structured, enriched, and correlated

Structured JSON logs are critical for filtering, searching, and correlating with traces and metrics. Include trace_id and span_id in logs so you can join traces and logs for root-cause analysis. Example structured log entry:
{
  "timestamp": "2023-07-15T14:32:15.321Z",
  "level": "ERROR",
  "message": "Product not found during checkout",
  "trace_id": "4fd9662137ced86f5b6f59ab578c",
  "span_id": "7f42e1ca2a1d5f8b",
  "method": "POST",
  "endpoint": "/checkout",
  "product_id": 999,
  "operation": "checkout",
  "error_type": "HTTPException",
  "status_code": 404,
  "duration_ms": 1247
}
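Because every entry carries trace_id, joining logs to a trace is a plain group-by. A minimal sketch with made-up trace IDs:

```python
import json
from collections import defaultdict

# Sample structured log lines (trace IDs are placeholders)
raw_lines = [
    '{"level": "INFO", "message": "checkout started", "trace_id": "abc123", "span_id": "s1"}',
    '{"level": "ERROR", "message": "Product not found during checkout", "trace_id": "abc123", "span_id": "s2"}',
    '{"level": "INFO", "message": "health check", "trace_id": "def456", "span_id": "s3"}',
]

# Group entries by trace so one request's full story can be read together
logs_by_trace = defaultdict(list)
for raw in raw_lines:
    entry = json.loads(raw)
    logs_by_trace[entry["trace_id"]].append(entry)

for entry in logs_by_trace["abc123"]:
    print(entry["level"], entry["message"])
```

In practice Loki does this grouping for you (e.g., filtering a stream on a trace_id field in LogQL), but the mechanics are the same join-on-trace_id shown here.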
The app uses a small structured logger that enriches messages with trace context and extra fields:
# structured_logger.py
import logging
import json
from opentelemetry import trace

class StructuredLogger:
    def __init__(self, name):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.INFO)

    def _span_context_ids(self):
        span = trace.get_current_span()
        span_context = span.get_span_context() if span is not None else None
        trace_id = format(span_context.trace_id, "032x") if span_context and span_context.trace_id else None
        span_id = format(span_context.span_id, "016x") if span_context and span_context.span_id else None
        return trace_id, span_id

    def info(self, msg, **kwargs):
        trace_id, span_id = self._span_context_ids()
        log_data = {
            "message": msg,
            "level": "INFO",
            "trace_id": trace_id,
            "span_id": span_id,
            **kwargs
        }
        self.logger.info(json.dumps(log_data))

    def error(self, msg, **kwargs):
        trace_id, span_id = self._span_context_ids()
        log_data = {
            "message": msg,
            "level": "ERROR",
            "trace_id": trace_id,
            "span_id": span_id,
            **kwargs
        }
        self.logger.error(json.dumps(log_data))

# Usage (in main app)
structured_logger = StructuredLogger(__name__)
structured_logger.info("database_init", status="starting", action="check")
models.Base.metadata.create_all(bind=engine)
structured_logger.info("database_init", status="complete", action="tables_created")

# Initialize OpenTelemetry
setup_telemetry()
Container logs are collected by Fluent Bit (via Docker Fluentd/Fluent Bit driver), which attaches container metadata and forwards structured logs to Grafana Loki. Loki stores labeled log streams efficiently and LogQL allows queries that correlate logs to metrics and traces.
A presentation slide titled "The Three Pillars – A Deep Dive" with three pillar icons on the left labeled Metrics, Logs, and Traces. On the right it lists four structured-logging benefits: machine-parsable for analysis, consistent fields across services, rich context for debugging, and correlation with traces and metrics.

Traces: visualize the request journey

Traces show the request journey across services. Each span is a timed operation with attributes and events. Example checkout trace (summary):
Trace ID: 4fd9662137ced86f5b6f59ab578c
├─ POST /checkout: 1,347ms
│  ├─ verify_product: 134ms
│  │  └─ database-query: 128ms
│  ├─ processing_delay: 800ms ⚠ (simulated latency)
│  ├─ create_order_record: 89ms
│  │  └─ database-insert: 67ms
│  ├─ queue_background_processing: 22ms
│  └─ send_order_confirmation: 45ms
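Reading a trace like this is mostly about finding where the parent's time went. A tiny sketch using the child-span durations above:

```python
# Child spans of POST /checkout with durations taken from the trace summary
child_spans = [
    ("verify_product", 134),
    ("processing_delay", 800),
    ("create_order_record", 89),
    ("queue_background_processing", 22),
    ("send_order_confirmation", 45),
]
total_ms = 1347  # duration of the parent POST /checkout span

# The bottleneck is the child consuming the largest share of the parent
name, duration = max(child_spans, key=lambda s: s[1])
share = duration / total_ms
print(f"bottleneck: {name} ({duration}ms, {share:.0%} of the request)")
```

Here the simulated processing delay dominates, which is exactly the "Investigation Focus: optimize background tasks" conclusion a trace view leads you to.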
Traces combined with structured logs and metrics let you pinpoint bottlenecks and errors. Jaeger is used here for trace visualization: search by service, operation, tags, or trace ID.
A screenshot of the Jaeger UI showing a timeline scatter/bubble chart of trace durations with a search/filter sidebar on the left. Below the chart is a list of 20 traces for the service "kodekloud-record-store-api-dev," including operations like GET /health and GET /metrics.
Generate test traces:
# Create a test trace
curl http://localhost:8000/trace-test
# Create a test error (trace will include error span)
curl http://localhost:8000/error-test
Example responses:
{"message":"Test spans created","trace_id":"25d7f03dbacb55e428525dcbaa0cf081","span_id":"8434d8b79e486d12"}
{"error":"Simulated error","trace_id":"b3ebdb35dc22f6c451f823ba44025d7a","span_id":"488b6eada0e79420"}
Click a trace in Jaeger to inspect span timings, tags, and process details.
A screenshot of the Jaeger UI showing a distributed trace for "kodekloud-record-store-api-dev: GET /trace-test," with a timeline of spans, durations, and a highlighted "test-span" containing tags and process info. The view displays span bars, timing markers, and trace details like start time and total duration.
When an error occurs, expand the trace to find the error span and related logs.
A browser screenshot of the Jaeger UI showing a trace for "kodekloud-record-store-api-dev: GET /error-test" with a timeline of spans, durations, and span details. The panel shows an "error-span" entry and a warning about a duplicate tag "error:true".

The observability stack wiring

All components are wired together in Docker Compose: application services (API, worker), DB, message broker, Prometheus, Pushgateway, Grafana, Alertmanager, Loki, Fluent Bit, Jaeger, and Blackbox Exporter.
A diagram titled "Implementing Observability at KodeKloud Records Store" showing the observability stack: application services (API/FastAPI, Worker/Celery, DB/PostgreSQL, RabbitMQ), metrics & monitoring tools (Prometheus, Pushgateway, Grafana, Alertmanager, Blackbox Exporter), logging (Loki, Fluent Bit) and distributed tracing (Jaeger).
Example Compose fragments showing environment and logging driver configuration:
services:
  api:
    environment:
      OTEL_TRACES_SAMPLER: ${OTEL_TRACES_SAMPLER}
      OTEL_PROPAGATORS: "tracecontext,baggage"
      DEBUG: ${DEBUG}
      LOG_LEVEL: ${LOG_LEVEL}
      ENVIRONMENT: ${ENVIRONMENT}
    logging:
      driver: "fluentd"
      options:
        fluentd-address: "localhost:24224"
        tag: "docker.{{.Name}}"
        fluentd-async: "true"
    networks:
      - kodekloud-record-store-net

  worker:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: kodekloud-record-store-worker
    command: ["celery", "-A", "api.worker", "worker", "--loglevel=info"]
    restart: always
    depends_on:
      - rabbitmq
      - db
      - pushgateway
      - jaeger
    environment:
      POSTGRES_HOST: ${POSTGRES_HOST}
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      RABBITMQ_HOST: ${RABBITMQ_HOST}
      PROMETHEUS_PUSHGATEWAY: ${PROMETHEUS_PUSHGATEWAY}
      PYTHONPATH: ${PYTHONPATH}
      OTEL_SERVICE_NAME: kodekloud-record-store-worker
      DEBUG: ${DEBUG}
      LOG_LEVEL: ${LOG_LEVEL}
      ENVIRONMENT: ${ENVIRONMENT}
    networks:
      - kodekloud-record-store-net
From a tooling perspective, the stack includes:
  • Prometheus: metrics collection, rule evaluation, and alerting
  • Pushgateway: push target for metrics from short-lived/one-off jobs
  • Grafana: dashboards and visualization for metrics and logs
  • Alertmanager: alert routing (Slack, PagerDuty, email)
  • Blackbox Exporter: external synthetic probes and health checks
  • Fluent Bit: container log collection and forwarding
  • Loki: efficient label-based structured log storage
  • Jaeger / OpenTelemetry: distributed tracing and trace visualization
A screenshot of an "Observability Tools" slide listing a metrics stack (Prometheus, Pushgateway, Grafana, Alertmanager, Blackbox Exporter) with their ports and brief descriptions. On the right is a Jaeger-like UI panel for searching and filtering traces.
A presentation slide titled "Observability Tools" listing a logging stack: Fluent Bit (Port 24224) to collect Docker logs, Loki (Port 3100) to store structured logs, and Grafana to display logs with metrics. On the right is a Jaeger UI trace search panel.
Putting it all together: a user request reaches FastAPI; middleware and instrumentation generate metrics, structured logs, and traces; Prometheus scrapes metrics and evaluates alerts; logs flow to Loki via Fluent Bit; traces flow to Jaeger via OpenTelemetry exporters. Alertmanager routes alerts to the right channels. Grafana ties metrics, logs, and traces together to support fast investigation.
A presentation slide titled "The Three Pillars – A Deep Dive" showing three pillar icons labeled Metrics, Logs, and Traces on the left and a boxed summary on the right titled "What This Trace Reveals" with notes: Total Request Time 1.35s, Bottleneck: Processing delay → 800ms, Database <150ms, and Investigation Focus: Optimize background tasks.

Wrap-up

Thanks for sticking with this practical lesson. We covered how metrics, logs, and traces are produced, collected, stored, and visualized in a concrete stack, and how they combine to help you detect, investigate, and resolve problems faster. For deeper exploration, check the resources below. You can also explore data sources and visualization fundamentals in the accompanying material.
