A practical guide to implementing observability with metrics, logs, and traces, using a sample app and a Prometheus/Grafana/Loki/Jaeger stack
Hey there, welcome back. In this lesson we move from theory into practice. Observability is more than collecting logs, metrics, and traces: it's using those signals to answer real operational questions when systems misbehave. What happened, where, when, and why? This guide shows how to connect signals to real problems and turn dashboards and alerts into actionable investigations. By the end you'll see how observability becomes actionable insight, not just data.

Systems will fail in unexpected ways. Traditional monitoring (is the server up? is CPU below 80%?) often doesn't help when users report that checkout is broken. Observability goes deeper: it lets you ask and answer new diagnostic questions using combined metrics, structured logs, and distributed traces.
Internalizing an observable mindset is essential for effective SRE work: assume unknown failures, enable unexpected questions, and focus on system behavior (not just simplistic health checks).
We’ll run the Record Store app locally, generate traffic, and inspect the metrics, logs, and traces. Start the Compose stack (from the repo root):
# From the repository root
docker-compose --env-file .env.dev up -d
You should see containers start. Example service logs (truncated):
2025-09-24T02:56:46.427879511Z caller=lifecycler.go:576 msg="instance not found in ring, adding with no tokens" ring=ingester
2025-09-24T02:56:46.474503928Z caller=scheduler.go:634 msg="scheduler is ACTIVE in the ring"
2025-09-24T02:56:52.082655597Z logger=plugin.angulardetectorsprovider.dynamic level=info msg="Restored cache from database"
2025-09-24T02:56:52.111130847Z logger=plugin.store level=info msg="Loading plugins..."
Generate test traffic and logs:
# Generate test traffic (products, orders, errors)
./test_traffic.sh

# Generate logs for correlation testing
./scripts/generate_logs.sh

# Run synthetic monitoring
./black_box_monitor.sh
Example output from the log-generation script (trimmed and cleaned):
KodeKloud Records Store - Generating Test Data for Observability
===============================================
Generating logs with trace context...
{"message":"Test spans created","trace_id":"eddcac3a6ecc42d4c8d11afb427633a0","span_id":"65c2a13799c66d7f"}
Generating error logs...
{"error":"Simulated error","trace_id":"fa9effa8010ed5787d7195da925e7efc","span_id":"cd8bd7e6a2391fcd"}
Generating 404 error...
{"detail":"Not Found"}
Creating a product...
{"name":"Vinyl Record","price":19.99,"id":4}
Creating an order...
{"message":"Order received, processing in the background","order_id":6,"task_id":"e96f86f1-6351-4e56-aa0e-03543d9379c5"}
Generating slow operation with nested spans...
Application-specific metrics (excerpt from the Record Store):
from prometheus_client import Counter, Histogram, Gauge, CollectorRegistry

METRICS_REGISTRY = CollectorRegistry()

# Business-specific traffic metric
orders_operations_total = Counter(
    name='kodekloud_records_operations_total',
    documentation='Total number of record operations (CRUD)',
    labelnames=['operation', 'status'],  # operation: create, read, update, delete
    registry=METRICS_REGISTRY
)

# HTTP request duration (default buckets are suitable for most web apps)
http_request_duration_seconds = Histogram(
    name='kodekloud_http_request_duration_seconds',
    documentation='Time spent processing HTTP requests in seconds',
    labelnames=['method', 'route'],
    registry=METRICS_REGISTRY
)

# Business process latency with custom buckets
order_processing_duration_seconds = Histogram(
    name='kodekloud_order_processing_duration_seconds',
    documentation='Time taken to process an order from start to completion',
    labelnames=['order_type'],  # e.g. standard, express
    buckets=(0.1, 0.5, 1, 2.5, 5, 10, 30, 60),  # illustrative custom buckets for longer-running work
    registry=METRICS_REGISTRY
)
Rather than scattering metric code through business logic, a middleware centralizes metric recording for each HTTP request. Example FastAPI middleware (records counts and durations, and annotates OpenTelemetry spans):
# main.py (middleware excerpt)
import time

from fastapi import Request
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

from api.telemetry import normalize_route  # helper to normalize dynamic route segments
from api.metrics import http_requests_total, http_request_duration_seconds, http_errors_total


async def metrics_middleware(request: Request, call_next):
    start_time = time.time()
    method = request.method
    route = normalize_route(request)  # e.g., /products/{id} instead of /products/123
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span(f"{method} {route}") as span:
        try:
            response = await call_next(request)
            status_code = response.status_code

            # Add response attributes to the span
            span.set_attribute("http.status_code", status_code)
            span.set_attribute("http.response.size", int(response.headers.get("content-length", 0)))
            if status_code >= 400:
                span.set_status(Status(StatusCode.ERROR))

            # Calculate duration and record metrics
            duration = time.time() - start_time
            http_requests_total.labels(method=method, route=route, status_code=str(status_code)).inc()
            http_request_duration_seconds.labels(method=method, route=route).observe(duration)
            return response
        except Exception as exc:
            span.set_status(Status(StatusCode.ERROR))
            # Record an error metric and re-raise
            http_errors_total.labels(method=method, route=route, error_type=type(exc).__name__).inc()
            raise
Normalize dynamic route segments for metric labels (e.g., record /products/{id} rather than /products/123) to avoid high-cardinality label explosions.
Avoid tagging metrics with high-cardinality values (user IDs, raw UUIDs). High-cardinality labels can cause large memory and storage usage in Prometheus.
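The normalize_route helper imported in the middleware isn't shown in the excerpt. A plausible stdlib-only sketch (the repo's actual implementation may differ) replaces numeric IDs and UUIDs in the path with placeholders, keeping the route label low-cardinality:

```python
import re

# Hypothetical route normalizer: collapse dynamic path segments into
# placeholders so every request to /products/<n> shares one label value.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def normalize_path(path: str) -> str:
    parts = []
    for segment in path.split("/"):
        if segment.isdigit():
            parts.append("{id}")       # /products/123 -> /products/{id}
        elif UUID_RE.match(segment):
            parts.append("{uuid}")     # task IDs, order tokens, etc.
        else:
            parts.append(segment)
    return "/".join(parts)
```

Frameworks that know their route table (FastAPI does) can instead report the matched route template directly, which is more robust than pattern matching.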
PromQL examples:
# Per-second request rate by route, averaged over 5 minutes
sum(rate(http_requests_total[5m])) by (route)

# 95th percentile response time per route
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
)
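To build intuition for what histogram_quantile computes, here is a simplified stdlib sketch (not Prometheus source code; it omits Prometheus' special handling of the +Inf bucket): find the bucket where the target rank falls, then interpolate linearly within it.

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count), as in
    the _bucket time series a Prometheus histogram exposes.
    Simplified sketch for intuition only.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, prev_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket containing the rank
            in_bucket = count - prev_count
            fraction = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, prev_count = upper_bound, count
    return buckets[-1][0]

# 100 requests: 50 took <=0.1s, 90 took <=0.5s, all took <=1.0s
p95 = histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100)])  # 0.75
```

This is why bucket boundaries matter: the estimate can only interpolate between the boundaries you configured, which is the motivation for the custom buckets on the order-processing histogram above.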
You can view these metrics in Grafana dashboards for latency, throughput, error rate, and availability.
If you open the project in your editor you’ll find the API source files, Docker Compose, and telemetry code.
You can view what Prometheus will scrape at the metrics endpoint, e.g. http://localhost:8000/metrics. Example output (truncated and cleaned):
# HELP kodekloud_http_errors_total Total number of HTTP errors
# TYPE kodekloud_http_errors_total counter
kodekloud_http_errors_total{error_code="5xx",method="GET",route="/error-test"} 1.0
kodekloud_http_errors_total{error_code="4xx",method="GET",route="/products/{id}"} 1.0
# HELP kodekloud_http_request_duration_seconds Time spent processing HTTP requests in seconds
# TYPE kodekloud_http_request_duration_seconds histogram
# HELP kodekloud_active_connections_current Current number of active connections
# TYPE kodekloud_active_connections_current gauge
kodekloud_active_connections_current 1.0
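If you want to sanity-check a scrape programmatically (in a smoke test, say), the text exposition format is simple enough to parse with the stdlib. A minimal sketch that skips # comments and handles plain label sets (no escape sequences or exemplars):

```python
import re

# One sample line: metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_exposition(text):
    """Parse Prometheus text exposition into {(name, labels): value}.

    Minimal sketch: ignores HELP/TYPE lines and does not handle
    escaped quotes inside label values.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        pairs = re.findall(r'(\w+)="([^"]*)"', m.group("labels") or "")
        samples[(m.group("name"), tuple(sorted(pairs)))] = float(m.group("value"))
    return samples
```

For production use, the prometheus_client package ships a real parser; this sketch just shows there is no magic in the wire format.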
Structured JSON logs are critical for filtering, searching, and correlating with traces and metrics. Include trace_id and span_id in logs so you can join traces and logs for root-cause analysis. Example structured log entry:
{
  "timestamp": "2023-07-15T14:32:15.321Z",
  "level": "ERROR",
  "message": "Product not found during checkout",
  "trace_id": "4fd9662137ced86f5b6f59ab578c",
  "span_id": "7f42e1ca2a1d5f8b",
  "method": "POST",
  "endpoint": "/checkout",
  "product_id": 999,
  "operation": "checkout",
  "error_type": "HTTPException",
  "status_code": 404,
  "duration_ms": 1247
}
The app uses a small structured logger that enriches messages with trace context and extra fields:
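A stdlib-only sketch of such a logger (the repo's actual implementation may differ; in the real app the trace context would be read from the active OpenTelemetry span rather than passed in by hand):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, merging any extra fields."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured extras (trace_id, span_id, endpoint, ...)
        entry.update(getattr(record, "extra_fields", {}))
        return json.dumps(entry)

def get_logger(name):
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

log = get_logger("records")
log.error("Product not found during checkout",
          extra={"extra_fields": {"trace_id": "4fd9...", "span_id": "7f42...",  # illustrative IDs
                                  "status_code": 404}})
```

One JSON object per line is exactly what Fluent Bit and Loki want: each line parses cleanly, and trace_id becomes a searchable field instead of a substring match.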
Container logs are collected by Fluent Bit (via Docker's fluentd log driver, whose forward protocol Fluent Bit also speaks). Fluent Bit attaches container metadata and forwards the structured logs to Grafana Loki. Loki stores labeled log streams efficiently, and LogQL queries let you correlate logs with metrics and traces.
Traces combined with structured logs and metrics let you pinpoint bottlenecks and errors. Jaeger is used here for trace visualization: search by service, operation, tags, or trace ID.
Generate test traces:
# Create a test trace
curl http://localhost:8000/trace-test

# Create a test error (the trace will include an error span)
curl http://localhost:8000/error-test
Putting it all together: a user request reaches FastAPI; middleware and instrumentation generate metrics, structured logs, and traces; Prometheus scrapes metrics and evaluates alerts; logs flow to Loki via Fluent Bit; traces flow to Jaeger via OpenTelemetry exporters. Alertmanager routes alerts to the right channels. Grafana ties metrics, logs, and traces together to support fast investigation.
Thanks for sticking with this practical lesson. We covered how metrics, logs, and traces are produced, collected, stored, and visualized in a concrete stack, and how they combine to help you detect, investigate, and resolve problems faster. For deeper exploration, check the resources below.