Welcome back. This lesson connects the telemetry you collect to the dashboards you use to make decisions. Observability isn’t just about collecting metrics, logs, and traces — it’s about turning that raw telemetry into actionable insights. We’ll cover the common data sources (how Prometheus, Loki, and Jaeger feed your stack) and the visualization fundamentals for building effective Grafana dashboards. Think of observability as a pipeline: application code → metrics collection → time-series storage → Grafana dashboards.
A slide titled "The Observability Data Flow" showing a left-to-right pipeline of boxes: Application Code → Metrics Collection (Prometheus) → Data Source (TSDB) → Dashboard (Grafana). It visually depicts how application metrics move from code through Prometheus into a time-series database and then into a Grafana dashboard.
Each stage in this pipeline has a specific role:
  • Application code: export metrics and add trace/log context.
  • Prometheus: scrapes metrics and stores samples in its TSDB.
  • Grafana: queries Prometheus, Loki, and Jaeger to render dashboards and correlate data.
Example metric instrumentation in a Python app (metrics.py):
# metrics.py
from prometheus_client import Counter, Histogram

# Definitions; served at the /metrics endpoint (e.g. via start_http_server)
REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests", ["method", "endpoint"])
REQUEST_DURATION = Histogram("kodekloud_http_request_duration_seconds", "Request latency in seconds", ["method", "endpoint"])

REQUEST_COUNT.labels(method="POST", endpoint="/checkout").inc()
REQUEST_DURATION.labels(method="POST", endpoint="/checkout").observe(0.234)
Prometheus scrapes the /metrics endpoint on a configured interval (for example, every 15s). The TSDB stores series as timestamped samples:
# Data stored as time series (example)
http_requests_total{method="POST", endpoint="/checkout"} 1024 @1699123456
http_requests_total{method="POST", endpoint="/checkout"} 1127 @1699123471
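For reference, a minimal Prometheus scrape configuration that would produce samples like these. The job name matches the queries later in this lesson; the target host and port are assumptions about how the API container is reachable:

```yaml
# prometheus.yml (sketch; target host/port are assumptions)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: kodekloud-record-store-api
    metrics_path: /metrics
    static_configs:
      - targets: ["api:8000"]
```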
Logs and traces complement metrics: Loki collects logs (with optional trace IDs), and Jaeger stores traces for distributed request analysis. Grafana can be provisioned to surface all three data sources automatically on startup.
A slide titled "Grafana Data Source Configuration" showing a four-step query flow: Grafana sends a PromQL query, Prometheus returns time-series data, Grafana renders the visualization, and the dashboard refreshes automatically (5s, 30s, 1m). The top row lists configuration steps: data source configuration, Prometheus data source setup, multiple data sources, and Grafana query flow.
Provisioning Grafana with datasources and dashboards is typically done by mounting a provisioning folder (so Grafana loads it at startup). Example docker-compose snippet:
# docker-compose.yml (excerpt)
services:
  grafana:
    image: grafana/grafana:11.5.1
    container_name: kodekloud-record-store-grafana
    restart: always
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_PATHS_PROVISIONING: /etc/grafana/provisioning
    volumes:
      - grafana_data:/var/lib/grafana
      - ./config/monitoring/grafana-provisioning:/etc/grafana/provisioning
    networks:
      - kodekloud-record-store-net
    depends_on:
      - prometheus
      - loki
      - jaeger
A typical Grafana provisioning file (config/monitoring/grafana-provisioning/datasources.yml) that registers Prometheus, Loki, and Jaeger:
# config/monitoring/grafana-provisioning/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: true
Provision Grafana via GF_PATHS_PROVISIONING so datasources and dashboards are available at startup. Keep these provisioning files in version control for reproducible dashboards and consistent environments.
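Dashboards are provisioned the same way as datasources, via a provider file that points Grafana at a folder of dashboard JSON. A sketch, where the provider name, folder label, and paths are assumptions for this stack:

```yaml
# config/monitoring/grafana-provisioning/dashboards/dashboards.yml (sketch)
apiVersion: 1
providers:
  - name: kodekloud-dashboards
    folder: KodeKloud
    type: file
    allowUiUpdates: false
    options:
      path: /etc/grafana/provisioning/dashboards
```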
How Grafana and Prometheus interact:
  • Grafana panels issue PromQL queries to Prometheus.
  • Prometheus returns time-series samples.
  • Grafana renders those samples as line charts, stat panels, heatmaps, tables, etc.
  • Dashboards refresh on configured intervals (5s, 30s, 1m…), forming the core query-response loop of observability.
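Under the hood, this loop runs over Prometheus's JSON query API. A minimal sketch of parsing an instant-query response of the kind Grafana receives; the payload below is a hand-written sample in Prometheus's documented response format, not live data:

```python
import json

# Example instant-query response in Prometheus's documented format
# (hand-written sample, not live data).
response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "http_requests_total", "method": "POST", "endpoint": "/checkout"},
        "value": [1699123471, "1127"]
      }
    ]
  }
}
""")

# Each result pairs a label set with a [timestamp, value] sample,
# which Grafana turns into a point on a panel.
for series in response["data"]["result"]:
    labels = series["metric"]
    ts, value = series["value"]
    print(f'{labels["__name__"]}{{endpoint="{labels["endpoint"]}"}} = {value} @ {ts}')
```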
Data source roles (quick reference):
  • Prometheus: time-series metrics collection and TSDB (Grafana type: prometheus)
  • Loki: logs with structured context and trace IDs (Grafana type: loki)
  • Jaeger: distributed traces for request-flow analysis (Grafana type: jaeger)
When designing dashboards, follow three pragmatic laws:
A presentation slide titled "Dashboards – Basic Design Principles" with a header "The Three Laws of SRE Dashboards." It shows three colored panels labeled "Clarity Over Beauty," "Context Over Data," and "Action Over Information," each with a short explanation about readability, baselines/targets, and enabling decision-making.
  • Clarity over beauty: during an incident, readability matters more than aesthetics.
  • Context over raw data: include baselines, SLOs, and trend lines to interpret numbers.
  • Action over information: every chart should help make a decision (link to runbooks, show drilldowns).
The dashboard layout should be layered so an engineer can answer “Is everything okay?” in about 10 seconds:
  • Top row: high-level service health overview.
  • Middle rows: performance trends and comparisons.
  • Bottom rows: root-cause drilldowns and detailed logs/traces.
A presentation slide titled "Dashboards – Basic Design Principles" showing a Grafana dashboard. It displays a "Level 1: Service Health Overview" with large green panels and charts indicating overall service health and metrics.
If the top-level status indicates an issue, drill into performance details: request rate trends, latency percentiles, error rates, and resource utilization.
A presentation slide titled "Dashboards – Basic Design Principles" with the subtitle "Level 2: Service Performance Details — Investigate trends and issues." It shows a dark monitoring dashboard with charts for request rate by endpoint, error rate by status code, and response time (P50, P95).
Choose chart types deliberately — pick visuals that clarify trends and enable decisions.
A slide titled "Chart Types for SRE Metrics" recommending time-series (line) charts for request rates, latency and error rates because they show trends and spikes. On the right is a monitoring dashboard screenshot with response-time and CPU/memory usage graphs.
Use this quick mapping to choose visuals:
  • Time series (line): request rates, latency percentiles, error rates (shows trends and spikes clearly)
  • Stat / single-value: uptime, SLO compliance, P95 (immediate at-a-glance status)
  • Histogram / heatmap: response-time distributions (reveals outliers that percentiles hide)
  • Bar chart: comparative error rates or request volumes (simple comparison across groups)
A slide titled "Chart Types for SRE Metrics" with a "Status/Health" list showing Best Chart: Stat panel, Use Case: Service health/SLOs, and Why: Immediate visual status. To the right is a green tiled dashboard showing multiple P95 response time panels (95.0 ms).
A presentation slide titled "Chart Types for SRE Metrics" recommending heatmap/histogram for distribution data, with use case "response times" and reason "reveals real user experience." On the right is a dark-themed heatmap showing response time percentiles (average, P95, P99) over time.
A slide titled "Chart Types for SRE Metrics" showing a left panel of comparative data (01 Best Chart: Bar chart; 02 Use Case: Error rates by service; 03 Why: Easy comparison) and a right-side dark dashboard widget labeled "Error Rate SLO" with 0% error bars for several time windows.
Avoid charts that obscure trends: pie charts for time series, 3D effects, excessive color palettes, and tiny fonts make dashboards difficult to read during incidents.
A presentation slide titled "Chart Types for SRE Metrics" showing "What NOT to Use" with four numbered boxes warning against pie charts (not for time series), 3D charts (confusing), too many colors (hard to distinguish), and tiny fonts (unreadable during incidents).
Start a dashboard with a clear purpose. For the KodeKloud record store, the target is a service health overview that helps engineers quickly identify services needing attention. Critical metrics: availability, request rate, error rate, and response time (P95). Example PromQL queries for these metrics:
# Availability (up = 1 means target is reachable)
up{job="kodekloud-record-store-api"}

# Request rate (per second over 5m)
rate(http_requests_total[5m])

# Error rate (5xx responses per second)
rate(http_requests_total{status_code=~"5.."}[5m])

# Response time (P95) from histogram buckets
histogram_quantile(
  0.95,
  rate(kodekloud_http_request_duration_seconds_bucket[5m])
)
Layout guidance:
  • Top row: large status numbers and SLO indicators.
  • Middle rows: trends, comparisons, and resource usage.
  • Bottom rows: detailed breakdowns, tables, logs, and trace links. Add SLO overlays and threshold-based coloring so panels are green/yellow/red according to SLO boundaries.
A presentation slide titled "Building Your First Effective Dashboard" listing three steps (Define Your Purpose; Choose Your Metrics; Layout and Organization) with the third highlighted. To the right is a mock dashboard showing top metrics (99.8% uptime, 1.2K RPS, 120ms) and panels for Request Rate, Error Rate, Response Times, and Error Breakdown.
Example SLO targets for the record-store service:
# SLO Targets
availability: 99.9%
latency:
  p95: "< 500ms"
error_rate: "< 1%"
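The 99.9% availability target implies a concrete error budget. A quick sketch of the arithmetic, assuming a 30-day window:

```python
# Error budget implied by a 99.9% availability SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60  # 43,200 minutes in 30 days

allowed_downtime = window_minutes * (1 - slo)
print(f"Allowed downtime: {allowed_downtime:.1f} minutes per 30 days")
# → Allowed downtime: 43.2 minutes per 30 days
```

This is the number a "remaining error budget" stat panel would count down against.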
Validate dashboards under realistic load—generate synthetic traffic and errors to ensure alerts, color thresholds, and drilldowns behave as intended. To generate logs and traces for local testing:
# Generate comprehensive test data including errors and traffic
./scripts/generate_logs.sh
Example output after running the test generator (trimmed):
$ ./scripts/generate_logs.sh
KodeKloud Records Store - Generating Test Data for Observability
========================================
Generating logs with trace context...
{"message":"Test spans created","trace_id":"23e3b9c0012b79fe8e15de6d5babaef5","span_id":"e1cca1933c5ac72d"}
Generating error logs...
{"error":"Simulated error","trace_id":"eb29cf3893f8a137d5b1e47d2d482961","span_id":"5a6d3185f84fc29f"}
Generating 404 error...
{"detail":"Not Found"}
Generating product listing logs...
[{"name":"Vinyl Record","price":19.99,"id":1},{"name":"Vinyl Record","price":19.99,"id":2},{"name":"Abbey Road","price":25.99,"id":3}]
Open Grafana at http://localhost:3000 to explore provisioned dashboards. The demo dashboards include observability overviews, user-experience widgets, and an end-to-end purchase journey with links to detail dashboards.
A screenshot of the Grafana web UI showing the Dashboards menu and a list of KodeKloud dashboard entries. Large metric panels with big numeric values (e.g., 0.00351, 100, 0.0739) are visible along the bottom.
Logs often include trace context so you can pivot from a log line into a Jaeger trace and back to metrics. Example structured log line (JSON):
{
  "log":"{\"message\":\"http_error\",\"level\":\"ERROR\",\"trace_id\":\"7392f0da7857dbb33b493e4be7a9ba21\",\"span_id\":\"08d85bde2a104cf7\",\"method\":\"GET\",\"route\":\"/products/{id}\",\"status_code\":404,\"error_class\":\"4xx\",\"duration_ms\":5.0}",
  "container_id":"b9b82e5e5af0b6794b809f86c44953f3fbf08344c322dc58c371997d3d87576",
  "container_name":"/kodekloud-records"
}
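Pivoting from a log line to a trace starts with extracting the trace ID. Note the "log" field above is double-encoded JSON (the container runtime wraps the application's JSON log in its own envelope), so it must be unwrapped first. A sketch:

```python
import json

# Docker-style log envelope: the application's JSON log line is itself
# a string inside the "log" field.
envelope = {
    "log": '{"message":"http_error","level":"ERROR","trace_id":"7392f0da7857dbb33b493e4be7a9ba21","span_id":"08d85bde2a104cf7","status_code":404}',
    "container_name": "/kodekloud-records",
}

app_log = json.loads(envelope["log"])  # unwrap the inner JSON
trace_id = app_log["trace_id"]

# A trace ID like this can be pasted into Jaeger's search (or linked
# from Grafana) to jump to the matching trace.
print(f"level={app_log['level']} status={app_log['status_code']} trace_id={trace_id}")
```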
A screenshot of a Grafana dashboard showing performance metrics with multiple P95 response-time tiles, a requests-per-second panel, a red "100%" panel and a green "UP" panel with a checkmark. A response-time line chart is visible at the bottom and a dark navigation sidebar is on the left.
The P95 panel uses a PromQL expression like:
# P95 response time in milliseconds
histogram_quantile(0.95, rate(kodekloud_http_request_duration_seconds_bucket[5m])) * 1000
Grafana dashboards and panels are stored in the provisioning folder (datasources and dashboards). Inspect the JSON to see how queries, thresholds, units, links, and alerts are configured. Example panel field configuration (thresholds and unit):
"panels": [
  {
    "fieldConfig": {
      "defaults": {
        "thresholds": {
          "steps": [
            { "color": "green", "value": null },
            { "color": "yellow", "value": 0.02 },
            { "color": "red", "value": 0.05 }
          ]
        },
        "unit": "percent"
      },
      "overrides": []
    },
    "gridPos": { "h": 8, "w": 6, "x": 12, "y": 0 }
  }
]
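Grafana applies these threshold steps by taking the highest step whose value the measurement meets or exceeds, with the null step as the base color. A sketch of that logic, using the steps from the panel JSON above:

```python
# Threshold steps as in the panel JSON: null is the base color, and
# higher steps take over once the value reaches them.
steps = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 0.02},
    {"color": "red", "value": 0.05},
]

def color_for(value, steps):
    """Return the color of the highest step whose threshold <= value."""
    color = steps[0]["color"]  # base step (value: null)
    for step in steps[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color

print(color_for(0.01, steps))  # green  (below 0.02)
print(color_for(0.03, steps))  # yellow (at least 0.02, below 0.05)
print(color_for(0.07, steps))  # red    (at least 0.05)
```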
Explore the provisioning folder, copy panels into your own dashboards, tweak units and thresholds, and validate under synthetic load to ensure panels and alerts behave as expected during incidents.
Do not store plaintext credentials in provisioning files. Use environment variables or a secrets manager (for example, GF_SECURITY_ADMIN_PASSWORD via Docker Compose) and restrict access to provisioning folders.
Use these patterns to build reproducible, testable dashboards that surface the right signals at the right time.