Prometheus Certified Associate (PCA)

Observability Fundamentals

Intro to Observability

Observability is the ability to understand and measure a system's state through the data it generates. It empowers you to derive actionable insights during unexpected events in dynamic environments. Implementing observability into your application or infrastructure offers numerous benefits, including improved internal insights, faster troubleshooting, enhanced detection of hidden issues, efficient performance monitoring, and smoother cross-team collaboration. Without observability, your application behaves like a black box—accepting data and producing results without revealing the underlying processes. By "peeling back the curtains," observability shows how individual components work in unison, helping you pinpoint failures precisely when problems occur.

Figure: Observability defined as understanding and measuring a system's state from the data it generates, with benefits including improved insights, faster troubleshooting, problem detection, and performance monitoring.

As system architectures become increasingly complex—especially with the rise of microservices—the need for effective observability grows. In a traditional monolithic application, logs and metrics are centralized. In contrast, a microservices architecture consists of multiple interconnected components, which makes troubleshooting more challenging because you must isolate the affected component, unravel the event sequence, and understand how all parts interact to cause the problem.

Figure: A monolithic application being split into microservices such as email, users, and auth.

Insight

When encountering issues like increased error rates, high latency, or service timeouts, observing just the symptom isn’t enough. Effective observability helps you diagnose the underlying causes, enabling you to address both the symptoms and the root issues.

Figure: Slide titled "Observability" listing error rates, latency, and service timeouts as symptoms that need more information to troubleshoot.

To achieve true observability, focus on three main pillars: Logging, Tracing, and Metrics.

Logging

Logs are records of events that occur within the system, capturing details such as timestamps and event messages. They are generated by operating systems, applications, databases, and more. Although logs offer a wealth of information, they can be verbose and interleaved with data from concurrent processes across various systems, making it challenging to isolate specific issues.

Below is an example of typical log entries:

Oct 26 19:35:00 ub1 kernel: [37510.942568] e1000: enp0s3 NIC Link is Down
Oct 26 19:35:00 ub1 kernel: [37510.942697] e1000 0000:00:03.0 enp0s3: Reset adapter
Oct 26 19:35:03 ub1 kernel: [37513.054072] e1000: enp0s3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
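Because logs are free-form text, isolating specific events usually starts with splitting each line into fields. The following is a minimal sketch, assuming the syslog-style layout of the entries above (timestamp, host, source, message). The `LogEntry` type, the field names, and the regular expression are illustrative choices, not part of any particular logging tool.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str  # e.g. "Oct 26 19:35:00" (this format carries no year)
    host: str       # e.g. "ub1"
    source: str     # e.g. "kernel"
    message: str    # everything after the "source: " prefix


# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> <source>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<source>[^:\s]+):\s"
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not match the layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return LogEntry(**match.groupdict())


lines = [
    "Oct 26 19:35:00 ub1 kernel: [37510.942568] e1000: enp0s3 NIC Link is Down",
    "Oct 26 19:35:00 ub1 kernel: [37510.942697] e1000 0000:00:03.0 enp0s3: Reset adapter",
    "Oct 26 19:35:03 ub1 kernel: [37513.054072] e1000: enp0s3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX",
]

entries = [parse_log_line(line) for line in lines]

# Keep only the link state changes, skipping unrelated lines.
link_events = [e for e in entries if e is not None and "NIC Link" in e.message]
for event in link_events:
    print(event.timestamp, event.message)
```

Once entries are structured this way, they can be filtered by host, source, or time range instead of being searched as raw text.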

Tracing

Tracing involves following the entire journey of an individual request as it traverses various systems and services. This process provides a detailed, step-by-step insight into how different components of your application interact. Each trace is identified by a unique trace ID, and individual trace events, known as spans, capture critical details such as start time, duration, and context (including parent-child relationships). These spans may be generated by components like gateways, authentication services, user management, and databases.

Figure: A request traversing a gateway, an authentication service, a user service, and a database.

Each component in the request path produces a span. For instance, one span is generated at the gateway, another at the authentication layer, and additional spans follow as the request moves through successive services. Each span records details such as its start time, its duration, and its parent span, which identifies the component that initiated it.

Figure: A trace identified by a trace ID, with spans from the gateway, authentication, and user services plotted over time by start time and duration.
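The relationship between a trace and its spans can be sketched as a small data structure. This is an illustrative model only: the `Span` fields, the IDs, and the timings are hypothetical values chosen to mirror the gateway, authentication, user, and database flow described above, and they do not follow any specific tracing library's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the root span
    name: str                 # component that produced the span
    start_ms: int             # start time, in ms since the request began
    duration_ms: int


# Hypothetical spans for a single request.
spans = [
    Span("trace-1", "a", None, "gateway", 0, 120),
    Span("trace-1", "b", "a", "auth", 5, 30),
    Span("trace-1", "c", "a", "user", 40, 70),
    Span("trace-1", "d", "c", "database", 50, 45),
]


def children_by_parent(all_spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Group spans by parent ID so the tree can be walked top-down."""
    grouped: Dict[Optional[str], List[Span]] = {}
    for span in all_spans:
        grouped.setdefault(span.parent_id, []).append(span)
    return grouped


def render_trace(all_spans: List[Span]) -> List[str]:
    """Return one indented line per span: parents first, siblings by start time."""
    grouped = children_by_parent(all_spans)
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(grouped.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(
                f"{'  ' * depth}{span.name}: start={span.start_ms}ms "
                f"duration={span.duration_ms}ms"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


for line in render_trace(spans):
    print(line)
```

The parent ID is what turns a flat list of spans into a tree, which is how a tracing view can show that the database call happened inside the user service's work rather than alongside it.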

Metrics

Metrics provide numerical data that reflect a system's state. Unlike logs—which are text-based—metrics track quantitative measures such as CPU load, number of open files, HTTP response times, and error counts. This data can be aggregated and visualized over time to identify trends, anomalies, and performance issues.

Metrics typically include four key attributes:

| Attribute | Description |
| --- | --- |
| Metric Name | A descriptive label explaining what the metric represents. |
| Value | The current or most recent measure of the metric. |
| Timestamp | The exact time at which the metric was recorded. |
| Dimensions | Additional tags or context that provide further insights into the metric's meaning. |

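Here's an example of a single metric sample written in Prometheus notation, with the metric name first, the dimensions (labels) in curly braces, and the value last: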

node_filesystem_avail_bytes{fstype="vfat", mountpoint="/home"} 5000
# Collected at 4:30AM on 12/1/22
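To connect the sample line above to the four attributes, here is a minimal sketch that splits it into name, dimensions, value, and timestamp. The `MetricSample` type and the regular expressions are illustrative assumptions and handle only the simple labelled form shown here, not the full Prometheus text format. The timestamp is supplied separately, and reading "12/1/22" as December 1, 2022 is also an assumption.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class MetricSample:
    name: str                   # Metric Name
    dimensions: Dict[str, str]  # Dimensions (labels)
    value: float                # Value
    timestamp: datetime         # Timestamp


# Simplified patterns: name{label="value", ...} value
SAMPLE_PATTERN = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$"
)
LABEL_PATTERN = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_sample(line: str, collected_at: datetime) -> MetricSample:
    """Split one labelled sample line into its four attributes."""
    match = SAMPLE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    dimensions = dict(LABEL_PATTERN.findall(match.group("labels")))
    return MetricSample(
        name=match.group("name"),
        dimensions=dimensions,
        value=float(match.group("value")),
        timestamp=collected_at,
    )


sample = parse_sample(
    'node_filesystem_avail_bytes{fstype="vfat", mountpoint="/home"} 5000',
    collected_at=datetime(2022, 12, 1, 4, 30),  # assumed: December 1, 2022, 4:30 AM
)
print(sample.name, sample.dimensions, sample.value, sample.timestamp.isoformat())
```

Keeping the dimensions as key-value pairs is what lets the same metric name be filtered or grouped later, for example by `mountpoint`.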

Figure: Slide on metrics as numerical values such as CPU load and HTTP response times, visualized over time to identify trends.

Prometheus and Its Role in Observability

This section highlights Prometheus, a specialized monitoring solution designed for collecting and aggregating metrics data—the metrics pillar of observability. It is important to note that Prometheus does not handle logs or traces; you will need separate applications to capture those components of observability.

Figure: Prometheus as the monitoring solution that collects and aggregates metrics, shown alongside three sections labeled "Logs," "Metrics," and "Traces."

In summary, observability provides the means to reveal the inner workings of your system. It enables effective troubleshooting by clarifying the complex relationships between different components. As a critical part of your observability toolkit, Prometheus helps monitor and analyze the metrics that indicate your system’s overall health.

Key Takeaway

Implementing comprehensive observability practices—including logging, tracing, and metrics—is essential for managing modern, distributed systems effectively. Explore additional resources like Prometheus Documentation and Kubernetes Basics for further insights.
