Prometheus Certified Associate (PCA)

Application Instrumentation

Best Practice

In this article, we explain best practices for naming your metrics to ensure consistency, clarity, and ease of tracking. A standardized naming convention makes it easier to understand and interpret the data collected from various systems.

Naming Convention

Metric names must be written in snake_case, meaning all letters are lowercase and words are separated by underscores. For instance, the metric name http_requests_total follows this convention.

The structure for naming metrics should be:

  1. The first term represents the application or library associated with the metric. For example, metrics related to PostgreSQL should start with postgresql_.
  2. Subsequent terms describe what the metric measures, such as queue_size.
  3. Always append the unit of measurement (e.g., seconds, bytes, meters) to avoid misinterpretation. This ensures clarity, such as distinguishing between seconds and milliseconds.
  4. Use unprefixed base units (like seconds, bytes, meters) rather than their prefixed counterparts (such as microseconds or kilobytes).
  5. Avoid the reserved suffixes _count, _sum, and _bucket, which Prometheus generates automatically for histogram and summary metrics. The only suffix you should add yourself is _total, which every counter metric should end with.

The standard naming format includes the library name, a description, a unit, and, where applicable, an appropriate suffix.
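To make these rules concrete, some of them can be encoded as a validation check. The helper below is a hypothetical sketch, not part of any Prometheus library; the regex and the list of prefixed units are assumptions made for illustration.

```python
import re

# Prefixed units the guidelines tell us to reject (illustrative subset).
PREFIXED_UNITS = {"milliseconds", "microseconds", "kilobytes", "megabytes"}

def is_valid_metric_name(name: str) -> bool:
    """Check a metric name against the mechanical parts of the
    conventions above. (Whether the first term really is the library
    name cannot be checked automatically.)"""
    # Rule 1: snake_case only -- lowercase letters, digits, underscores.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
        return False
    # Rule 4: no prefixed units such as milliseconds or kilobytes.
    if any(part in PREFIXED_UNITS for part in name.split("_")):
        return False
    return True
```

For example, `is_valid_metric_name("http_requests_total")` passes, while `HTTP_request_sum` fails the snake_case check and `nginx_disk_free_kilobytes` fails the base-unit check.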


Examples of Metric Names

Below are some well-crafted examples that adhere to these conventions:

process_cpu_seconds
http_requests_total
redis_connection_errors
  • process_cpu_seconds uses snake_case, begins with the application/library (process), and includes the unit seconds.
  • http_requests_total starts with the relevant component (http), describes the metric (requests), and appropriately ends with _total for a counter metric.
  • redis_connection_errors clearly identifies the system (Redis) and describes the error type.

Additional Guidance

For tracking connection errors as a counter metric, you might use redis_connection_errors_total. In the case of node_disk_read_bytes_total, the name effectively highlights the source (Node), the measured metric (disk read bytes), and marks it as a counter with _total.

Conversely, avoid names that deviate from these guidelines:

  • Bad Example: container Docker restarts
    Recommendation: Use snake_case and place the application name first: docker_container_restarts.

  • Bad Example: HTTP_request_sum
    Recommendation: Use lowercase snake_case and avoid the reserved suffix _sum, which is generated automatically for summary and histogram metrics.

  • Bad Example: nginx_disk_free_kilobytes
    Recommendation: Replace kilobytes with the base unit bytes.

  • Bad Example: .NET queue waiting time
    Recommendation: Use snake_case and always include the unit for clarity (for example, seconds).
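Some of these fixes are mechanical. The sketch below is a hypothetical helper that repairs casing, separators, and prefixed units; it cannot reorder terms (such as putting the library name first) or add a missing unit, which still require human judgment.

```python
import re

# Prefixed units mapped to the base units recommended above.
UNIT_REPLACEMENTS = {
    "kilobytes": "bytes",
    "megabytes": "bytes",
    "milliseconds": "seconds",
    "microseconds": "seconds",
}

def suggest_metric_name(raw: str) -> str:
    """Mechanically normalize a metric name: lowercase, join words with
    underscores, and swap prefixed units for base units."""
    cleaned = raw.lower().replace(".", "")
    parts = re.split(r"[\s_]+", cleaned)
    parts = [UNIT_REPLACEMENTS.get(p, p) for p in parts]
    return "_".join(parts)
```

Applied to the bad examples, `nginx_disk_free_kilobytes` becomes `nginx_disk_free_bytes` and `container Docker restarts` becomes `container_docker_restarts` (which a human would then reorder to `docker_container_restarts`).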


What to Instrument

Choosing what to instrument depends on your system's type and its requirements. Metrics should be tailored to the specific operational context. Generally, there are three main types of applications:

1. Online Serving Systems

Online serving systems require immediate responses. They include components such as databases, web servers, and APIs. Common metrics for these systems include:

  • Total number of requests or queries
  • Number of errors
  • Latency measurements
  • Number of in-progress requests


2. Offline Processing Services

Offline processing services are used where immediate responses are not required. These systems typically perform batch processes involving multiple stages. Metrics to consider include:

  • Total amount of work to be done
  • Volume of queued work
  • Number of work items in progress
  • Processing rates
  • Errors at various processing stages
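A minimal sketch of these per-stage metrics, again as a dependency-free stand-in (a real pipeline would export them as Prometheus gauges and counters; the class and names here are assumptions for illustration):

```python
from collections import deque

class StageMetrics:
    """Tracks queued work, work in progress, processing counts, and
    errors for one stage of an offline pipeline."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()      # backlog: volume of queued work
        self.in_progress = 0      # gauge: items being worked on
        self.processed_total = 0  # counter: completed items
        self.errors_total = 0     # counter: failed items

    def submit(self, item):
        self.queue.append(item)

    def work(self, fn):
        """Pull one item off the queue and process it with fn."""
        item = self.queue.popleft()
        self.in_progress += 1
        try:
            fn(item)
            self.processed_total += 1
        except Exception:
            self.errors_total += 1
        finally:
            self.in_progress -= 1

stage = StageMetrics("resize_images")
stage.submit("a.png")
stage.work(lambda item: None)  # processed_total is now 1
```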


3. Batch Jobs

Batch jobs are scheduled to run at specific intervals rather than continuously. Because a batch job may not be running when Prometheus scrapes, pushing metrics to the Prometheus Pushgateway is the recommended collection method. Key metrics for batch jobs should include:

  • Time spent processing each stage of the job
  • Overall runtime of the job
  • Timestamp of the last job completion
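The Pushgateway accepts metrics in the Prometheus text exposition format. The sketch below builds such a payload for the three batch-job metrics using only the standard library; in practice you would use a client library's push helper (for example, push_to_gateway in the Python client). The metric names are illustrative assumptions.

```python
import time

def batch_job_payload(job_seconds: float, stage_seconds: dict) -> str:
    """Build a Prometheus text-format payload covering overall runtime,
    last-completion timestamp, and per-stage durations."""
    lines = [
        "# TYPE batch_job_duration_seconds gauge",
        f"batch_job_duration_seconds {job_seconds}",
        "# TYPE batch_job_last_success_timestamp_seconds gauge",
        f"batch_job_last_success_timestamp_seconds {time.time()}",
        "# TYPE batch_job_stage_duration_seconds gauge",
    ]
    for stage, secs in stage_seconds.items():
        # One sample per stage, distinguished by a "stage" label.
        lines.append(
            f'batch_job_stage_duration_seconds{{stage="{stage}"}} {secs}')
    return "\n".join(lines) + "\n"
```

The resulting string could be sent to a Pushgateway with an HTTP PUT at the end of the job run.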


Final Thoughts

Implementing these best practices ensures that your metrics are consistently named and accurately monitored, ultimately improving observability and simplifying troubleshooting across your systems.
