Chaos Engineering

Chaos Engineering on Kubernetes EKS

Demo Memory Stress on EKS Part 2

In this lesson, we’ll establish a steady-state baseline for our Amazon EKS application by collecting metrics from three AWS observability tools. This prepares us to measure the impact of our Fault Injection Service (FIS) memory-stress experiment.

Note

Establishing a steady-state baseline is crucial before running any chaos experiment. It helps you distinguish normal behavior from fault-induced anomalies.

Observability Tools and Key Metrics

| Observability Tool | Focus | Key Metrics |
| --- | --- | --- |
| CloudWatch Container Insights | Cluster-level | CPU & memory utilization, alarms |
| CloudWatch Performance Dashboard | Service-level | Running pods, CPU utilization, memory use |
| CloudWatch RUM | End-user metrics | Largest Contentful Paint (LCP), First Input Delay (FID), UX ratings |

1. CloudWatch Container Insights

To begin, navigate to the CloudWatch Container Insights dashboard and select your EKS cluster. Here you can view overall CPU and memory utilization, cluster state summaries, and alarm statuses.
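If you prefer to capture the same baseline programmatically, you can query the Container Insights metrics directly from CloudWatch. The sketch below builds a `GetMetricStatistics` request for cluster-level CPU and memory utilization; the cluster name `petsite-cluster` is a placeholder for your own EKS cluster.

```python
# Sketch: pull a steady-state baseline from Container Insights with boto3.
# Assumes Container Insights is enabled on the cluster; "petsite-cluster"
# is a placeholder cluster name, not the one used in this demo.
from datetime import datetime, timedelta, timezone

def baseline_query(cluster_name, metric_name):
    """Build GetMetricStatistics parameters for a Container Insights metric."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "ContainerInsights",
        "MetricName": metric_name,  # e.g. node_cpu_utilization
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "StartTime": now - timedelta(hours=1),  # last hour of "normal"
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Average", "Maximum"],
    }

# To actually fetch the datapoints (requires AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# for metric in ("node_cpu_utilization", "node_memory_utilization"):
#     resp = cw.get_metric_statistics(**baseline_query("petsite-cluster", metric))
#     print(metric, sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]))
```

Recording the averages and maximums from this window gives you concrete numbers to compare against once the FIS experiment is running.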

The image shows an AWS CloudWatch Container Insights dashboard for Amazon EKS, displaying cluster state summaries, performance metrics, and alarm states.

This baseline snapshot reveals how your cluster performs under normal conditions.


2. Service-Level Performance Dashboard

Next, go to the Services section under CloudWatch performance dashboards. Wait for the metrics to load, then review:

  • Number of running pods
  • Pod CPU utilization
  • Pod memory utilization
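The three service-level metrics above can also be fetched in a single `GetMetricData` call. This sketch builds the query list; the service name `PetSite` and Kubernetes namespace `default` are assumptions to adapt to your deployment.

```python
# Sketch: one GetMetricData request covering running pods, pod CPU, and
# pod memory for a service. Metric names follow Container Insights
# conventions; cluster/namespace/service values are placeholders.
def service_metric_queries(cluster, k8s_namespace, service):
    metrics = {
        "runningPods": "service_number_of_running_pods",
        "podCpu": "pod_cpu_utilization",
        "podMemory": "pod_memory_utilization",
    }
    return [
        {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "ContainerInsights",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "ClusterName", "Value": cluster},
                        {"Name": "Namespace", "Value": k8s_namespace},
                        {"Name": "Service", "Value": service},
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
        }
        for query_id, metric_name in metrics.items()
    ]

# import boto3
# cw = boto3.client("cloudwatch")
# resp = cw.get_metric_data(
#     MetricDataQueries=service_metric_queries("petsite-cluster", "default", "PetSite"),
#     StartTime=start, EndTime=end,
# )
```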

The image shows an AWS CloudWatch dashboard for monitoring service performance, displaying metrics like the number of running pods, CPU utilization, and memory utilization for a service named "PetSite."

Inspect the time-series graphs to see how these values evolve in real time.

The image shows an AWS CloudWatch dashboard displaying performance metrics for various services, including graphs of pod CPU utilization and a list of services with their average values.


3. Real User Monitoring (RUM)

For end-user experience, use CloudWatch RUM. Select your PetSite RUM app monitor to view session quality:

  • Positive
  • Tolerable
  • Frustrating

The current “Frustrating” rate is 0.9%, indicating most user sessions are performing well.

The image shows a dashboard from AWS CloudWatch displaying metrics for "Largest Contentful Paint" and "First Input Delay," with graphs indicating performance over several days. The metrics are categorized into positive, tolerable, and frustrating levels.


4. Page Load Metrics Overview

Finally, review the page load times and Cumulative Layout Shift (CLS) trends to understand the front-end impact before fault injection.
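CloudWatch RUM also publishes its Web Vitals to CloudWatch metrics, so you can snapshot the front-end baseline the same way. The sketch below assumes the app monitor sends metrics to the `AWS/RUM` namespace and uses `PetSite` as a placeholder application name.

```python
# Sketch: baseline the Web Vitals (LCP, FID, CLS) that CloudWatch RUM
# publishes as CloudWatch metrics. "PetSite" is a placeholder app monitor
# name; adjust the dimension value to match your RUM configuration.
from datetime import datetime, timedelta, timezone

WEB_VITALS = [
    "WebVitalsLargestContentfulPaint",
    "WebVitalsFirstInputDelay",
    "WebVitalsCumulativeLayoutShift",
]

def rum_query(app_name, metric_name):
    """Build GetMetricStatistics parameters for a RUM Web Vitals metric."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RUM",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "application_name", "Value": app_name}],
        "StartTime": now - timedelta(days=3),  # match the dashboard window
        "EndTime": now,
        "Period": 3600,  # hourly buckets
        "Statistics": ["Average"],
    }

# import boto3
# cw = boto3.client("cloudwatch")
# for vital in WEB_VITALS:
#     resp = cw.get_metric_statistics(**rum_query("PetSite", vital))
#     print(vital, [d["Average"] for d in resp["Datapoints"]])
```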

The image shows an AWS CloudWatch dashboard displaying metrics related to page load times and cumulative layout shift, with graphs indicating performance over several days in July 2024. The sidebar includes options for logs, metrics, and application signals.


Next Steps

In the next demo, we’ll execute our FIS memory-stress experiment and revisit these dashboards to observe how injected faults affect cluster health, service performance, and user experience.
