Demo Memory Stress on EKS Part 4

In this fourth installment of our chaos engineering series, we analyze the impact of injecting memory stress into a Kubernetes pod on Amazon EKS using AWS Fault Injection Simulator (FIS). Over a nine-minute interval, we applied a controlled memory fault to the “PetSite” service and collected performance metrics across multiple AWS monitoring tools.
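
For reference, an experiment like this can be started programmatically once a FIS experiment template exists. The boto3 sketch below is a minimal illustration, assuming a template that targets the PetSite pods with a memory-stress action; the template ID and tag values are placeholders.

```python
import boto3

# Hypothetical example: start a pre-created FIS experiment template that
# targets the PetSite pods (e.g., with the aws:eks:pod-memory-stress action).
fis = boto3.client("fis")

response = fis.start_experiment(
    experimentTemplateId="EXT123456789EXAMPLE",  # placeholder template ID
    tags={"Name": "petsite-memory-stress"},
)

# Print the experiment ID and its current state (e.g., "initiating").
experiment = response["experiment"]
print(experiment["id"], experiment["state"]["status"])
```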

Note

Before running any FIS experiment, ensure your IAM role has the necessary permissions for AWS FIS, CloudWatch, and X-Ray. Review the AWS FIS documentation for setup details.
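
As a rough illustration, the sketch below creates a policy of the kind involved. The statement list and policy name are illustrative placeholders, not a complete or least-privilege policy; tighten the actions and resources to match your own experiment.

```python
import json
import boto3

# Illustrative (not exhaustive) policy granting access to run FIS experiments
# and read the CloudWatch and X-Ray data used in this demo.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "fis:StartExperiment",
                "fis:GetExperiment",
                "fis:ListExperiments",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "cloudwatch:GetMetricStatistics",
                "xray:GetServiceGraph",
                "xray:GetTraceSummaries",
            ],
            "Resource": "*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="fis-memory-stress-demo",  # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)
```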

1. PetSite Trace Map

We start by examining the X-Ray service map for our PetSite EC2 instance. The average latency rose from roughly 90 ms to 120 ms, and the overall request count declined slightly—evidence that the memory fault impacted service responsiveness.

The image shows an AWS CloudWatch X-Ray console displaying a trace map and metrics for an EC2 instance named "PetSite," including latency, requests, and fault rates over a selected time period.

| Metric          | Baseline | Under Memory Stress |
|-----------------|----------|---------------------|
| Average Latency | ~90 ms   | ~120 ms             |
| Request Count   | Normal   | Decreased           |

While the increase in latency is measurable, our distributed Kubernetes architecture absorbs much of the fault without cascading failures.
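
The same latency and request-count figures can be pulled from the X-Ray service graph API. The sketch below is a minimal example that assumes the default AWS region and credentials, and simply averages response time per service node over the nine-minute window.

```python
import boto3
from datetime import datetime, timedelta, timezone

# Fetch the X-Ray service graph for the last 9 minutes and print the
# average latency and request count for each service node.
xray = boto3.client("xray")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=9)

graph = xray.get_service_graph(StartTime=start, EndTime=end)

for service in graph["Services"]:
    stats = service.get("SummaryStatistics", {})
    count = stats.get("TotalCount", 0)
    if count:
        # TotalResponseTime is reported in seconds; convert to milliseconds.
        avg_latency_ms = 1000 * stats["TotalResponseTime"] / count
        print(f'{service.get("Name")}: {avg_latency_ms:.0f} ms over {count} requests')
```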

2. Container Insights

Next, we review pod-level resource metrics in CloudWatch Container Insights. The table below summarizes CPU and memory utilization before and during the experiment.

| Resource        | Baseline Utilization | Under Memory Stress |
|-----------------|----------------------|---------------------|
| CPU Utilization | 0.29 cores           | 1.57 cores          |
| Memory Usage    | Normal               | Noticeable increase |

These results confirm the FIS memory injection was successful. No pod restarts occurred, indicating adequate headroom and resilience in our EKS cluster.
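
If you prefer to pull these numbers outside the console, the sketch below queries the same pod-level metrics from the ContainerInsights CloudWatch namespace; the cluster, Kubernetes namespace, and pod names are placeholders for your own environment.

```python
import boto3
from datetime import datetime, timedelta, timezone

# Query average pod CPU and memory utilization from Container Insights
# (CloudWatch namespace "ContainerInsights") over the experiment window.
cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=9)

dimensions = [
    {"Name": "ClusterName", "Value": "PetSite"},  # placeholder cluster name
    {"Name": "Namespace", "Value": "default"},    # placeholder K8s namespace
    {"Name": "PodName", "Value": "petsite"},      # placeholder pod name
]

for metric in ("pod_cpu_utilization", "pod_memory_utilization"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="ContainerInsights",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], round(point["Average"], 2))
```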

3. Real User Monitoring (RUM)

To assess real-world impact, we analyzed CloudWatch RUM web vitals. The proportion of “frustrating” user experiences inched up from 1.3% to 1.4%, still within our acceptable threshold.

The image shows an AWS CloudWatch dashboard displaying web vitals, including metrics for "Largest Contentful Paint" and "First Input Delay," with graphs indicating performance over time. The data is categorized into positive, tolerable, and frustrating experiences.

Note

Set your own RUM thresholds based on application SLAs. A small increase in “frustrating” sessions may be acceptable if overall availability remains high.
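
For automated checks against those thresholds, the web vitals can also be read back from CloudWatch. The sketch below assumes the RUM app monitor publishes its default metrics to the AWS/RUM namespace and uses a placeholder monitor name; adjust the metric and dimension names to whatever your monitor actually emits.

```python
import boto3
from datetime import datetime, timedelta, timezone

# Pull the average Largest Contentful Paint reported by CloudWatch RUM during
# the experiment window. The namespace, metric, and dimension names assume the
# default CloudWatch metric destination; "petsite" is a placeholder monitor name.
cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=9)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RUM",
    MetricName="WebVitalsLargestContentfulPaint",
    Dimensions=[{"Name": "application_name", "Value": "petsite"}],
    StartTime=start,
    EndTime=end,
    Period=60,
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.0f} ms')
```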

Conclusion

Despite a clear uptick in latency and resource utilization during the AWS FIS memory stress test, end-user impact remained minimal. This demonstrates the robustness of a well-architected, distributed Kubernetes cluster on Amazon EKS.
