Hello and welcome back. In this lesson we focus on alerting and monitoring for managed relational databases on Amazon RDS. Monitoring is critical: databases power most applications, and a slow or unavailable database quickly degrades user experience and increases costs. A layered monitoring strategy that combines metrics, logs, and query-level visibility helps you detect issues early, diagnose root causes, and right-size infrastructure.

Why monitor a cloud database?

  • Proactive issue detection: surface small problems before they become outages with alerts and anomaly detection.
  • Performance optimization: identify slow SQL, missing indexes, and resource bottlenecks to improve latency and throughput.
  • Availability and reliability: track failover events, replica lag, and health checks to maintain SLAs.
  • Capacity planning and cost control: use trends (CPU, memory, connections, I/O) to scale appropriately and reduce waste.
Example scenario
  • If CPU utilization on a general-purpose DB instance consistently hovers around 80%, monitoring data gives you the evidence to move to a compute-optimized instance class or to offload reads to read replicas. Trends and correlated query data make these decisions measurable rather than speculative.
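One way to put a number on "consistently hovers around 80%" is to measure what share of recent CPU datapoints sit above that level. The sketch below is an illustration, not part of the original lesson: `share_above` is a hypothetical helper name, and the sample values in the usage comment are made up.

```bash
#!/bin/sh
# share_above THRESHOLD
# Reads numeric datapoints from stdin (space-, tab-, or newline-separated) and
# prints the integer percentage of them that are strictly above THRESHOLD.
# Hypothetical helper for illustration; not an AWS tool.
share_above() {
  awk -v t="$1" '
    { for (i = 1; i <= NF; i++) { n++; if ($i + 0 > t + 0) hits++ } }
    END { if (n == 0) print 0; else printf "%d\n", (hits * 100) / n }
  '
}

# Usage with made-up hourly averages (three of the four samples exceed 80):
#   printf '82 79 85 81\n' | share_above 80
#
# With real data, the input could come from CloudWatch. The time range below
# is a placeholder:
#   aws cloudwatch get-metric-statistics \
#     --namespace AWS/RDS --metric-name CPUUtilization \
#     --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
#     --start-time 2024-01-01T00:00:00Z --end-time 2024-01-15T00:00:00Z \
#     --period 3600 --statistics Average \
#     --query 'Datapoints[].Average' --output text | share_above 80
```

A high share over a week or two suggests a sizing problem rather than a one-off spike, which is the kind of evidence the scenario describes.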
Enable Performance Insights and relevant query logging (slow query logs, pg_stat_statements) to gain SQL-level context. These features improve troubleshooting but can add cost and storage overhead depending on retention.

What to monitor

Track a mix of infrastructure and database-specific metrics, plus logs and query diagnostics.
| Metric / Signal | Purpose / What it indicates | Recommended alert guidance |
|---|---|---|
| CPU utilization | High sustained CPU -> consider instance type or query optimization | Alert when > 75% for 5–15 minutes |
| Memory usage & swap | Memory pressure often leads to latency and swapping (requires Enhanced Monitoring for OS-level metrics) | Alert when free memory is low or swap is in use |
| Read/write IOPS & latency | Storage contention causes slow queries and timeouts | Alert when I/O latency or queue depth spikes |
| Disk free space & throughput | Prevent outages from full disks | Alert when free space < X% or < Y GB |
| DB connections & churn | Too many connections or connection storms can exhaust resources | Alert on sustained connection count near max |
| Query latencies / slow queries | Identifies long-running or inefficient statements | Capture slow query logs and alert on threshold breaches |
| Failover / replication health | Replica lag or failed failovers affect HA and read scaling | Alert on replica lag > threshold or instance failover events |
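The free-space guidance mixes a percentage and an absolute size, while the CloudWatch `FreeStorageSpace` metric is reported in bytes. As a rough sketch, the two limits can be converted into a single byte threshold. `storage_threshold_bytes` is a hypothetical helper, and the 10% and 20 GiB values are example stand-ins for X and Y, not recommendations from this lesson.

```bash
#!/bin/sh
# storage_threshold_bytes ALLOCATED_GIB MIN_FREE_PCT MIN_FREE_GIB
# Prints an alarm threshold in bytes: the larger of "MIN_FREE_PCT percent of
# allocated storage" and "MIN_FREE_GIB", so one alarm fires when either limit
# is crossed. Hypothetical helper; integer arithmetic only.
storage_threshold_bytes() (
  gib=$((1024 * 1024 * 1024))
  pct_bytes=$(( $1 * $gib * $2 / 100 ))
  abs_bytes=$(( $3 * $gib ))
  if [ "$pct_bytes" -gt "$abs_bytes" ]; then
    echo "$pct_bytes"
  else
    echo "$abs_bytes"
  fi
)

# Example: 500 GiB allocated, alert below 10% free or below 20 GiB free.
# The alarm name is a placeholder.
#   aws cloudwatch put-metric-alarm \
#     --alarm-name "RDS-Low-Free-Storage" \
#     --metric-name FreeStorageSpace --namespace AWS/RDS \
#     --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
#     --statistic Minimum --period 300 --evaluation-periods 1 \
#     --threshold "$(storage_threshold_bytes 500 10 20)" \
#     --comparison-operator LessThanThreshold \
#     --alarm-actions arn:aws:sns:us-east-1:123456789012:notify-team
```

Taking the larger of the two values keeps a single alarm per instance while still honoring both conditions.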

AWS tools for monitoring RDS

  • Amazon CloudWatch
    Collects core RDS metrics (CPU, IOPS, latency, free storage, DB connections). Use CloudWatch Alarms and Dashboards for alerting and visual summaries. See CloudWatch metrics for Amazon RDS: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/monitoring-cloudwatch.html
  • RDS Enhanced Monitoring
    Provides host-level (OS) metrics such as memory and swap. These metrics are published to CloudWatch Logs and are useful for diagnosing memory pressure and OS-level signals. More: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html
  • RDS Performance Insights
    Offers a DB-optimized dashboard showing load, top SQL, waits, and historical patterns. Enable per instance for deep SQL-level visibility. More: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html
  • Database engine logs and query instrumentation
    • MySQL: general and slow query logs
    • PostgreSQL: pg_stat_statements, log_min_duration_statement
    Export logs to CloudWatch Logs or S3 for retention, searching, and integration with alerting.
  • RDS Recommendations & Trusted Advisor
    The RDS console may surface sizing or configuration recommendations, and AWS Trusted Advisor includes checks that cover RDS; review both to improve cost and performance.
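Before enabling anything, it can help to check which of these tools are already switched on for an instance. The following is a minimal sketch: `monitoring_status` is a hypothetical wrapper name, and the output keys on the left of each colon in the `--query` expression are labels chosen for this example. The fields on the right (`MonitoringInterval`, `PerformanceInsightsEnabled`, `EnabledCloudwatchLogsExports`) come from the `describe-db-instances` response.

```bash
#!/bin/sh
# monitoring_status DB_INSTANCE_ID
# Prints the Enhanced Monitoring interval in seconds (0 means disabled),
# whether Performance Insights is enabled, and which log types are exported
# to CloudWatch Logs. Needs AWS CLI credentials allowed to describe the instance.
monitoring_status() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].{EnhancedMonitoringSeconds:MonitoringInterval,PerformanceInsights:PerformanceInsightsEnabled,LogExports:EnabledCloudwatchLogsExports}' \
    --output json
}

# Usage (my-db-instance is a placeholder identifier):
#   monitoring_status my-db-instance
```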

Quick examples (AWS CLI)

Create a CloudWatch alarm for CPU utilization:
bash
aws cloudwatch put-metric-alarm \
  --alarm-name "RDS-High-CPU" \
  --metric-name CPUUtilization \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
  --statistic Average \
  --period 300 --evaluation-periods 3 \
  --threshold 75 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:notify-team
Enable Enhanced Monitoring on a DB instance:
bash
aws rds modify-db-instance \
  --db-instance-identifier my-db-instance \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::123456789012:role/rds-monitoring-role \
  --apply-immediately
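Performance Insights and log export, both mentioned earlier in this lesson, can be enabled with the same `modify-db-instance` command. The sketch below is hedged: the `run` dry-run helper is a hypothetical convenience that prints each command by default, the 7-day retention value is an example, and the slow query export assumes a MySQL instance whose parameter group already has `slow_query_log` enabled with `log_output` set to `FILE`.

```bash
#!/bin/sh
# run COMMAND...
# With DRY_RUN=1 (the default in this sketch) only prints the command so it can
# be reviewed; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

DB_ID="my-db-instance"   # placeholder identifier

# Turn on Performance Insights with 7 days of retention (example value).
run aws rds modify-db-instance \
  --db-instance-identifier "$DB_ID" \
  --enable-performance-insights \
  --performance-insights-retention-period 7 \
  --apply-immediately

# Publish the MySQL slow query log to CloudWatch Logs.
# For PostgreSQL the log type would be "postgresql" instead of "slowquery".
run aws rds modify-db-instance \
  --db-instance-identifier "$DB_ID" \
  --cloudwatch-logs-export-configuration '{"EnableLogTypes":["slowquery"]}' \
  --apply-immediately
```

Printing first is a deliberate choice here: `--apply-immediately` changes a live instance, so reviewing the exact command before running it with `DRY_RUN=0` is a cheap safeguard.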

Alerts and playbooks

Design alerts to be actionable and reduce noise:
  • Prefer multi-period or anomaly-based alerts to avoid false positives (e.g., sustained 5–15 minutes).
  • Correlate alerts across metrics (CPU + I/O + slow queries) to prioritize root cause.
  • Create runbooks for common alerts: slow queries, replica lag, full storage, connection storms, and failover events.
  • Channel alerts to the right teams (DBA, platform, on-call) and include recovery steps.
Enabling Performance Insights, increased log retention, or high-frequency monitoring can increase costs. Review retention periods and monitoring intervals to balance visibility and expense.
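The advice to correlate alerts across metrics can be approximated with a CloudWatch composite alarm, which changes state based on a rule over other alarms. This is a sketch under assumptions: `composite_rule` is a hypothetical helper, `RDS-High-Read-Latency` is a hypothetical second alarm on the `ReadLatency` metric, and the composite alarm name is a placeholder.

```bash
#!/bin/sh
# composite_rule ALARM_NAME...
# Joins alarm names into a composite-alarm rule that is in ALARM only when
# every listed alarm is in ALARM. Hypothetical helper for illustration.
composite_rule() (
  rule=""
  for name in "$@"; do
    if [ -z "$rule" ]; then
      rule="ALARM(\"$name\")"
    else
      rule="$rule AND ALARM(\"$name\")"
    fi
  done
  printf '%s\n' "$rule"
)

# Usage: page the team only when CPU and read latency are both alarming.
#   aws cloudwatch put-composite-alarm \
#     --alarm-name "RDS-CPU-and-IO-Pressure" \
#     --alarm-rule "$(composite_rule RDS-High-CPU RDS-High-Read-Latency)" \
#     --alarm-actions arn:aws:sns:us-east-1:123456789012:notify-team
```

Routing notifications from the composite alarm, and leaving the underlying alarms without actions, is one way to cut noise while keeping each signal visible on dashboards.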

Monitoring goals (summary)

  • Detect anomalies before users are impacted.
  • Reduce mean time to resolution (MTTR) using correlated metrics and query-level data.
  • Maintain availability through automated failover monitoring and replica health checks.
  • Optimize capacity and costs with trend-driven scaling decisions.
To practice these concepts, run hands-on labs that:
  • Enable Enhanced Monitoring and Performance Insights on an RDS instance.
  • Configure CloudWatch Alarms and Dashboards.
  • Capture slow query logs and analyze top offenders.
  • Execute an incident runbook for a simulated replication lag or CPU spike.
[Figure: four AWS RDS monitoring and optimization steps: set up CloudWatch integrations, enable AWS RDS Performance Insights, capture long-running query logs, and adapt RDS recommendations.]
Next steps: enable monitoring on a test instance, create a dashboard and alerts, and review slow query reports to build a prioritized remediation list.