Hello and welcome back. In this lesson we focus on alerting and monitoring for managed relational databases (AWS RDS). Monitoring is critical: databases power most applications, and slow or unavailable databases quickly degrade user experience and increase costs. A layered monitoring strategy — metrics, logs, and query-level visibility — helps you detect issues early, diagnose root causes, and right-size infrastructure.

Why monitor a cloud database?

  • Proactive issue detection: surface small problems before they become outages with alerts and anomaly detection.
  • Performance optimization: identify slow SQL, missing indexes, and resource bottlenecks to improve latency and throughput.
  • Availability and reliability: track failover events, replica lag, and health checks to maintain SLAs.
  • Capacity planning and cost control: use trends (CPU, memory, connections, I/O) to scale appropriately and reduce waste.
Example scenario
  • If DB CPU utilization consistently hovers around 80% on a general-purpose instance, that trend is evidence for moving to a compute-optimized instance or scaling out with read replicas. Trends and correlated query data make these decisions measurable rather than speculative.
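
One way to make "consistently hovers around 80%" measurable is to check what fraction of recent datapoints sit at or above a threshold. The sketch below assumes the CPU averages have already been fetched (for example from CloudWatch); the function name, the 80% threshold, the 90% fraction, and the sample data are all illustrative assumptions.

```python
def is_sustained_high(cpu_samples, threshold=80.0, min_fraction=0.9):
    """Return True when at least `min_fraction` of samples are >= `threshold`."""
    if not cpu_samples:
        return False
    high = sum(1 for value in cpu_samples if value >= threshold)
    return high / len(cpu_samples) >= min_fraction


# Hypothetical CPU averages (percent), one value per sampling interval.
steady = [82, 79, 81, 84, 80, 83, 85, 81, 80, 82, 86, 81]
spiky = [20, 25, 95, 22, 18, 90, 21, 24, 19, 23, 20, 22]

print(is_sustained_high(steady))  # sustained load: a resizing candidate
print(is_sustained_high(spiky))   # short spikes: look at queries first
```

A sustained result points toward capacity changes, while isolated spikes are usually better investigated at the query level before resizing.
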
Enable Performance Insights and relevant query logging (slow query logs, pg_stat_statements) to gain SQL-level context. These features improve troubleshooting but can add cost and storage overhead depending on retention.

What to monitor

Track a mix of infrastructure and database-specific metrics, plus logs and query diagnostics.
| Metric / Signal | Purpose / What it indicates | Recommended alert guidance |
| --- | --- | --- |
| CPU utilization | High sustained CPU -> consider instance type or query optimization | Alert when > 75% for 5–15 minutes |
| Memory usage & swap | Memory pressure often leads to latency and swapping (requires Enhanced Monitoring for OS-level metrics) | Alert when free memory is low or swap is in use |
| Read/write IOPS & latency | Storage contention causes slow queries and timeouts | Alert when I/O latency or queue depth spikes |
| Disk free space & throughput | Prevent outages from full disks | Alert when free space < X% or < Y GB |
| DB connections & churn | Too many connections or connection storms can exhaust resources | Alert on sustained connection count near max |
| Query latencies / slow queries | Identifies long-running or inefficient statements | Capture slow query logs and alert on threshold breaches |
| Failover / replication health | Replica lag or failed failovers affect HA and read scaling | Alert on replica lag > threshold or instance failover events |
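
The "free space < X% or < Y GB" guidance can be turned into a single alarm threshold, because "below A or below B" is the same as "below the larger of A and B". The sketch below works under these assumptions: the CloudWatch `FreeStorageSpace` metric is reported in bytes, allocated storage is known in GiB, and the resulting dictionary is meant to be passed as keyword arguments to boto3's `put_metric_alarm` (no AWS call is made here). The helper names, alarm name, and sample values are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def free_storage_threshold_bytes(allocated_gib, min_free_pct, min_free_gib):
    """Bytes below which either the percentage rule or the absolute rule is breached."""
    pct_bytes = allocated_gib * GIB * min_free_pct // 100
    floor_bytes = min_free_gib * GIB
    return max(pct_bytes, floor_bytes)


def storage_alarm_params(db_instance_id, threshold_bytes, sns_topic_arn):
    """Build keyword arguments for a low-free-storage alarm (assumed alarm name)."""
    return {
        "AlarmName": f"RDS-Low-Storage-{db_instance_id}",
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Example: 200 GiB allocated, alert below 10% free or below 10 GiB free.
threshold = free_storage_threshold_bytes(200, 10, 10)
params = storage_alarm_params(
    "my-db-instance", threshold, "arn:aws:sns:us-east-1:123456789012:notify-team"
)
print(threshold // GIB, "GiB")
```

On a large volume the percentage rule dominates; on a small one the absolute floor does, which is why both are worth specifying.
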

AWS tools for monitoring RDS

  • Amazon CloudWatch
    Collects core RDS metrics (CPU, IOPS, latency, free storage, DB connections). Use CloudWatch Alarms and Dashboards for alerting and visual summaries. See CloudWatch metrics for Amazon RDS: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/monitoring-cloudwatch.html
  • RDS Enhanced Monitoring
    Provides host-level (OS) metrics such as memory and swap. These metrics are published to CloudWatch Logs and are useful for diagnosing memory pressure and OS-level signals. More: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html
  • RDS Performance Insights
    Offers a DB-optimized dashboard showing load, top SQL, waits, and historical patterns. Enable per instance for deep SQL-level visibility. More: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html
  • Database engine logs and query instrumentation
    • MySQL: general and slow query logs
    • PostgreSQL: pg_stat_statements, log_min_duration_statement
      Export logs to CloudWatch Logs or S3 for retention, searching, and integration with alerting.
  • RDS Recommendations & Trusted Advisor
    The RDS console and Trusted Advisor may surface sizing or configuration recommendations; review these to improve cost and performance.
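
For the log export mentioned in the list above, a small helper can validate the requested log types and show where the exported logs should appear. This is a sketch under assumptions: the log-type lists are a subset that may differ by engine version, exported logs are expected under `/aws/rds/instance/<instance>/<log type>` in CloudWatch Logs, and the returned parameters are intended for boto3's `modify_db_instance` (not called here). The function name is hypothetical.

```python
# Subset of exportable log types per engine; check the RDS docs for your version.
EXPORTABLE_LOG_TYPES = {
    "mysql": ("audit", "error", "general", "slowquery"),
    "postgres": ("postgresql", "upgrade"),
}


def log_export_plan(db_instance_id, engine, wanted):
    """Return modify parameters and the expected CloudWatch log group names."""
    allowed = EXPORTABLE_LOG_TYPES[engine]
    unknown = [log_type for log_type in wanted if log_type not in allowed]
    if unknown:
        raise ValueError(f"Log types not exportable for {engine}: {unknown}")
    return {
        "modify_params": {
            "DBInstanceIdentifier": db_instance_id,
            "CloudwatchLogsExportConfiguration": {"EnableLogTypes": list(wanted)},
        },
        "log_groups": [
            f"/aws/rds/instance/{db_instance_id}/{log_type}" for log_type in wanted
        ],
    }


plan = log_export_plan("my-db-instance", "mysql", ["slowquery", "error"])
print(plan["log_groups"])
```
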

Quick examples (AWS CLI)

Create a CloudWatch alarm for CPU utilization:
```bash
# Alarm when average CPU is >= 75% for three consecutive 5-minute periods (15 minutes).
aws cloudwatch put-metric-alarm \
  --alarm-name "RDS-High-CPU" \
  --metric-name CPUUtilization \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
  --statistic Average \
  --period 300 --evaluation-periods 3 \
  --threshold 75 --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:notify-team
```
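
To see what `--period 300 --evaluation-periods 3` means for the alarm above, the following sketch models the evaluation in a simplified way: the alarm fires only when the most recent three 5-minute averages all meet the threshold. It deliberately ignores missing-data handling and "M out of N" datapoint settings, and the function names are illustrative.

```python
def alarm_fires(datapoints, threshold=75.0, evaluation_periods=3):
    """True if the last `evaluation_periods` datapoints are all >= threshold."""
    if len(datapoints) < evaluation_periods:
        return False
    recent = datapoints[-evaluation_periods:]
    return all(value >= threshold for value in recent)


def minutes_to_alarm(period_seconds=300, evaluation_periods=3):
    """Minimum sustained breach duration, in minutes, before the alarm can fire."""
    return period_seconds * evaluation_periods // 60


print(alarm_fires([60, 90, 70, 80]))  # a dip in the window resets the count
print(alarm_fires([60, 76, 75, 91]))  # three consecutive breaches
print(minutes_to_alarm())
```

Requiring consecutive breaches is what keeps a single short spike from paging anyone.
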
Enable Enhanced Monitoring on a DB instance:
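```bash
# Collect OS-level metrics every 60 seconds; the IAM role must allow RDS
# to publish Enhanced Monitoring data to CloudWatch Logs.
aws rds modify-db-instance \
  --db-instance-identifier my-db-instance \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::123456789012:role/rds-monitoring-role \
  --apply-immediately
```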
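
Enhanced Monitoring accepts only specific collection intervals (0 disables it; otherwise 1, 5, 10, 15, 30, or 60 seconds), and a monitoring role is needed whenever the interval is non-zero. A minimal validation sketch, assuming the result is later passed to boto3's `modify_db_instance`; the function name is hypothetical and no AWS call is made.

```python
VALID_MONITORING_INTERVALS = (0, 1, 5, 10, 15, 30, 60)  # seconds; 0 disables


def enhanced_monitoring_params(db_instance_id, interval_seconds, role_arn=None):
    """Validate inputs and build keyword arguments for enabling Enhanced Monitoring."""
    if interval_seconds not in VALID_MONITORING_INTERVALS:
        raise ValueError(
            f"interval must be one of {VALID_MONITORING_INTERVALS}, got {interval_seconds}"
        )
    if interval_seconds != 0 and not role_arn:
        raise ValueError("a monitoring role ARN is required when the interval is non-zero")
    params = {
        "DBInstanceIdentifier": db_instance_id,
        "MonitoringInterval": interval_seconds,
        "ApplyImmediately": True,
    }
    if role_arn:
        params["MonitoringRoleArn"] = role_arn
    return params


em_params = enhanced_monitoring_params(
    "my-db-instance", 60, "arn:aws:iam::123456789012:role/rds-monitoring-role"
)
print(em_params["MonitoringInterval"])
```

Shorter intervals give finer-grained OS data but generate more log volume, so the interval is also a cost decision.
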

Alerts and playbooks

Design alerts to be actionable and reduce noise:
  • Prefer multi-period or anomaly-based alerts to avoid false positives (e.g., sustained 5–15 minutes).
  • Correlate alerts across metrics (CPU + I/O + slow queries) to prioritize root cause.
  • Create runbooks for common alerts: slow queries, replica lag, full storage, connection storms, and failover events.
  • Channel alerts to the right teams (DBA, platform, on-call) and include recovery steps.
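
The correlation and routing ideas in the list above can be sketched as an ordered rule table: the first rule whose required alerts are all firing decides the runbook and the owning team. Every alert name, runbook name, team, and rule here is a hypothetical placeholder, not a recommended mapping.

```python
# Ordered most specific first: (required alerts, runbook, team). All hypothetical.
CORRELATION_RULES = [
    ({"high_cpu", "slow_queries"}, "inefficient-queries", "dba"),
    ({"high_cpu", "high_io_latency"}, "storage-contention", "platform"),
    ({"replica_lag"}, "replica-lag", "dba"),
    ({"low_free_storage"}, "full-storage", "platform"),
    ({"high_connections"}, "connection-storm", "on-call"),
]


def triage(firing_alerts):
    """Pick a runbook and team from the set of currently firing alerts."""
    firing = set(firing_alerts)
    for required, runbook, team in CORRELATION_RULES:
        if required <= firing:  # every required alert is firing
            return {"runbook": runbook, "team": team}
    return {"runbook": "general-investigation", "team": "on-call"}


print(triage(["high_cpu", "slow_queries", "high_io_latency"]))
print(triage(["high_cpu"]))
```

Keeping the rules as data rather than nested conditionals makes it easier to review them alongside the runbooks they point to.
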
Enabling Performance Insights, increased log retention, or high-frequency monitoring can increase costs. Review retention settings and monitoring intervals to balance visibility and expense.

Monitoring goals (summary)

  • Detect anomalies before users are impacted.
  • Reduce mean time to resolution (MTTR) using correlated metrics and query-level data.
  • Maintain availability through automated failover monitoring and replica health checks.
  • Optimize capacity and costs with trend-driven scaling decisions.
To practice these concepts, run hands-on labs that:
  • Enable Enhanced Monitoring and Performance Insights on an RDS instance.
  • Configure CloudWatch Alarms and Dashboards.
  • Capture slow query logs and analyze top offenders.
  • Execute an incident runbook for a simulated replication lag or CPU spike.
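
For the "analyze top offenders" lab, a rough way to rank statements is to parse the `# Query_time:` headers in a MySQL slow query log and total the time per statement shape. This sketch assumes the standard MySQL slow log text format and single-line statements; the sample log, the function name, and the simple numeric-literal fingerprinting are illustrative, and dedicated tools handle far more cases.

```python
import re
from collections import defaultdict

QUERY_TIME_RE = re.compile(r"^# Query_time: (\d+\.?\d*)")


def top_slow_queries(log_text, limit=3):
    """Return (statement shape, count, total seconds), ranked by total time."""
    totals = defaultdict(lambda: [0, 0.0])  # shape -> [count, total seconds]
    current_time = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = QUERY_TIME_RE.match(line)
        if match:
            current_time = float(match.group(1))
        elif not line or line.startswith("#"):
            continue  # other header lines carry no statement text
        elif line.upper().startswith(("SET TIMESTAMP", "USE ")):
            continue  # bookkeeping statements written by the server
        elif current_time is not None:
            # Replace numeric literals so similar statements group together.
            shape = re.sub(r"\b\d+\b", "?", line.rstrip(";"))
            totals[shape][0] += 1
            totals[shape][1] += current_time
            current_time = None
    ranked = sorted(totals.items(), key=lambda item: item[1][1], reverse=True)
    return [(shape, count, round(total, 3)) for shape, (count, total) in ranked[:limit]]


# Hypothetical log excerpt for illustration.
SAMPLE_LOG = """\
# Time: 2024-01-01T00:00:00.000000Z
# User@Host: app[app] @  [10.0.0.5]  Id:    12
# Query_time: 3.200000  Lock_time: 0.000100 Rows_sent: 10  Rows_examined: 500000
SET timestamp=1704067200;
SELECT * FROM orders WHERE customer_id = 42;
# Query_time: 1.500000  Lock_time: 0.000050 Rows_sent: 4  Rows_examined: 480000
SET timestamp=1704067260;
SELECT * FROM orders WHERE customer_id = 7;
# Query_time: 2.000000  Lock_time: 0.000000 Rows_sent: 1  Rows_examined: 900000
SELECT COUNT(*) FROM events;
"""

for shape, count, total in top_slow_queries(SAMPLE_LOG):
    print(f"{total:>6.2f}s  x{count}  {shape}")
```

Ranking by total time rather than worst single execution tends to surface the frequent, moderately slow statements that dominate load.
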
[Figure: slide showing four AWS RDS monitoring/optimization steps: set up CloudWatch integrations, enable AWS RDS Performance Insights, capture long-running query logs, and adopt RDS recommendations.]
Next steps: enable monitoring on a test instance, create a dashboard and alerts, and review slow query reports to build a prioritized remediation list.
