CloudWatch Alarms and Metrics: Setting Up Meaningful Monitoring on AWS
Most teams set up either too many low-signal CloudWatch alarms or too few meaningful ones. Here's QuickInfra's opinionated approach to setting up monitoring that actually tells you something useful when something breaks.
QuickInfra Team
QuickInfra Cloud Solution
Good monitoring tells you something is wrong before your users do, and gives you enough context to know where to look. Bad monitoring either floods you with noise until you start ignoring alerts, or misses real problems because the thresholds are too permissive. QuickInfra's monitoring configuration aims to be the good kind: few alarms, each of which means something.
The Signal vs Noise Problem
An EC2 instance sending a CPU alarm every time it spikes to 80% during normal batch processing is a noise alarm. It fires frequently, nothing bad happens, and the team learns to ignore it. When the instance eventually has a genuine problem — stuck process, runaway thread — the alarm fires again and nobody acts on it.
QuickInfra's approach: alarm on anomaly, not on threshold. For CPU, the alarm fires when current utilisation deviates significantly from the 30-day baseline for that specific instance at that time of day — not when it crosses a fixed number.
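As a rough illustration of what "alarm on anomaly" can look like in practice, the sketch below builds the arguments for a CloudWatch anomaly detection alarm using boto3's `put_metric_alarm` and the `ANOMALY_DETECTION_BAND` metric math function. It is not QuickInfra's actual configuration: the alarm name, the band width of 2, the 5-minute period, and the three evaluation periods are all assumptions chosen for the example.

```python
def cpu_anomaly_alarm_params(instance_id, sns_topic_arn, band_width=2.0):
    """Build put_metric_alarm kwargs for a CPU anomaly detection alarm.

    band_width, the period and EvaluationPeriods are illustrative
    assumptions, not values taken from QuickInfra.
    """
    return {
        "AlarmName": f"cpu-anomaly-{instance_id}",  # hypothetical naming scheme
        # Fire only when CPU rises above the upper edge of the expected band.
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        # Points at the band expression below instead of a fixed Threshold.
        "ThresholdMetricId": "band",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            {
                "Id": "cpu",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [
                            {"Name": "InstanceId", "Value": instance_id}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(cpu, {band_width})",
                "Label": "Expected CPU range",
                "ReturnData": True,
            },
        ],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Note that the alarm has no fixed `Threshold` at all. A wider band (a larger second argument to `ANOMALY_DETECTION_BAND`) makes the alarm less sensitive.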
The Standard Alarm Set
For each EC2 instance, QuickInfra configures:
- CPU Utilisation — anomaly detection model
- StatusCheckFailed — fires when the EC2 status check fails (immediate action required)
- Disk Utilisation — threshold at 85%, predictive alarm if trending to 85% within 6 hours (requires CloudWatch Agent)
- Memory Utilisation — threshold at 80% (requires CloudWatch Agent)
- Network In/Out — anomaly detection for unexpected traffic patterns
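The predictive disk alarm in the list above can be pictured as a simple trend extrapolation. The sketch below fits a straight line to recent disk usage samples and estimates how long until usage reaches 85%. The linear fit is an assumption made for illustration, as are the function names; the source does not say how QuickInfra computes the trend.

```python
def hours_until_threshold(samples, threshold=85.0):
    """Estimate hours from the last sample until usage reaches threshold.

    samples: list of (hours_since_start, used_percent) tuples, oldest first.
    Returns None when usage is flat or falling, or when there is too little
    data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope, in percent per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    last_t, last_u = samples[-1]
    if last_u >= threshold:
        return 0.0
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    crossing_t = (threshold - intercept) / slope
    return max(crossing_t - last_t, 0.0)


def should_warn(samples, threshold=85.0, horizon_hours=6.0):
    """True when the fitted trend reaches threshold within horizon_hours."""
    eta = hours_until_threshold(samples, threshold)
    return eta is not None and eta <= horizon_hours
```

For example, a disk growing steadily by 2 percentage points per hour from 70% would trigger the warning once it passes roughly 73%, well before the fixed 85% threshold fires.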
For RDS:
- FreeStorageSpace — threshold alarm when below 20% of allocated storage
- DatabaseConnections — threshold alarm near max_connections
- CPUUtilization — threshold alarm above 80%
- ReplicaLag — for read replicas, threshold alarm above 30 seconds
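One detail worth spelling out for the RDS list: CloudWatch reports `FreeStorageSpace` in bytes, so a "below 20%" rule has to be converted into a byte threshold from the instance's allocated storage. The sketch below shows that conversion and the resulting `put_metric_alarm` arguments. The alarm name, statistic, period, and evaluation periods are assumptions for the example rather than QuickInfra's settings.

```python
GIB = 1024 ** 3  # bytes per GiB


def free_storage_alarm_params(db_instance_id, allocated_gib, min_free_fraction=0.20):
    """Build put_metric_alarm kwargs for an RDS low-free-storage alarm.

    allocated_gib is the instance's allocated storage in GiB. The
    FreeStorageSpace metric is reported in bytes, so the percentage
    rule is converted to an absolute byte threshold here.
    """
    threshold_bytes = allocated_gib * GIB * min_free_fraction
    return {
        "AlarmName": f"rds-free-storage-{db_instance_id}",  # hypothetical name
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
        ],
        "Statistic": "Minimum",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "LessThanThreshold",
    }
```

Because the threshold is absolute, it needs to be recomputed whenever the allocated storage changes, for instance after a manual resize or storage autoscaling.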
Alert Routing
QuickInfra configures SNS topics as alarm actions and lets you route alert notifications to email, Slack, or PagerDuty. Alarms are tiered by severity — StatusCheckFailed goes to PagerDuty (immediate response), disk usage warning goes to Slack (next business day).
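The tiering described above comes down to a lookup from alarm type to severity, and from severity to an SNS topic. A minimal sketch follows, assuming one topic per destination. The topic ARNs, the severity names, and the alarm keys are hypothetical placeholders, and the subscriptions that connect each topic to PagerDuty or Slack are not shown.

```python
# Hypothetical SNS topics, one per destination.
ROUTES = {
    "critical": "arn:aws:sns:us-east-1:123456789012:alerts-pagerduty",
    "warning": "arn:aws:sns:us-east-1:123456789012:alerts-slack",
}

# Severity per alarm type; the two entries follow the examples in the text.
SEVERITY_BY_ALARM = {
    "StatusCheckFailed": "critical",  # immediate response
    "DiskUtilisation": "warning",     # next business day
}


def alarm_actions_for(alarm_kind, default_severity="warning"):
    """Return the AlarmActions list (SNS topic ARNs) for an alarm type."""
    severity = SEVERITY_BY_ALARM.get(alarm_kind, default_severity)
    return [ROUTES[severity]]
```

The returned list can be passed as the `AlarmActions` argument of `put_metric_alarm`. Defaulting unknown alarm types to the lower tier is a design choice made for this sketch: it keeps a new alarm from paging someone until it has been deliberately classified.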
Dashboard Generation
QuickInfra auto-generates a CloudWatch Dashboard for each Infrastructure Project with the most useful widgets: CPU, memory, disk, and network metrics per instance, plus an alarm status panel showing the current state of all alarms in the project.
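To give a sense of what a generated dashboard contains, the sketch below assembles a CloudWatch dashboard body with one CPU widget per instance and a single alarm status widget. It is a simplified illustration, not QuickInfra's generator: the layout, titles, and default region are assumptions, and only CPU is shown. Memory and disk widgets would follow the same pattern using the metrics published by the CloudWatch Agent.

```python
import json


def build_dashboard_body(instance_ids, alarm_arns, region="us-east-1"):
    """Return a JSON dashboard body: CPU graph per instance plus alarm panel."""
    widgets = []
    for row, instance_id in enumerate(instance_ids):
        widgets.append({
            "type": "metric",
            # Stack the CPU graphs in the left half of the 24-column grid.
            "x": 0, "y": row * 6, "width": 12, "height": 6,
            "properties": {
                "title": f"CPU {instance_id}",
                "region": region,
                "stat": "Average",
                "period": 300,
                "metrics": [
                    ["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]
                ],
            },
        })
    # One alarm status widget listing every alarm in the project.
    widgets.append({
        "type": "alarm",
        "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {"title": "Alarm status", "alarms": list(alarm_arns)},
    })
    return json.dumps({"widgets": widgets})
```

The resulting string can be passed as `DashboardBody` to boto3's `put_dashboard`, together with a `DashboardName`.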