Automation 6 min read 25 August 2025

CloudWatch Alarms and Metrics: Setting Up Meaningful Monitoring on AWS

Most teams set up either too many low-signal CloudWatch alarms or too few alarms to catch real problems. Here's QuickInfra's opinionated approach to setting up monitoring that actually tells you something useful when something breaks.

QuickInfra Team

QuickInfra Cloud Solution

CloudWatch Monitoring AWS Alerts Observability


Good monitoring tells you something is wrong before your users do, and gives you enough context to know where to look. Bad monitoring either floods you with noise until you start ignoring alerts, or misses real problems because the thresholds are too permissive. QuickInfra's monitoring configuration aims for the first kind: few alarms, each of which means something.

The Signal vs Noise Problem

An EC2 instance sending a CPU alarm every time it spikes to 80% during normal batch processing is a noise alarm. It fires frequently, nothing bad happens, and the team learns to ignore it. When the instance eventually has a genuine problem — stuck process, runaway thread — the alarm fires again and nobody acts on it.

QuickInfra's approach: alarm on anomaly, not on threshold. For CPU, the alarm fires when current utilisation deviates significantly from the 30-day baseline for that specific instance at that time of day — not when it crosses a fixed number.
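To make the difference between a fixed threshold and a time-of-day baseline concrete, here is a minimal sketch in Python. It is an illustration of the idea only, not QuickInfra's actual model: the function names, the sample history, and the band width of 2 standard deviations are all assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_baseline(samples):
    """Turn (hour_of_day, cpu_percent) samples into {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    # stdev needs at least two points, so skip hours with too little history.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(baseline, hour, value, band_width=2.0):
    """True when value lies outside mean +/- band_width * stdev for that hour."""
    if hour not in baseline:
        return False  # no history for this hour, so do not alarm
    mu, sigma = baseline[hour]
    return abs(value - mu) > band_width * sigma


# Hypothetical history: a batch job runs at 02:00, the instance idles at 14:00.
history = [(2, 78.0), (2, 82.0), (2, 80.0), (14, 18.0), (14, 22.0), (14, 20.0)]
baseline = hourly_baseline(history)
```

With this history, 81% CPU at 02:00 sits inside the expected band and stays quiet, while the same 81% at 14:00 is flagged. A fixed 80% threshold would have fired in both cases.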

The Standard Alarm Set

For each EC2 instance, QuickInfra configures:

CPU Utilisation — anomaly detection model
StatusCheckFailed — fires when the EC2 status check fails (immediate action required)
Disk Utilisation — threshold at 85%, plus a predictive alarm if usage is trending towards 85% within 6 hours (requires CloudWatch Agent)
Memory Utilisation — threshold at 80% (requires CloudWatch Agent)
Network In/Out — anomaly detection for unexpected traffic patterns
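Two of the alarms in this list could be expressed as CloudWatch `PutMetricAlarm` parameters roughly as follows. This is a sketch under assumptions, not QuickInfra's generated configuration: the alarm names, periods, evaluation counts, and band width are illustrative choices, and CloudWatch's built-in anomaly detection model manages its own training window, which may differ from the 30-day baseline described earlier.

```python
def cpu_anomaly_alarm(instance_id, sns_topic_arn, band_width=2):
    """Parameters for an alarm that fires when CPU rises above the expected band."""
    return {
        "AlarmName": f"{instance_id}-cpu-anomaly",  # hypothetical naming scheme
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,  # assumed: three consecutive 5-minute periods
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "cpu",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [
                            {"Name": "InstanceId", "Value": instance_id}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                # The band is computed by CloudWatch from the metric's history.
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(cpu, {band_width})",
                "Label": "CPUUtilization (expected)",
                "ReturnData": True,
            },
        ],
        "AlarmActions": [sns_topic_arn],
    }


def status_check_alarm(instance_id, sns_topic_arn):
    """Parameters for a plain threshold alarm on the EC2 status check."""
    return {
        "AlarmName": f"{instance_id}-status-check-failed",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,  # assumed: two failed minutes before alarming
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

Assuming boto3 is installed and credentials are configured, either dictionary can be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Building the parameters as plain dictionaries keeps them easy to review and unit test before anything is created in an account.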

For RDS:

FreeStorageSpace — threshold alarm when below 20% of allocated storage
DatabaseConnections — threshold alarm near max_connections
CPUUtilization — threshold alarm above 80%
ReplicaLag — for read replicas, threshold alarm above 30 seconds
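The RDS thresholds above need a little arithmetic, because `FreeStorageSpace` is reported in bytes while allocated storage is configured in GiB. The sketch below shows one way to derive the numbers. The function name is hypothetical, and reading "near max_connections" as 90% of the limit is an assumption made for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB


def rds_thresholds(allocated_storage_gib, max_connections, connection_percent=90):
    """Derive per-metric alarm thresholds for one RDS instance."""
    return {
        "FreeStorageSpace": {
            # Alarm when free space drops below 20% of allocated storage (bytes).
            "Threshold": allocated_storage_gib * GIB * 20 // 100,
            "ComparisonOperator": "LessThanThreshold",
        },
        "DatabaseConnections": {
            # Assumed reading of "near max_connections": 90% of the limit.
            "Threshold": max_connections * connection_percent // 100,
            "ComparisonOperator": "GreaterThanThreshold",
        },
        "CPUUtilization": {
            "Threshold": 80,  # percent
            "ComparisonOperator": "GreaterThanThreshold",
        },
        "ReplicaLag": {
            "Threshold": 30,  # seconds, read replicas only
            "ComparisonOperator": "GreaterThanThreshold",
        },
    }
```

For a 100 GiB instance with `max_connections` set to 200, this gives a free-storage threshold of 21474836480 bytes (20 GiB) and a connection threshold of 180.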

Alert Routing

QuickInfra configures SNS topics as alarm actions and lets you route alert notifications to email, Slack, or PagerDuty. Alarms are tiered by severity: StatusCheckFailed goes to PagerDuty for an immediate response, while a disk usage warning goes to Slack to be picked up by the next business day.
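Severity tiering of this kind can be sketched as two lookups: alarm kind to severity, then severity to SNS topic. Everything named here is a placeholder chosen for illustration, including the alarm kinds, the topic ARNs, and the account ID; it is not QuickInfra's actual routing table.

```python
# Hypothetical mapping from alarm kind to severity tier.
SEVERITY_BY_ALARM = {
    "StatusCheckFailed": "critical",
    "DiskUtilisation": "warning",
}

# Hypothetical SNS topics; one would be subscribed by PagerDuty, one by Slack.
TOPIC_BY_SEVERITY = {
    "critical": "arn:aws:sns:us-east-1:123456789012:pagerduty-critical",
    "warning": "arn:aws:sns:us-east-1:123456789012:slack-warnings",
}


def alarm_actions(alarm_kind, default_severity="warning"):
    """Return the list of SNS topic ARNs to use as AlarmActions for this alarm."""
    severity = SEVERITY_BY_ALARM.get(alarm_kind, default_severity)
    return [TOPIC_BY_SEVERITY[severity]]
```

Unknown alarm kinds fall back to the warning tier in this sketch, on the assumption that a missed page is less costly than paging someone for an unclassified alarm. The opposite default is equally defensible.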

Dashboard Generation

QuickInfra auto-generates a CloudWatch Dashboard for each Infrastructure Project with the most useful widgets: CPU, memory, disk, and network metrics per instance, plus an alarm status panel showing the current state of all alarms in the project.
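A generated dashboard of this shape comes down to a JSON document of widgets. The sketch below builds one CPU widget per instance plus an alarm status panel. The function name, layout, titles, and default region are assumptions, and memory, disk, and network widgets are left out for brevity.

```python
import json


def dashboard_body(instance_ids, alarm_arns, region="us-east-1"):
    """Build a CloudWatch dashboard body as a JSON string."""
    widgets = []
    for row, instance_id in enumerate(instance_ids):
        widgets.append({
            "type": "metric",
            # Left half of the 24-column grid, one row of height 6 per instance.
            "x": 0, "y": row * 6, "width": 12, "height": 6,
            "properties": {
                "title": f"CPU {instance_id}",
                "region": region,
                "stat": "Average",
                "period": 300,
                "metrics": [
                    ["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]
                ],
            },
        })
    # Alarm status panel listing the current state of each alarm.
    widgets.append({
        "type": "alarm",
        "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {"title": "Alarm status", "alarms": list(alarm_arns)},
    })
    return json.dumps({"widgets": widgets})
```

Assuming boto3 is available, the returned string can be passed as `DashboardBody` to `boto3.client("cloudwatch").put_dashboard(DashboardName=..., DashboardBody=body)`.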
