AWS CloudWatch Part 2/3: Alarms and Alerts | Epsagon (2023)

AWS CloudWatch can monitor metrics and generate alarms when they cross a certain threshold. Additionally, CloudWatch can take various actions when this occurs, such as alerting humans through email, SMS, or Slack messages. CloudWatch typically provides alerts by posting a message to an SNS (Amazon Simple Notification Service) topic, which would then dispatch the message via a variety of mediums, such as email, SMS, and Lambda functions. Additional actions that CloudWatch can take when an alarm goes off are typically auto-scaling events, such as scaling workloads out or back in.

This is the second of three articles exploring what AWS CloudWatch has to offer, this time focusing on alarms and alerts. Let’s dive into the details.

Billing Alert

The most important tip of this article is that CloudWatch can alert you when your bill is likely to become too high. You should never run any workload on AWS without this alarm, as it is quite easy to forget resources spun up for testing purposes. To create such a billing alarm, you need to log in to the AWS console using either the root user or an IAM user who has permission to access the billing section of the console.

The first step is to enable billing alerts. Click on your user name in the top-right corner of the screen, then click on “My Billing Dashboard.” In the left pane, select “Billing preferences,” then enable “Receive Billing Alerts,” and finally click on “Save preferences.”

The second step is to create the billing alarm itself. Head over to the CloudWatch service. In the left pane, click on “Alarms,” and then click on the “Create Alarm” button. Choose the “Billing, Total Estimated Charge” metric, and configure the alarm to go off when this metric exceeds the monthly amount above which you would like to be alerted; click “Next.” Then select “Create a New SNS Topic,” and enter a name for the SNS topic and your email address; click “Create topic.” Click “Next,” enter a name for the alarm, click “Next” again, and, finally, “Create Alarm.” That’s it!
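The same alarm can also be described in code. The following is a minimal sketch in Python, assuming the alarm would be created through the CloudWatch PutMetricAlarm API (for instance with boto3's put_metric_alarm). The alarm name, the 50 USD threshold, and the SNS topic ARN are placeholders, and the six-hour period is an assumption chosen because estimated charges are only refreshed a few times a day. Billing metrics are stored in the us-east-1 region.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Build keyword arguments for a PutMetricAlarm call on estimated charges."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",  # hypothetical name
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours, in seconds (assumed value)
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that emails you
    }


# Placeholder account ID and topic name, for illustration only.
params = billing_alarm_params(
    50, "arn:aws:sns:us-east-1:123456789012:billing-alerts"
)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes it easy to review the threshold before anything is created in your account.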

How Are AWS CloudWatch Alarms Evaluated?

Evaluation Periods and DatapointsToAlarm

There are three settings that control when an alarm goes off:

  • Period is the length of time over which the underlying metric is evaluated. The alarm period does not need to match the underlying metric period, but it needs to be at least as long as the metric period. CloudWatch will essentially generate one alarm data point per alarm period, based on the value(s) of the underlying metric during that period.
  • Evaluation period is the number of alarm periods (or alarm data points) to take into account when determining whether the alarm is triggered or not.
  • DatapointsToAlarm is the number of alarm data points that must breach the threshold during the evaluation period for the alarm to go off.

For example, let’s assume that the alarm period is the same as the underlying metric period, the evaluation period is five, and DatapointsToAlarm is three. For any five consecutive alarm data points, the alarm will go into the ALARM state if at least three of the five data points breach the alarm threshold. If, for a given set of five consecutive alarm data points, two or fewer breach the threshold, the alarm will not be in the ALARM state.
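The "three out of five" rule can be illustrated with a small simulation. This is a simplified sketch, not CloudWatch's actual evaluator: it assumes no missing data, a "greater than" comparison, and made-up CPU values. The function name alarm_states is hypothetical.

```python
def alarm_states(values, threshold, evaluation_periods, datapoints_to_alarm):
    """Return the alarm state after each full window of data points."""
    states = []
    for end in range(evaluation_periods, len(values) + 1):
        # Look at the most recent `evaluation_periods` data points.
        window = values[end - evaluation_periods:end]
        breaching = sum(1 for value in window if value > threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


# Made-up CPU utilization data points, one per alarm period.
cpu = [40, 95, 50, 92, 97, 45, 40, 42]
states = alarm_states(cpu, threshold=90, evaluation_periods=5, datapoints_to_alarm=3)
print(states)  # ['ALARM', 'ALARM', 'OK', 'OK']
```

The first two windows each contain three breaching points (95, 92, 97), so the alarm is in ALARM; once 95 slides out of the window, only two breaching points remain and the state returns to OK.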

These settings are quite useful for filtering out normal spikes, such as short bursts of CPU usage. It’s normal for the CPU to run very high for short periods of time (because the OS is performing some maintenance tasks, for example), and you probably wouldn’t want to be alerted for such things.

What Happens When There Is Missing Data?

There are various reasons why the underlying metric might have missing data points during an evaluation period, including:

  • The service or EC2 instance is just starting and hasn’t yet reported enough metrics.
  • The EC2 instance is rebooting and can’t report the metric during the reboot time.
  • Some networking parameters (such as a security group or network access control list) have been modified, preventing the EC2 instance from connecting to CloudWatch.
  • There are some network glitches (typically if the metric is reported from outside AWS).

You can configure each alarm to consider missing data points as:

  • Breaching: The missing data point is assumed to breach the alarm threshold.
  • Not breaching: The missing data point is assumed to be within the alarm threshold.
  • Ignore: The alarm state is left untouched.
  • Missing: The data point is not considered when evaluating the alarm. If every data point in the evaluation range is missing, the alarm goes into the INSUFFICIENT_DATA state.

Which option to choose depends on how the underlying metric is reported. For example, if your alarm is for CPU, RAM, or disk utilization, it’s probably safer to consider missing data as breaching, as it is likely that your EC2 instance is going through a rough time. Some metrics are reported only when errors occur (such as ThrottledRequests in Amazon DynamoDB), in which case missing data should be treated as not breaching.
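In the PutMetricAlarm API, this choice is expressed through the TreatMissingData parameter, whose accepted values are breaching, notBreaching, ignore, and missing. The sketch below applies the two recommendations above to partial alarm definitions; the helper name and the alarm names are hypothetical, and the dictionaries omit the other parameters a real alarm needs.

```python
VALID_TREATMENTS = {"breaching", "notBreaching", "ignore", "missing"}


def with_missing_data_treatment(alarm_params, treatment):
    """Return a copy of the alarm parameters with TreatMissingData set."""
    if treatment not in VALID_TREATMENTS:
        raise ValueError(f"unknown missing-data treatment: {treatment!r}")
    return {**alarm_params, "TreatMissingData": treatment}


# Utilization metric: silence probably means the instance is in trouble.
cpu_alarm = with_missing_data_treatment(
    {"AlarmName": "high-cpu", "Namespace": "AWS/EC2", "MetricName": "CPUUtilization"},
    "breaching",
)

# Error-only metric: silence means no errors were reported.
throttle_alarm = with_missing_data_treatment(
    {"AlarmName": "ddb-throttled", "Namespace": "AWS/DynamoDB", "MetricName": "ThrottledRequests"},
    "notBreaching",
)
```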

Importantly, CloudWatch tries to use missing data points as little as possible, no matter how you configure the alarm to treat them. It does so by retrieving more data points than the number of evaluation periods requires (this larger set is called the evaluation range) and using as many valid data points as possible within that range.

High-Resolution Alarms

Alarms typically have a resolution of sixty seconds. If the underlying metric is a high-resolution one, you can use a high-resolution alarm with a period of ten or thirty seconds. Although they come at a higher cost, these can be useful if you need very swift action.

Advanced AWS CloudWatch Alarms Evaluation

Combining Metrics

CloudWatch can combine metrics using math. This will be explored in a subsequent article, but what is relevant here is that you can create alarms on the output of the math expression.
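As an illustration, here is a hedged sketch of what such an alarm definition could look like as PutMetricAlarm parameters. It assumes a Lambda error-rate alarm built from the AWS/Lambda Errors and Invocations metrics; the function name, alarm name, periods, and threshold are placeholders rather than recommendations.

```python
def lambda_error_rate_alarm(function_name, threshold_percent, sns_topic_arn):
    """Build PutMetricAlarm parameters for an alarm on a metric math expression."""

    def stat(metric_id, metric_name):
        # An input metric: fetched for the expression, but not alarmed on directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 300,  # both inputs use the same period
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",  # hypothetical name
        "Metrics": [
            stat("errors", "Errors"),
            stat("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,  # the alarm watches this expression
            },
        ],
        "EvaluationPeriods": 3,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


alarm = lambda_error_rate_alarm(
    "checkout", 5, "arn:aws:sns:us-east-1:123456789012:Alarms"
)
```

Exactly one entry in the Metrics list has ReturnData set to True; that entry is the one the threshold is compared against.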

Anomaly Detection

CloudWatch also offers anomaly detection, a feature that generates alarms based on anomalous patterns in the underlying metric. To use it, you simply create an alarm as usual and select “Anomaly detection” under “Conditions.” Alternatively, when you display a metric in CloudWatch, select the “Graphed metrics” tab and click on the wave icon next to the metric name.

Using anomaly detection, you can benefit from the years of experience that AWS has accumulated in monitoring a variety of workloads. It allows you to leverage machine learning in a very intuitive and user-friendly way. Anomaly detection uses machine learning and statistical analysis to determine the valid or “usual” range for a metric. It will then alert you whenever the metric goes out of that range and is thus suspicious.
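To make the idea of a “usual” range concrete, here is a toy illustration in Python. It is emphatically not the model CloudWatch uses: it simply assumes a band of two standard deviations around the historical mean, with made-up data and hypothetical function names.

```python
import statistics


def usual_range(history, width=2.0):
    """Return a (lower, upper) band of `width` standard deviations around the mean."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    return mean - width * spread, mean + width * spread


def out_of_range(history, new_points, width=2.0):
    """Return the new data points that fall outside the usual range."""
    lower, upper = usual_range(history, width)
    return [point for point in new_points if point < lower or point > upper]


# Made-up historical values for a metric, followed by four new observations.
history = [50, 52, 48, 51, 49, 50, 53, 47]
suspicious = out_of_range(history, [51, 58, 45, 50])
print(suspicious)  # [58, 45]
```

Here the band works out to 46 through 54, so 58 and 45 are flagged. A real anomaly detection model also has to account for trends and daily or weekly patterns, which this sketch ignores.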

How to Alert a Human Being

CloudWatch can take a variety of actions when an alarm goes off, such as triggering an auto-scaling event or sending a message through a medium likely to attract the attention of a human. Typically, you would need to create an SNS topic and add subscriptions to that SNS topic. Each subscription represents a channel to which the alarm message will be forwarded. For example, an “Alarms” SNS topic can have a number of subscriptions, and any message sent to that topic will be forwarded to all subscribers, whether by email, SMS, Lambda invocation, etc.

The next step is to link the alarm with the SNS topic. So when you create or edit the alarm, you just need to select the correct SNS topic. After that, every time the alarm goes off, CloudWatch will send a message to SNS, which will forward it to you via email, SMS, etc. You can even create a Lambda function that will post the message on a Slack channel.
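A Lambda function of that kind could look roughly like the sketch below. It assumes the function is subscribed to the SNS topic, that the alarm notification arrives as a JSON string in the SNS record's Message field (with AlarmName, NewStateValue, and NewStateReason keys), and that a Slack incoming-webhook URL is supplied through a hypothetical SLACK_WEBHOOK_URL environment variable.

```python
import json
import os
import urllib.request


def format_slack_text(sns_event):
    """Turn the first SNS record of an alarm notification into a one-line summary."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return (
        f"{message['AlarmName']} is now {message['NewStateValue']}: "
        f"{message['NewStateReason']}"
    )


def lambda_handler(event, context):
    """Lambda entry point: post the alarm summary to a Slack incoming webhook."""
    body = json.dumps({"text": format_slack_text(event)}).encode("utf-8")
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # hypothetical variable name
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return {"status": response.status}


# A trimmed-down, made-up SNS event for local experimentation.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "high-cpu",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed",
                    }
                )
            }
        }
    ]
}
text = format_slack_text(sample_event)
```

Separating the formatting from the HTTP call keeps the message logic testable without touching Slack.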

AWS CloudWatch Alarms Cost Considerations

Don’t Create a Loop!

The advice here is fairly obvious: avoid creating a loop. For example, you might have an alarm on a Lambda function that sends a message to Slack. This could go into an infinite loop where the Lambda triggers the alarm, which triggers the Lambda function, etc. This is a silly example, and such situations are unlikely to happen in real life, but it is still something to keep in mind. Obviously, such a loop is likely to cost you a lot of money!

Avoid Sending Too Many Text Messages

Pricing information for SNS can be found on the Amazon SNS pricing page. The free tier allows you to send 100 SMSes (to US phone numbers) and 1,000 emails per month. After that, you will have to pay per SMS, which is reasonably cheap for US numbers but can be significantly more expensive for other countries.

If you’re using Slack in your organization, a good way to minimize these costs is to post a message to a Slack channel instead.

Other Points

Pricing information for CloudWatch can be found on the Amazon CloudWatch pricing page.

High-resolution alarms are about three times more expensive than regular alarms, so use them only if necessary. The costs for alarms and alerts are usually quite low, so you don’t need to worry about them unless you have a large number of them.

Another point to consider with CloudWatch is that everything needs to be set up manually, so there is a chance of missing or misconfiguring an alert.

CloudWatch vs. Competition

There is a large offering of additional monitoring software tools out there, both in the open-source community and from paid-for suppliers.

The open-source community has a number of very capable software solutions. Prometheus is well known and able to ingest and process metric data; its time-series database is very efficient. Prometheus works well in most workloads but doesn’t scale very well for very large workloads. In contrast, CloudWatch has gigantic scaling capabilities built-in, so scale should never be a worry. In addition, Prometheus can only deal with metrics that are actually reported; if the agent running on the EC2 instance has some problems and doesn’t report metrics to the Prometheus server, you’re out of luck. You will also need a separate piece of software for visualization, such as Grafana.

Other software, such as Zabbix and Nagios, focuses more on system monitoring, including probing and periodic testing of whether services are up or not. One of the main difficulties with both of these tools is that they are quite complex and pretty difficult to set up and maintain.

Third-party solutions are likewise numerous and include Sumo Logic and Epsagon. Most of these are very good, and going into detail on them would be beyond the scope of this article. Generally speaking, these solutions offer turn-key, easy-to-use features to analyze your metrics.

Conclusion

The way CloudWatch treats missing data can be annoying, especially if you want missing data points to always trigger an alarm: CloudWatch might use non-missing data points from the evaluation range to evaluate the alarm, which might mean that the alarm does not go off. Nevertheless, CloudWatch Alarms is a very powerful and capable solution. It certainly holds its ground against dedicated solutions, such as Epsagon, especially if your needs are simple, which is the case for most workloads.

Additionally, AWS CloudWatch integrates very well with other AWS services, both for input (generating alarms) and output (taking action when alarms go off). In conclusion, you should probably evaluate CloudWatch in light of your requirements as the first port of call for monitoring your workload, especially if it is run on AWS.

Update: Our AWS CloudWatch series is now available on our blog. Check out part 1 (Logs and Insights) and part 3 (Metrics and Dashboards).

FAQs

What are the 3 states of the CloudWatch metric alarm?

A CloudWatch Alarm is always in one of three states: OK, ALARM, or INSUFFICIENT_DATA.

Why does my CloudWatch alarm say insufficient data?

When you create a CloudWatch alarm, its first state by default is INSUFFICIENT_DATA. It remains in this state until it completes its first evaluation of the metric being monitored. Typically, an alarm transitions out of INSUFFICIENT_DATA within a few minutes of creation.

What are 3 things you can do in CloudWatch?

CloudWatch ServiceLens lets you gain visibility into your applications in three main areas: infrastructure monitoring (using metrics and logs to understand the resources supporting your applications), transaction monitoring (using traces to understand dependencies between your resources), and end-user monitoring (using ...

Can you configure the alarm to monitor more than 1 metric?

You can set an alarm on the result of a math expression that is based on one or more CloudWatch metrics. A math expression used for an alarm can include as many as 10 metrics. Each metric must be using the same period.

How do you make a CloudWatch alarm with multiple metrics?

To create an alarm based on a metric math expression, choose one or more CloudWatch metrics to use in the expression. Then, specify the expression, threshold, and evaluation periods. You can't create an alarm based on the SEARCH expression.

How do I test my AWS alarm?

To test notifications to configured chat clients

Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/ . In the navigation pane, choose Alarms, Create alarm. Select the correct AWS Region at the top right of the AWS console, that contains the Amazon SNS topic you need.

What is sample count in CloudWatch?

SampleCount is the number of data points during the period. Sum is the sum of the values of the all data points collected during the period. Average is the value of Sum/SampleCount during the specified period. Minimum is the lowest value observed during the specified period.

How long does CloudWatch alarm stay in alarm state?

Alarm history is available for 14 days.

What is Datapoint in alarm?

Datapoints to Alarm is the number of data points within the evaluation period that must be breaching to cause the alarm to go to the ALARM state. The breaching data points do not have to be consecutive; they just must all be within the last number of data points equal to Evaluation Period.

What is the use of CloudWatch alarms?

Alarms in CloudWatch are thresholds defined by you for specific metrics. Alarms trigger according to state changes, not current values. You can use these alarms to start, stop, or terminate resources or to send a notification to your team that something has changed.

What is a CloudWatch alarm dimension?

AWS::CloudWatch::Alarm Dimension

Dimension is an embedded property of the AWS::CloudWatch::Alarm type. Dimensions are name/value pairs that can be associated with a CloudWatch metric. You can specify a maximum of 10 dimensions for a given metric.

What are metrics in CloudWatch?

Metrics are the fundamental concept in CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch. Think of a metric as a variable to monitor, and the data points as representing the values of that variable over time.

Can CloudWatch alarm trigger Lambda?

You can use a CloudWatch Events rule that matches on alarm evaluation changes and then triggers a Lambda function that parses the alarm event and creates a customized notification.

What actions can I take from a CloudWatch alarm?

Using Amazon CloudWatch alarm actions, you can create alarms that automatically stop, terminate, reboot, or recover your EC2 instances. You can use the stop or terminate actions to help you save money when you no longer need an instance to be running.

How do I create a CloudWatch alarm for CPU utilization?

Setting up a CPU usage alarm using the AWS Management Console
  1. In the navigation pane, choose Alarms, All Alarms.
  2. Choose Create alarm.
  3. Choose Select metric.
  4. In the All metrics tab, choose EC2 metrics.
  5. Choose a metric category (for example, Per-Instance Metrics).

How do you make a composite alarm in CloudWatch?

Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/ . In the navigation pane, choose Alarms, and then choose In alarm. From the list of alarms, select the check box next to each of the existing alarms that you want to reference in your rule expression, and then choose Create composite alarm.

What is threshold in CloudWatch alarm?

Threshold is 5 hours because you set the metric period to 5 minutes, which means each datapoint covers the span of 5 minutes, and you set it to alarm after 60 datapoints (60 * 5 = 300 minutes, or 5 hours). To change the period select the alarm and click Actions -> Modify .

Can I monitor resource by CloudWatch in multiple regions?

You can create cross-account cross-Region dashboards, which summarize your CloudWatch data from multiple AWS accounts and multiple Regions into one dashboard.

How do I change alarm state in CloudWatch?

To edit an alarm

Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/ . In the navigation pane, choose Alarms, All Alarms. Choose the name of the alarm. Choose Edit.

What is the difference between CloudTrail and CloudWatch?

CloudWatch is a monitoring service for AWS resources and applications. CloudTrail is a web service that records API activity in your AWS account. They are both useful monitoring tools in AWS.

What is threshold value in AWS?

Describes a load-based auto scaling upscaling or downscaling threshold configuration, which specifies when AWS OpsWorks Stacks starts or stops load-based instances.

How do I monitor my CloudWatch logs?

When the CloudWatch dashboard appears, click on the Logs option, and then click on the number of metric filters that is displayed within your log group. (The number of metric filters will initially be set at zero.) If no log groups exist, you will have to create a log group before continuing.

How long are CloudWatch logs retained?

You can store your log data in CloudWatch Logs for as long as you want. By default, CloudWatch Logs will store your log data indefinitely. You can change the retention for each Log Group at any time.

How do I check my CloudWatch metrics?

Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/ . In the navigation pane, choose Metrics, and then choose All metrics. Select a metric namespace (for example, EC2). Select a metric dimension (for example, Per-Instance Metrics).

Does CloudWatch send alerts?

CloudWatch alarm sends the first alarm notification to the associated SNS alarm actions. CloudWatch Alarms service sends an alarm state change event which triggers the EventBridge rule.

How do I set up alerts in AWS CloudWatch?

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/ . In the navigation pane, choose Instances. Select the instance and choose Actions, Monitor and troubleshoot, Manage CloudWatch alarms. On the Manage CloudWatch alarms detail page, under Add or edit alarm, select Create an alarm.

What is Datapoint in CloudWatch alarm?

A datapoint is the value of a metric for a given metric aggregation period i.e. if you use one minute as an aggregation period for a metric, then there will be one datapoint every minute.

Why is my CloudWatch alarm in an alarm state?

The alarm goes to ALARM state when the metric breaches the threshold for a specified number of evaluation periods. Open the CloudWatch console at https://console.amazonaws.cn/cloudwatch/ . In the navigation pane, choose Alarms, All alarms. Choose Create alarm.

What is Composite alarm?

A composite alarm is comprised of multiple individual alarms. The state of a composite alarm is a combination of the states of the individual alarm rules. Each individual alarm rule in a composite alarm must have the same metric scope, but each alarm can analyze a different metric.

How do you make a CloudWatch alarm based on logs?

Open the CloudWatch console at https://console.amazonaws.cn/cloudwatch/ .
  1. From the navigation pane, choose Logs, and then choose Log groups.
  2. Choose the log group that includes your metric filter.
  3. Choose Metric filters.
  4. In the metric filters tab, select the box for the metric filter that you want to base your alarm on.

How do Alerts work in AWS?

You can configure a detector to run an AWS Lambda function to process anomaly alerts, or send details to an Amazon Simple Notification Service (Amazon SNS) topic. Amazon SNS can then send the information to email subscribers or an HTTP endpoint, among numerous other supported destinations.

What are the best practices for cloud alarm configuration?

Optimal alarm configuration addresses the following factors:
  • Criticality of the resource.
  • Appropriate resource behavior. Assess behavior singly and within the context of the service ecosystem. ...
  • Acceptable notification noise.

How long are CloudWatch metrics kept?

To offer increased flexibility, CloudWatch added functionality to store metrics for up to 15 months at no additional charge. To keep the overall amount of data reasonable, historical data is stored at a lower granularity level, as indicated below: 1-minute data points will be available for 15 days.

Can you delete CloudWatch metrics?

To delete a metric filter using the CloudWatch console

In the navigation pane, choose Log groups. In the contents pane, in the Metric Filter column, choose the number of metric filters for the log group. Under Metric Filters screen, select the check box to the right of the name of the filter that you want to delete.

What is CloudWatch custom metrics?

A custom metric enables you to monitor a specific application binary or runtime. CloudWatch helps you monitor the infrastructure portion of an EC2 instance, such as CPU, hard disk and network.

What is the difference between CloudWatch logs and metrics?

While logs are about a specific event, metrics are a measurement at a point in time for the system.

What is CloudWatch event rule?

Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes in Amazon Web Services (AWS) resources. Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams.

Is CloudWatch a time series?

AWS CloudWatch metrics

A metric is a variable that stores a time series data set. AWS services push metrics to CloudWatch. You can then get useful information about those metrics from CloudWatch.

What is a CloudWatch metric dimension?

A dimension is a name/value pair that is part of the identity of a metric. Because dimensions are part of the unique identifier for a metric, whenever you add a unique name/value pair to one of your metrics, you are creating a new variation of that metric.

Which three components are emitted along with raw data points or timestamp value pairs as metrics to the monitoring service?

The Monitoring service uses metrics to monitor resources and alarms to notify you when these metrics meet alarm-specified triggers. Metrics are emitted to the Monitoring service as raw data points , or timestamp-value pairs, along with dimensions and metadata.

How do you make a custom alarm on CloudWatch?

Create an alarm for your custom metric. On the CloudWatch console, choose Alarms and then choose Create Alarm. Choose Select metric and enter the name of the metric that you created earlier into the search box.

Article information

Author: Mrs. Angelic Larkin

Last Updated: 02/11/2023
