Triaging problems with CloudWatch Log Insights

You never know when the next problem may appear in your system.

It could be from the code you wrote, a library used, an external service, or even quantum physics causing transistors to go awry. Even with the highest code quality, test coverage, and the most rigorous review process, problems can still arise.

When that inevitable error hits, we need a way to find out what is going on.

A good starting place is to look at system logs. In this piece, we will cover the Log Insights feature of CloudWatch (officially named CloudWatch Logs Insights). CloudWatch is the observability and monitoring solution provided by AWS.

CloudWatch & observability

An observability platform provides the ability to gain insight into a system. Such platforms usually ingest data such as logs, metrics, and traces so that engineers can explore patterns and derive meaningful interpretations of system data.

There are many observability solutions available. It is not uncommon for larger companies to adopt multiple observability platforms to cater for different preferences.

CloudWatch benefits from being integrated within AWS: there is little configuration overhead for a user who is already on AWS. It is also priced on the lower end of the spectrum.

For example, the price of data ingestion in the London region is currently $0.5985 per GB ingested, after exhausting the free tier. For comparison, Datadog starts at $0.10 per ingested or scanned GB, per month, although that figure excludes Datadog's separate charge for indexing log events, so the two headline prices are not directly comparable. These attributes make CloudWatch an accessible choice on software engineering projects.

Observability systems are feature-rich and have a lot of depth, and CloudWatch is no different. In this blog post I will focus on the Log Insights component of CloudWatch, since I think it is a good entry point and it is the component I have interacted with the most as a developer.

My aim is to provide some tips to help those who are unfamiliar with CloudWatch to begin writing queries on Log Insights.

Visiting the Log Insights View

We can visit Log Insights on the CloudWatch console via the sidebar on the left.

Screengrab of CloudWatch Log Insights page.

In order to start querying, we must first select a log group to query on.

Log events that share the same source form a log stream, and streams are put into log groups that share retention and access control. Logs from the same application are likely to be put into the same group.

Log Insights uses a custom query language that may initially appear alien to some, but it comes with some sample queries to help us get started.

When we visit Log Insights, we can see the default query:

fields @timestamp, @message
| sort @timestamp desc
| limit 20

The default query will return the latest 20 messages from the selected log group(s).

The ‘fields’ command selects specific fields to be shown in the list of results. You can select a row from the results to show all of the fields in that message. The internal field @timestamp contains the timestamp of the log event, and @message contains the raw, unparsed log event. The time at which CloudWatch received the event is held separately in @ingestionTime.

The query then uses the ‘sort’ command to arrange the messages by timestamp in descending order. Finally, the results returned are limited to 20.
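The same query can also be run outside the console. The following is a minimal sketch rather than a definitive implementation: it assumes a `client` object that behaves like `boto3.client("logs")`, using its `start_query` and `get_query_results` calls. The function name `run_query` and its parameters are illustrative and not part of any AWS library.

```python
import time

DEFAULT_QUERY = """fields @timestamp, @message
| sort @timestamp desc
| limit 20"""


def run_query(client, log_group, query, start, end, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return rows as dicts.

    `client` is assumed to behave like boto3.client("logs").
    `start` and `end` are epoch seconds.
    """
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]

    # Queries run asynchronously, so poll until the status is final.
    while True:
        response = client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")

    # Each row is a list of {"field": ..., "value": ...} pairs.
    return [
        {cell["field"]: cell["value"] for cell in row}
        for row in response["results"]
    ]
```

Passing the client in as an argument keeps the sketch free of credentials and makes it easy to try against a stub before pointing it at a real log group.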

Time Range

The time range window is a convenient way to filter events by time, although it is still possible to filter by time in the query language when preferred. The window can be relative, such as the past 30 days, or absolute.

Time range is an effective way to reduce the volume of messages retrieved, thus improving execution time and reducing the cost of analysis.
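When a query is run through the API rather than the console, a relative window has to be turned into absolute bounds. For instance, boto3's `start_query` takes `startTime` and `endTime` as epoch seconds. A small sketch of that conversion follows; the helper name `relative_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def relative_window(now, **delta):
    """Return (start, end) in epoch seconds for a window ending at `now`.

    `delta` is passed straight to timedelta, e.g. days=30 or minutes=15.
    """
    start = now - timedelta(**delta)
    return int(start.timestamp()), int(now.timestamp())


# A fixed "now" keeps the example reproducible.
now = datetime(2023, 8, 15, 12, 0, tzinfo=timezone.utc)
start, end = relative_window(now, days=30)
print(end - start)  # 2592000 seconds, i.e. 30 days
```

A narrower window means fewer log events scanned, which is where the savings in execution time and cost come from.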

Screengrab of CloudWatch Log Insights calendar.

Sample Queries

There is a handy folder of sample queries available to get started with. These queries allow us to ask questions about our systems, such as which Lambda requests are the most expensive, or which IP addresses are using the UDP transfer protocol on our VPC.

Screengrab of CloudWatch Log Insights queries.

Let us break down some of the common queries.

Sample: “Number of exceptions logged every 5 minutes”

filter @message like /Exception/
| stats count(*) as exceptionCount by bin(5m)
| sort exceptionCount desc

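This query uses the filter command to keep only the messages whose @message field contains ‘Exception’.

We then use the stats command to count the remaining messages as a field called ‘exceptionCount’, grouped into bins of 5 minutes. Finally, the sort command orders the bins so that those with the most exceptions appear first.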
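To make the binning concrete, here is a rough local emulation of that query in Python. It is only a sketch of the idea and not how CloudWatch implements it: it assumes events arrive as (epoch seconds, message) pairs and that 5-minute bins are aligned to multiples of 300 seconds.

```python
import re
from collections import Counter

BIN_SECONDS = 5 * 60


def exceptions_per_bin(events):
    """Count messages matching /Exception/ per 5-minute bin.

    `events` is an iterable of (epoch_seconds, message) tuples.
    Returns (bin_start, count) pairs, highest count first.
    """
    counts = Counter()
    for timestamp, message in events:
        # filter @message like /Exception/
        if re.search(r"Exception", message):
            # bin(5m): round the timestamp down to the start of its bin
            bin_start = timestamp - timestamp % BIN_SECONDS
            counts[bin_start] += 1
    # sort exceptionCount desc
    return counts.most_common()


events = [
    (0, "NullPointerException in handler"),
    (10, "request completed"),
    (299, "Exception while parsing"),
    (300, "IOException on upload"),
    (900, "request completed"),
]
print(exceptions_per_bin(events))  # [(0, 2), (300, 1)]
```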

The filter command

Using the filter command removes messages that do not fit the criteria. It supports comparison operators (=, !=, <, <=, >, >=) and Boolean operators (and, or, not).

The ‘like’ keyword can be used to match substrings using a string or regular expression. The following commands yield the same results:

| filter @message like "Exception"
| filter @message like /Exception/
| filter @message =~ /Exception/

Likewise, you can build more complex queries using Boolean logic or regular expressions:

| filter strcontains(@message, "logged in") or strcontains(@message, "but not registered with")
| filter @message like /(logged in|completed challenge)/
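For readers more familiar with a general-purpose language, the matching behaviour can be approximated in Python. This is an analogy only: the helper `like` below is hypothetical, and Python's `re` module is not the engine CloudWatch uses, so edge cases may differ.

```python
import re


def like(message, pattern):
    """Approximate `like`: substring test for a plain string,
    unanchored regex search for a compiled pattern."""
    if isinstance(pattern, re.Pattern):
        return pattern.search(message) is not None
    return pattern in message


message = "java.lang.NullPointerException at Handler.run"

# A string and the equivalent regex match the same message.
print(like(message, "Exception"))               # True
print(like(message, re.compile(r"Exception")))  # True

# Matching is case-sensitive unless the pattern says otherwise.
print(like("exception swallowed", "Exception"))                     # False
print(like("exception swallowed", re.compile(r"(?i)Exception")))    # True

# Alternation behaves like the Boolean `or` of two substring checks.
either = re.compile(r"(logged in|completed challenge)")
print(like("user 42 completed challenge 7", either))  # True
```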

The filter command has too many keywords and patterns to cover in this blog post. Refer to the query syntax documentation if you want more detail.

The stats command

The stats command is used to calculate aggregate statistics. It supports aggregate functions like:

Average values
Count
Min/Max
Percentiles
Standard deviations
Sum

It also supports non-aggregation functions such as earliest, latest, sortsFirst, and sortsLast.

The stats command is useful for reporting and can be very powerful when combined with fields that have been extracted for further analysis.

Sample: “To parse and count fields”

fields @timestamp, @message
| filter @message like /User ID/
| parse @message "User ID: *" as @userId
| stats count(*) by @userId

The parse command

The parse command allows you to use expressions to extract values into a custom field.

The sample query above extracts whatever the asterisk wildcard matches in the string “User ID: *”. That means if you had the messages “User ID: Bob” and “User ID: Alice”, the query creates a new field @userId with the values “Bob” and “Alice”.

We can use a query like this to find out how many events with this log format are tied to each userId. Extracting fields allows for more comprehensive queries and reporting.
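The parse-then-count idea can be sketched locally in Python. This is an approximation under stated assumptions, not CloudWatch's implementation: the helpers `glob_to_regex` and `count_by` are hypothetical, and the sketch assumes each asterisk captures the shortest text that lets the rest of the pattern match, with the final asterisk running to the end of the message.

```python
import re
from collections import Counter


def glob_to_regex(pattern):
    """Turn a pattern such as "User ID: *" into a regex with one
    capture group per asterisk."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("(.*?)".join(literal_parts) + "$")


def count_by(messages, pattern):
    """Emulate: parse @message "<pattern>" as @x | stats count(*) by @x"""
    regex = glob_to_regex(pattern)
    counts = Counter()
    for message in messages:
        match = regex.search(message)
        if match:  # messages that do not fit the pattern are skipped
            counts[match.group(1)] += 1
    return counts


messages = [
    "User ID: Bob",
    "User ID: Alice",
    "User ID: Bob",
    "heartbeat ok",
]
print(count_by(messages, "User ID: *"))  # Counter({'Bob': 2, 'Alice': 1})
```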

Automatically Discovered Fields

Logs from certain AWS resources will automatically include fields that provide more information that is unique to the service.

Take the sample query to find the most expensive Lambda requests for example:

filter @type = "REPORT"
| fields @requestId, @billedDuration
| sort by @billedDuration desc

The fields @type, @requestId, and @billedDuration are specific to Lambda logs and are automatically available when querying with Log Insights.
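To give a feel for the information these fields carry, the sketch below pulls the request ID and billed duration out of Lambda REPORT lines by hand and sorts them. Log Insights does this extraction for you. The log lines, request IDs, and helper name are made up for illustration, and the regex assumes the usual REPORT line layout.

```python
import re

# Assumed layout of a Lambda REPORT line (fields separated by tabs).
REPORT_LINE = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: [\d.]+ ms\s+"
    r"Billed Duration: (?P<billed>\d+) ms"
)


def most_expensive(lines):
    """Return REPORT entries sorted by billed duration, highest first."""
    rows = []
    for line in lines:
        match = REPORT_LINE.match(line)
        if match:  # roughly: filter @type = "REPORT"
            rows.append({
                "@requestId": match.group("request_id"),
                "@billedDuration": int(match.group("billed")),
            })
    # sort by @billedDuration desc
    return sorted(rows, key=lambda r: r["@billedDuration"], reverse=True)


lines = [
    "START RequestId: req-a Version: $LATEST",
    "REPORT RequestId: req-a\tDuration: 102.25 ms\tBilled Duration: 103 ms"
    "\tMemory Size: 128 MB\tMax Memory Used: 51 MB",
    "REPORT RequestId: req-b\tDuration: 950.10 ms\tBilled Duration: 951 ms"
    "\tMemory Size: 128 MB\tMax Memory Used: 60 MB",
]
for row in most_expensive(lines):
    print(row["@requestId"], row["@billedDuration"])
```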

It is worth noting that AWS resources also send metrics to CloudWatch. For example, EC2 regularly sends instance metrics such as CPUUtilization. Common metric categories include resource utilisation, latency, network traffic, provisions, and so on. These metrics are queried using Metrics Insights, which is another CloudWatch component that uses a SQL query engine.

Conclusion

CloudWatch Log Insights allows us to make queries to find out more information about our systems. It is often the place to start when we have questions.

We have discussed some essential commands for filtering, sorting, extracting fields, and computing statistics. We also looked at how Log Insights integrates with other AWS services through automatically discovered fields.

These are concepts I would have considered useful to know when I first started making queries on Log Insights, and should provide a good foundation for those who are not familiar with CloudWatch.

Useful Links

CloudWatch Query Syntax: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html

Sample Queries: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html

Supported Logs and Discovered Fields: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData-discoverable-fields.html

List of CloudWatch Metrics for EC2: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html

15 August 2023
