Understanding AI-Powered Analysis with CloudWatch Investigations and the Bigger Picture of Incident Management (Part 1)

Hello everyone,

In this article, I will talk about the AI-powered CloudWatch feature that simplifies root cause analysis for issues occurring in your AWS environment, as well as the possibilities for semi-automating an incident analysis process.

Let me first mention the CloudWatch Investigations feature, announced on June 24. This feature automatically analyzes CloudWatch metrics, logs, events, deployment data, AWS Health, and CloudTrail records when an alarm or performance issue occurs in your system, providing you with potential causes and solution suggestions.

For example, imagine you no longer need to manually review all metrics and logs when a performance issue arises. CloudWatch Investigations can examine all the values on your behalf and initiate an analysis with the support of Amazon Q. This can make problem resolution during a crisis much easier.

As another use case, this feature can help first-level support staff analyze issues in a shift-based work environment, and it can help newly hired engineers who do not yet know the environment well get to an analysis more easily. Many similar convenience and usability scenarios are easy to imagine.

Since this feature supports cross-account functionality, you can add and analyze multiple accounts. Imagine you have an environment operating in a Hub & Spoke structure. Suppose all network traffic exits through the network account, and the solution you use there (such as a FortiGate or Palo Alto appliance running on EC2 instances, or AWS Network Firewall) encounters an issue. Everything in Account A is functioning correctly, but the application cannot access the internet. In such a case, you can use cross-account analysis to examine dependencies across multiple accounts.

In general, an investigation is initiated through an alarm, metric, or log query. CloudWatch Investigations scans the relevant data from associated resources using your IAM permissions. The AI then generates observations, suggestions, and hypotheses from this data. All these actions are also recorded in CloudTrail.
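Because these actions are recorded in CloudTrail, one way to review them is to query the CloudTrail event history. The sketch below only builds the arguments for CloudTrail's `lookup_events` call and does not contact AWS. The helper name is hypothetical, and the event source value `aiops.amazonaws.com` is an assumption that should be checked against the events in your own account.

```python
from datetime import datetime, timedelta, timezone

# Assumption: investigation activity is logged under this event source.
ASSUMED_EVENT_SOURCE = "aiops.amazonaws.com"


def build_lookup_request(hours_back, now=None, event_source=ASSUMED_EVENT_SOURCE):
    """Build keyword arguments for CloudTrail's lookup_events call (hypothetical helper)."""
    if hours_back <= 0:
        raise ValueError("hours_back must be positive")
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours_back)
    return {
        # Filter the event history by the service that emitted the event.
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": event_source}
        ],
        "StartTime": start,
        "EndTime": end,
        "MaxResults": 50,  # lookup_events accepts at most 50 results per page
    }
```

With boto3 installed and credentials configured, the result could be passed on as `boto3.client("cloudtrail").lookup_events(**build_lookup_request(24))`.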

When you create an Investigation Group, you gain access to the following options:

You can accept or reject AI suggestions, allowing the system to learn from your feedback.
You can add cross-account access to analyze data from different AWS accounts.
You can include additional data sources such as CloudTrail, X-Ray, Application Signals, and EKS.
You can execute automated remediation actions through runbook recommendations (Systems Manager Automation).
You can add and share notes, enabling collaborative review of the report with team members.
You can perform actions such as stopping, archiving, or reopening the investigation.
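If you prefer to create the investigation group from code rather than the console, the request might look roughly like the sketch below. It only assembles the arguments and does not call AWS. The helper name is hypothetical, and the parameter names (`name`, `roleArn`, `retentionInDays`, `isCloudTrailEventHistoryEnabled`, `crossAccountConfigurations`) reflect my reading of the `create_investigation_group` operation in the boto3 `aiops` client, so please verify them against the current API reference before relying on them.

```python
def build_investigation_group_request(name, role_arn, spoke_role_arns=(), retention_days=90):
    """Build keyword arguments for create_investigation_group (hypothetical helper).

    Parameter names are assumptions based on the AIOps API and should be verified.
    """
    request = {
        "name": name,
        # IAM role the investigation uses to read telemetry in this account.
        "roleArn": role_arn,
        "retentionInDays": retention_days,
        # Include CloudTrail event history as an additional data source.
        "isCloudTrailEventHistoryEnabled": True,
    }
    if spoke_role_arns:
        # One entry per role in another account that the investigation may assume.
        request["crossAccountConfigurations"] = [
            {"sourceRoleArn": arn} for arn in spoke_role_arns
        ]
    return request
```

In the Hub & Spoke example above, `spoke_role_arns` would hold a role from the network account so that an investigation started in Account A can also look at the firewall's telemetry.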

In addition, with the Incident Report creation feature announced on October 22, you can automatically generate a report based on the investigation group you created. This is the kind of report that would otherwise have to be prepared manually after an incident.

The report is structured according to industry-standard incident report formats:

Incident Overview:
– A general summary of the incident, including its severity, duration, and operational hypothesis.
Impact Assessment:
– The impact of the incident on customers, services, and business operations.
Detection and Response:
– When and how the incident was detected and how the team responded.
Root Cause Analysis:
– A detailed analysis of the underlying causes of the incident.
Mitigation and Resolution:
– The mitigation steps, resolution measures, and resolution timeframes.
Learning and Next Steps:
– Future recommendations, preventive actions, and improvement plans.
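If your team already has its own post-incident template, the six sections above can be treated as a simple checklist. The sketch below is not how the service produces or validates reports; it is a small hypothetical helper that checks whether a piece of report text mentions each section heading.

```python
# Section headings taken from the report structure described above.
EXPECTED_SECTIONS = [
    "Incident Overview",
    "Impact Assessment",
    "Detection and Response",
    "Root Cause Analysis",
    "Mitigation and Resolution",
    "Learning and Next Steps",
]


def missing_sections(report_text):
    """Return the expected headings that do not appear in the text (case-insensitive)."""
    lowered = report_text.lower()
    return [section for section in EXPECTED_SECTIONS if section.lower() not in lowered]
```

An empty list means every heading was found; anything else tells the reviewer which parts still need to be written by hand.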

When you choose Generate report, the system creates the report. AI combines all extracted facts and produces a comprehensive analysis.

Lastly, let’s talk about the pricing for this feature. The CloudWatch Investigations capability is now generally available at no additional cost.

Now that I’ve explained what this feature does and how it can be activated, I’d like to discuss how incidents can be detected in your account, how notifications can be sent to the people responsible for analyzing captured incidents, and what time-saving options are available to those reviewers. As you can probably guess, one of these options will be CloudWatch Investigations.

Let’s imagine we have an application running in our AWS environment, and we want an incident mechanism to be triggered whenever there’s an issue with this application. To do this, we first need a service to monitor both our application and its infrastructure. Here, CloudWatch Metrics and Logs monitor the infrastructure, while CloudWatch Application Insights monitors the application itself.

Now, suppose we’re already monitoring everything, but we also want to automatically trigger response mechanisms (such as phone calls, Slack or Teams notifications) and manage who will handle the issue when it arises. In that case, we use Systems Manager Incident Manager.
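To make the hand-off from monitoring to Incident Manager concrete, here is a minimal sketch of a CloudWatch alarm whose action is an Incident Manager response plan. It only builds the arguments for CloudWatch's `put_metric_alarm` call and does not contact AWS. The helper names, the alarm name, the 80% threshold, and the evaluation settings are illustrative assumptions, not values from a real environment.

```python
def response_plan_arn(account_id, plan_name):
    """Return the ARN of an Incident Manager response plan (no region component)."""
    return f"arn:aws:ssm-incidents::{account_id}:response-plan/{plan_name}"


def build_cpu_alarm_request(instance_id, account_id, plan_name, threshold=80.0):
    """Build keyword arguments for CloudWatch's put_metric_alarm call (hypothetical helper)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 2,   # two consecutive breaching periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # When the alarm fires, Incident Manager starts an incident from this plan.
        "AlarmActions": [response_plan_arn(account_id, plan_name)],
    }
```

With boto3 installed and credentials configured, the arguments could be applied with `boto3.client("cloudwatch").put_metric_alarm(**build_cpu_alarm_request(...))`. The response plan itself is where contacts, escalation, chat channels, and runbooks are defined.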

Once the issue has been detected and the notification has reached the right person, the next question is how to help the engineer resolve the incident faster. That is where CloudWatch Investigations comes into play.

Now you can see why I mentioned CloudWatch Investigations. AWS offers many different services, and the best part is how seamlessly they can work together as a unified system.

In other words:

Application Insights detected a CPU spike in one of the services and created an alarm.
This alarm triggered Incident Manager, which alerted the on-call engineer and opened a runbook.
The team initiated the incident, and CloudWatch Investigations stepped in to perform an AI-powered root cause analysis. It revealed that the issue was caused by increased database queries following a new deployment, which pointed the team to a solution.

I created this content to explain how this feature fits into a larger system, how it connects to other services, and what can be achieved when they are integrated. In the next article of this series, I will discuss end-to-end incident management in an AWS environment, presenting it through a real customer case.

I hope this article helps save you time.

