Reduce AWS operational issues with OpsCenter
Operating reliable applications in the cloud can be a complicated task. Administrators are responsible for monitoring, maintenance, issue prevention, early anomaly detection, fast remediation, task automation and more.
AWS offers many services and features that help with operational efficiency, but these features are spread across different products and can be difficult to manage. On top of that, there are a number of common tasks that need to be repeated for every application, such as configuring alarms and dashboards, managing notifications and executing scripts.
AWS Systems Manager offers several tools to centralize the management and execution of many areas related to operational efficiency. OpsCenter is a feature within Systems Manager that can be used to monitor, detect, track and remediate specific AWS operational issues.
OpsCenter introduces the concept of OpsItems — operational work items such as events or alerts that require an operations team’s attention. An OpsItem is similar to a ticket, which most systems engineers are familiar with. An OpsItem can be created manually or automatically as a result of a system failure, maintenance task, anomaly or state change, among other events.
Similar to a traditional ticket, OpsItems have a title, description, priority level and grouping mechanism. A key difference is that OpsItems integrate natively with many AWS components, including CloudWatch Events, AWS Config, CloudTrail, Amazon Simple Notification Service (SNS) topics and Systems Manager Automation documents.
How to create OpsItems
Developers can create OpsItems manually through the AWS Management Console or the AWS Command Line Interface (CLI), or automatically by configuring OpsCenter as a target for CloudWatch Events. To get started with OpsCenter, developers have the option to configure a set of common CloudWatch Events that trigger the creation of OpsItems, such as EC2 Instance State-change, Elastic Block Store (EBS) Snapshot Notifications (Copy, Create, Share, Failed), Relational Database Service (RDS) Issue Notification and EC2 Issue Notification. However, there is no direct way to create OpsItems from CloudWatch Alarms, which would be a great improvement to the service.
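For teams that prefer scripting to the console, manual creation can be sketched in Python. This is a minimal sketch, assuming the boto3 SDK is installed and AWS credentials are configured; the helper names build_ops_item_request and create_ops_item, as well as the title, description and source values, are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Optional


def build_ops_item_request(title: str, description: str, source: str,
                           priority: Optional[int] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for the Systems Manager CreateOpsItem call."""
    # The API accepts priorities from 1 (most urgent) to 5.
    if priority is not None and not 1 <= priority <= 5:
        raise ValueError("priority must be between 1 and 5")
    request: Dict[str, Any] = {
        "Title": title,
        "Description": description,
        "Source": source,
    }
    if priority is not None:
        request["Priority"] = priority
    return request


def create_ops_item(request: Dict[str, Any]) -> str:
    """Send the request with boto3 and return the new OpsItem ID."""
    import boto3  # imported lazily so the builder above works without the SDK

    ssm = boto3.client("ssm")
    return ssm.create_ops_item(**request)["OpsItemId"]


# Hypothetical example values.
example_request = build_ops_item_request(
    title="EBS snapshot copy failed",
    description="Nightly snapshot copy to the DR region did not complete.",
    source="ops-team-script",
    priority=2,
)
```

Calling create_ops_item(example_request) would submit the item; building the request separately makes it easy to review or log before anything is sent to AWS.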
A particularly useful feature of OpsItems is built-in deduplication logic. Users can configure deduplication strings to avoid creating repeated events. In some systems, multiple operational issues can be triggered as a result of a single event. Situations like this can create a significant amount of noise and make issue resolution more difficult.
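As an illustration, a deduplication string travels with an OpsItem as operational data under the reserved /aws/dedup key, whose value is a small JSON document. The sketch below only builds that structure; the helper names and the "source:resource:issue" naming scheme are assumptions for this example, not a convention OpsCenter requires.

```python
import json
from typing import Dict


def make_dedup_string(source: str, resource_id: str, issue: str) -> str:
    """Derive one stable string per (source, resource, issue) combination."""
    # Hypothetical scheme: repeated events for the same problem map to the same string.
    return f"{source}:{resource_id}:{issue}"


def dedup_operational_data(dedup_string: str) -> Dict[str, Dict[str, str]]:
    """Wrap a dedup string in the OperationalData shape used by CreateOpsItem."""
    return {
        "/aws/dedup": {
            "Type": "SearchableString",
            "Value": json.dumps({"dedupString": dedup_string}),
        }
    }


# Two events about the same failing volume produce identical dedup data.
first = dedup_operational_data(make_dedup_string("ebs", "vol-0abc", "snapshot-failed"))
second = dedup_operational_data(make_dedup_string("ebs", "vol-0abc", "snapshot-failed"))
```

The resulting dictionary would be passed as the OperationalData argument when the OpsItem is created.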
Assign resources to an OpsItem
OpsCenter is a central place to resolve AWS operational issues, which means an OpsItem isn’t useful unless it has at least one AWS resource assigned to it. For example, users can assign an EC2 instance that requires maintenance or a failing EBS volume to an OpsItem. When an OpsItem is created using CloudWatch Events, a list of associated AWS resources that need to be investigated is created by default. When using the Systems Manager console or CLI, users can specify resource assignment when they create an OpsItem.
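When an OpsItem is created through the API or CLI, related resources are supplied as operational data under the reserved /aws/resources key, with a JSON list of ARNs as the value. A minimal sketch of that structure follows; the helper name and the ARN shown are hypothetical.

```python
import json
from typing import Dict, List


def related_resources_data(arns: List[str]) -> Dict[str, Dict[str, str]]:
    """Build the /aws/resources operational data entry from a list of ARNs."""
    if not arns:
        raise ValueError("at least one resource ARN is expected")
    return {
        "/aws/resources": {
            "Type": "SearchableString",
            "Value": json.dumps([{"arn": arn} for arn in arns]),
        }
    }


# Hypothetical EC2 instance that needs maintenance.
resources = related_resources_data(
    ["arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"]
)
```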
Once AWS resources are assigned to an OpsItem, OpsCenter displays all the relevant resource information. For example, in the case of an EC2 instance, the OpsCenter GUI consolidates all the information that would be visible in the EC2 console for that instance.
There are more than 30 supported resource types that can be associated with an OpsItem. These include AWS CloudFormation stacks, AWS CodePipeline, Lambda functions and more. However, as of publication, not all AWS resource types are supported.
Users can provide additional information by adding operational data to their OpsItem. Operational data consists of reference notes, such as troubleshooting tips or license keys, that users enter as key-value pairs.
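In API terms, each operational data entry is a key mapped to a typed value. The following sketch shows one way to build such entries; the helper name and the sample notes are hypothetical, and the case-insensitive prefix check is a deliberately conservative assumption based on the documented rule that custom keys cannot begin with prefixes such as aws or ssm.

```python
from typing import Dict, Iterable

# Prefixes reserved by the service for its own keys.
RESERVED_PREFIXES = ("amazon", "aws", "amzn", "ssm",
                     "/amazon", "/aws", "/amzn", "/ssm")


def build_operational_data(notes: Dict[str, str],
                           searchable: Iterable[str] = ()) -> Dict[str, Dict[str, str]]:
    """Convert plain key-value notes into the OperationalData structure."""
    searchable_keys = set(searchable)
    data: Dict[str, Dict[str, str]] = {}
    for key, value in notes.items():
        if key.lower().startswith(RESERVED_PREFIXES):
            raise ValueError(f"operational data key uses a reserved prefix: {key}")
        # SearchableString entries can be used to filter OpsItems; String entries cannot.
        value_type = "SearchableString" if key in searchable_keys else "String"
        data[key] = {"Type": value_type, "Value": value}
    return data


# Hypothetical troubleshooting notes.
notes = build_operational_data(
    {"runbook-hint": "Check disk usage first", "team": "storage"},
    searchable=["team"],
)
```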
Solve ops issues with OpsCenter
OpsCenter’s integration with Systems Manager Automation documents is particularly useful for resolving AWS operational issues. Automation documents can be configured to call any AWS API and to define multistep workflows that involve calling more than one AWS API. They can even wait for conditions, such as when an EC2 instance has been terminated or an RDS instance has been launched. Therefore, they can be used to execute potentially complex remediation tasks.
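To make the multistep idea concrete, here is a minimal sketch of an Automation document expressed as a Python dictionary and serialized to JSON. It is an illustrative example rather than one of the prebuilt documents: the step names, the description and the choice of stopping an instance are assumptions. It uses the aws:executeAwsApi action to call an API and the aws:waitForAwsResourceProperty action to wait for a condition.

```python
import json
from typing import Any, Dict


def build_stop_instance_runbook() -> Dict[str, Any]:
    """Return a two-step Automation document: stop an instance, then wait."""
    return {
        "schemaVersion": "0.3",
        "description": "Stop an EC2 instance and wait until it is stopped.",
        "parameters": {
            "InstanceId": {"type": "String", "description": "Instance to stop"},
        },
        "mainSteps": [
            {
                # Step 1: call a single AWS API.
                "name": "stopInstance",
                "action": "aws:executeAwsApi",
                "inputs": {
                    "Service": "ec2",
                    "Api": "StopInstances",
                    "InstanceIds": ["{{ InstanceId }}"],
                },
            },
            {
                # Step 2: poll DescribeInstances until the state matches.
                "name": "waitUntilStopped",
                "action": "aws:waitForAwsResourceProperty",
                "inputs": {
                    "Service": "ec2",
                    "Api": "DescribeInstances",
                    "InstanceIds": ["{{ InstanceId }}"],
                    "PropertySelector": "$.Reservations[0].Instances[0].State.Name",
                    "DesiredValues": ["stopped"],
                },
            },
        ],
    }


runbook_json = json.dumps(build_stop_instance_runbook(), indent=2)
```

The JSON string could then be registered with the Systems Manager CreateDocument API, using the Automation document type and the JSON document format.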
OpsCenter has a list of more than 60 prebuilt Systems Manager Automation documents, also referred to as runbooks. They can automatically handle tasks such as EC2 instance state change, creating a JIRA issue, creating a DynamoDB backup, copying an RDS snapshot and more. Users can take any of those runbooks as a reference and create their own custom remediation options.
OpsCenter also integrates with Amazon SNS, which means OpsItems can be configured to send notifications to an SNS topic whenever there’s an OpsItem update. Notifications can be sent directly to team members via email, text message or popular collaboration tools.
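In the API, the topics to notify are passed as a list of ARNs in the Notifications argument when an OpsItem is created or updated. The sketch below builds that list with a light sanity check on each ARN; the helper name and the topic ARN are hypothetical.

```python
from typing import Dict, List


def build_notifications(topic_arns: List[str]) -> List[Dict[str, str]]:
    """Build the Notifications argument from a list of SNS topic ARNs."""
    notifications = []
    for arn in topic_arns:
        # An SNS topic ARN has six colon-separated parts: arn:partition:sns:region:account:topic
        parts = arn.split(":")
        if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
            raise ValueError(f"not an SNS topic ARN: {arn}")
        notifications.append({"Arn": arn})
    return notifications


# Hypothetical topic that the operations team subscribes to.
notify = build_notifications(["arn:aws:sns:us-east-1:123456789012:ops-alerts"])
```

Email addresses, phone numbers or chat integrations are then managed as subscriptions on the topic itself rather than on the OpsItem.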
OpsCenter is a good fit for ops teams that want to automate tasks but still desire some manual intervention capabilities. While AWS offers the option to react in a fully automated way to certain CloudWatch Alarms or CloudWatch Events, there are situations where teams might want to assess the situation and choose which action to take.