Choosing the Right Monitoring Tools for a Full-Stack Solution

Source – devops.com

According to the MarketsandMarkets report on APM tools, the monitoring tools market is expected to grow to $4.98 billion by 2019, at a CAGR of 12.86 percent. The DevOps market is projected to reach $8,763.8 million by 2023, growing at a CAGR of 18 percent during 2017-2023 (source: markets business insider), with more focus on monitoring and performance management, delivery and operations management, lifecycle management, analytics and other solutions.

Selection Criteria for Full Stack Monitoring Tools

With the number of DevOps tools already available and new ones entering the market, it is essential to choose the right tool for the right need before investing in it and expecting the desired outcomes. Choosing well requires weighing several considerations. The monitoring tool you select should be capable of full-stack, end-to-end monitoring and should enable faster troubleshooting and quick remediation.

The following are some of the key criteria to use when analyzing monitoring tools.

Monitoring Features

The mix of platforms, operating systems, devices and applications in an enterprise can be extremely diverse. Such an environment needs monitoring of infrastructure, platforms, cloud environments, application performance, networks, orchestration, containers, databases and end users, covering heterogeneous workloads and their performance. Check which of these features each monitoring tool supports and whether they suit your needs.

Infrastructure Monitoring

Infrastructure monitoring tools should measure response time, availability, uptime, CPU usage, memory usage, disk usage, process-level usage and load across servers, components, storage, databases, virtual systems and network switches, along with security and user permissions. They should also cover performance and throughput on application and web servers; database-level queries and transactions; and database latency, transfer rate and connection time. The tools should provide the measurements as time series data, with aggregation, process-level drill-down and trend history. A tool that does infrastructure monitoring should be able to provide most of these measurements.
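As a rough illustration of what aggregating time series measurements means in practice, the sketch below groups raw samples into fixed time windows and summarizes each window. The function name `aggregate`, the 60-second window and the sample values are assumptions made for this example; they are not taken from any particular monitoring tool.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples, bucket_seconds=60):
    """Group (timestamp, value) samples into fixed windows and summarize each.

    Hypothetical helper for illustration only.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its window.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {"min": min(vals), "max": max(vals), "avg": mean(vals)}
        for start, vals in sorted(buckets.items())
    }


# Made-up CPU usage samples: (seconds since start, percent used).
cpu_samples = [(0, 20.0), (30, 40.0), (60, 90.0), (90, 70.0)]
summary = aggregate(cpu_samples)
```

Keeping the minimum and maximum alongside the average preserves short spikes that an average alone would hide, which matters when the summary is later used for trend history.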

Network Monitoring

Network monitoring tools should provide performance metrics such as bandwidth, latency, responsiveness, port-level metrics, network packet flow and CPU usage of hosts, and should support custom metrics and aggregation of these metrics. They need a comprehensive platform that works across varied network topologies: heterogeneous (wired and wireless) networks, cloud-based networks and software-defined networks. Choose tools that can auto-discover nodes, devices, routers, firewall rules, ports and protocols; that support automatic rules, thresholds and alerts; and that are application-aware, meaning they can detect the applications and services in the network along with their network paths. They should also offer network flow analysis, provide graphs and tables of network status to determine whether a performance problem arises from the network or from the application itself, and generate a network topology map.

Containers Monitoring

Monitoring containers and microservices includes capturing container-level stats on CPU usage, memory, network and disk usage; the processes running inside the containers; usage of the container hosts; container-level logs; and which containers are running or stopped. It also includes orchestration-level metrics on pod states, cluster stats and container networks; application endpoints at the container, pod and system levels; and performance metrics for containers and the services running in them. Containers run in clusters, and monitoring ephemeral container workloads is necessary in production. Choose tools that provide these capabilities beyond the out-of-the-box monitoring data provided by container orchestrators.

Application Performance Monitoring

Application performance monitoring (APM) is a key component of full-stack monitoring. Logs are collected and centralized, with profiling and tracing available on the application and its dependencies. APM provides performance measurements such as end-user response time, availability, throughput, error rate, page loads, slow pages, error occurrences, third-party JavaScript slowness, browser speed, page bloat and the distribution of the customer base across geographical areas. It also tracks SLAs, checks end-user transactions, simulates end-user actions with synthetic transactions and provides detailed drill-down and correlation data from the end-user experience, giving full visibility across the application stack. At a minimum, choose APM tools that cover profiling, tracing, end-user experience monitoring, log monitoring and log analytics, and that correlate data with drill-down capability on transactions and full-stack visibility.

Cloud Monitoring

Cloud providers offer monitoring capabilities for the workloads running in their cloud environments. These provide log trails, performance metrics, network-level metrics, in-depth visibility of application components and measurements on microservices and PaaS services, and also offer storage and analysis of instrumentation data. The choice of tool can depend on its support for interoperability and integration with multiple cloud providers, the exchange of data between different databases and storage systems, and the transformation and enrichment of data. Also consider whether it visualizes data with correlation and business insights from the end-user experience perspective using a drill-down approach, and whether it performs auto-remediation, auto-discovery, baselining and metric prediction.

Deployment Model and Pricing

Monitoring tools are available in on-premises and SaaS cloud-based deployment models, and in open source and commercial versions. An appropriate tool can be chosen depending on the complexity of configuring it, the maintenance and support needed, and the deployment model and pricing/licensing requirements. There does, however, seem to be an increasing trend toward SaaS-based monitoring tools.

Agent-based/Agentless Approach of Monitoring

Monitoring tools collect data from servers and applications using an agent-based or agentless approach, with either push-based or pull-based collection. The collectors can act as metrics/log shippers, which push the data to a centralized server for storage and processing. An agent-based approach may involve deploying monitoring agents on the machines to be monitored; this can be done through configuration management tools or infrastructure automation scripts such as Chef, Puppet or Ansible. Tools may support auto-discovery of machines, which can register themselves as clients for monitoring. Agent-based monitoring can also help monitor fellow agents. Choose the tool that suits your needs, taking into account the effort needed to maintain agent-based monitoring tools.
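To make the push-based, agent-side flow more concrete, here is a minimal sketch of an agent that reads local metrics and ships them to a central collector. The classes `PushAgent` and `InMemoryCollector`, the host name and the metric values are all hypothetical; the collector is an in-memory stand-in so the example runs without a network, whereas a real agent would send the payload to a remote server.

```python
import json
import time


class InMemoryCollector:
    """Hypothetical stand-in for the centralized server."""

    def __init__(self):
        self.received = []

    def ingest(self, payload: str) -> None:
        # Store the decoded payload, as a central server would persist it.
        self.received.append(json.loads(payload))


class PushAgent:
    """Hypothetical agent that pushes metrics from one monitored machine."""

    def __init__(self, host, read_metrics, collector):
        self.host = host
        self.read_metrics = read_metrics  # callable returning a dict of metrics
        self.collector = collector

    def push_once(self, now=None):
        payload = {
            "host": self.host,
            "timestamp": time.time() if now is None else now,
            "metrics": self.read_metrics(),
        }
        # A real agent would transmit this over the network instead.
        self.collector.ingest(json.dumps(payload))


collector = InMemoryCollector()
agent = PushAgent("web-01", lambda: {"cpu_percent": 42.5}, collector)
agent.push_once(now=1000)
```

In a pull-based setup the roles are reversed: the central server would call into each machine on a schedule rather than wait for agents to report.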

Alerting

Notifications based on predefined thresholds and rules are a must-have for monitoring tools. The tool should be able to send emails and webhooks, integrate with collaboration/ChatOps tools such as Slack or HipChat, and let you define custom thresholds and run batch alerts at intervals.
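Threshold-based alerting can be sketched as a set of rules evaluated against the latest metric values, with all matching alerts collected into one batch. The `Rule` class, the metric names and the threshold values below are assumptions made for illustration; actual tools define rules in their own formats and deliver the batch through email, webhooks or chat integrations.

```python
import operator
from dataclasses import dataclass

# Supported comparisons for a rule.
COMPARATORS = {">": operator.gt, "<": operator.lt}


@dataclass
class Rule:
    """Hypothetical alert rule: fire when a metric crosses a threshold."""
    metric: str
    comparator: str
    threshold: float
    severity: str


def evaluate(rules, metrics):
    """Return one alert message per rule whose condition holds."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not reported in this interval
        if COMPARATORS[rule.comparator](value, rule.threshold):
            alerts.append(
                f"[{rule.severity}] {rule.metric}={value} "
                f"{rule.comparator} {rule.threshold}"
            )
    return alerts


rules = [
    Rule("cpu_percent", ">", 90.0, "critical"),
    Rule("disk_free_gb", "<", 10.0, "warning"),
]
alerts = evaluate(rules, {"cpu_percent": 95.0, "disk_free_gb": 50.0})
```

Running `evaluate` once per interval and sending its result as a single message is one simple way to get batched alerts rather than one notification per breach.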

Dashboards

Monitoring data and metrics can be used for decision-making and stakeholder buy-in, which calls for centralized, end-to-end dashboards covering metrics from different sources and components. Many commercial tools provide aggregated as well as detailed correlation information, along with segregated data for the applications being monitored. The tools must provide out-of-the-box dashboards and let you add or update dashboard views, create graphs and write custom queries as needed.

Customizability

Many monitoring tools provide features for monitoring different workloads. Your choice can also be influenced by how customizable a tool is: whether it supports custom metrics, custom dashboards, custom checks and plugins, customizable workflows, custom thresholds and rules, and custom queries, and whether there is community support for extending its features.

Extensibility

Even with monitoring in place, things may break; hence, monitoring tools should extend to collaboration tools that can alert the team through the proper channel so it can act. Integration with ticket-based controls and incident management workflows, auto-ticketing mechanisms, auto-remediation tools and REST API support are some of the key items to look for when evaluating monitoring tools.

Production Grade

Determine how the monitoring tool performs under the expected load and whether it supports a production cluster setup. The tool should be capable of responding within the defined number of seconds for the defined set of servers and time interval, and of storing time series data collected from different sources (servers, devices, applications, etc.). Also check whether it can scale up depending on the load. Scalability and performance can largely influence the choice of monitoring tools.
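A quick back-of-the-envelope calculation can help frame the load a tool must handle for a given set of servers and collection interval. The sketch below is only an estimate under stated assumptions: the server count, metrics per server, interval and the 16 bytes per uncompressed sample are illustrative numbers, not measurements from any tool.

```python
def ingest_rate(servers, metrics_per_server, interval_seconds):
    """Samples per second the monitoring backend must accept."""
    return servers * metrics_per_server / interval_seconds


def daily_storage_bytes(samples_per_second, bytes_per_sample=16):
    """Raw daily storage, assuming a fixed uncompressed size per sample."""
    seconds_per_day = 86_400
    return samples_per_second * seconds_per_day * bytes_per_sample


# Assumed fleet: 500 servers, 200 metrics each, collected every 10 seconds.
rate = ingest_rate(500, 200, 10)
storage = daily_storage_bytes(rate)
```

With these assumed inputs the backend would need to accept 10,000 samples per second and store roughly 13.8 GB of raw data per day before any compression or downsampling, which gives a baseline to compare against a tool's documented limits.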

Analytics – AIOps

Commercial monitoring tools can apply machine learning techniques to perform anomaly detection on the data, help predict capacity and demand, and maintain inventory based on the metrics. Depending on the advanced use cases the business demands, choose tools that provide these capabilities and can accelerate the maturity of IT operations.
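To give a feel for anomaly detection on metric data, the sketch below flags points that deviate strongly from a rolling baseline using a z-score. This is a simple statistical stand-in chosen for illustration, not the method any commercial tool is known to use; the function name, window size, threshold and latency values are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size=5, z_threshold=3.0):
    """Return indexes of points far from the mean of the preceding window."""
    anomalies = []
    for i in range(baseline_size, len(values)):
        window = values[i - baseline_size:i]
        mu, sigma = mean(window), stdev(window)
        # Skip flat windows, where a z-score is undefined.
        if sigma > 0 and abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Made-up response times in milliseconds with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
anomalies = find_anomalies(latencies_ms)
```

A rolling baseline adapts to gradual drift, but a single spike also inflates the baseline for the next few points, which is one reason production systems tend to use more robust models.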

Ticketing and Incident Remediation

Despite deep monitoring, defects may leak, and SRE teams may have to triage and resolve the issues. Auto-ticketing and auto-remediation workflows can be defined in ITSM tools, which can integrate with the monitoring tools, trigger remediation workflows based on the defined alerts and notify the SRE teams through different channels. Choose monitoring tools that integrate easily with ITSM tools and that provide out-of-the-box workflows, or options to develop custom ones without much complexity, for use cases such as auto-ticketing based on threshold alerts and auto-remediation.
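The auto-ticketing and auto-remediation flow can be sketched as follows: each alert opens a ticket unless one is already open for the same host and check, and known checks map to a remediation action. The `TicketQueue` class, the `INC-` ticket IDs, the check names and the `REMEDIATIONS` table are all hypothetical; a real integration would call the ITSM tool's own API and runbooks.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical mapping from check name to an automated runbook name.
REMEDIATIONS = {"disk_full": "rotate_logs"}


@dataclass
class TicketQueue:
    """Hypothetical stand-in for an ITSM tool's ticketing interface."""
    open_tickets: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def handle_alert(self, host, check):
        key = (host, check)
        if key in self.open_tickets:
            # Repeat alert for an open incident: reuse the existing ticket.
            return self.open_tickets[key]
        ticket_id = f"INC-{next(self._ids)}"
        self.open_tickets[key] = ticket_id
        return ticket_id

    def resolve(self, host, check):
        self.open_tickets.pop((host, check), None)


queue = TicketQueue()
first = queue.handle_alert("db-01", "disk_full")
repeat = queue.handle_alert("db-01", "disk_full")
other = queue.handle_alert("web-01", "high_cpu")
runbook = REMEDIATIONS.get("disk_full")  # None would mean manual triage
```

Deduplicating on host and check keeps a flapping alert from flooding the SRE team with tickets, while checks without a mapped runbook fall through to manual triage.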

Conclusion

When choosing full-stack monitoring tools, choose those that offer monitoring capabilities for varied workloads, including containers and IoT, and a comprehensive platform with a user interface and integrated dashboards. Look also for integration and interoperability between the IT operations tools, ITSM tools and AIOps tools, which provide event correlation and analytics. Choosing tools that are right-sized and that address skill set, technology needs, pricing and deployment models, with a focus on must-have features, will enable SRE teams to triage and remediate issues faster.