Troubleshooting microservices performance problems
Using microservices to build applications can significantly improve developer productivity, project agility and code reuse. However, the resulting architecture is more complex, which makes isolating and debugging performance problems much harder.
As we discussed in an earlier article, analyzing and managing microservice performance is a “two-level problem comprising both granular telemetry of individual microservices and comprehensive, end-to-end, pan-application measurement. Microservice telemetry is used to identify internal bottlenecks, inefficiencies and bugs while application-wide monitoring seeks to see performance from the user’s perspective and identify problems within the microservice ecosystem and its connections.” In this follow-up, we look at specific techniques and tools for troubleshooting microservices performance problems.
Data integration and analysis
The raw material for application performance analysis is data and, particularly when using microservices, the more the better. This means the overall application and individual microservices should be designed with telemetry in mind from the start. Application architects should have a strategy for data generation, collection and analysis, with specific requirements that each microservice developer is expected to follow. The strategy should incorporate both macroscopic, end-to-end performance measures for the application as a whole and internal telemetry from individual microservices. The latter is known as in-situ monitoring: microservice developers add functions to each service that log relevant data and events, which can then be aggregated and analyzed for performance-robbing behaviors.
A microservices performance and troubleshooting plan should incorporate three broad categories of data:
- Metrics that record the services and functions or operations within microservices that failed or exceeded baseline limits and by what amount.
- Logs that detail the sequence of application activity that can be analyzed after an incident to trace the events leading up to a problem. For example, logs can show the specific microservices active during a problem and the API calls used and parameters passed.
- External events affecting an application preceding or during a problem. Integrating with external systems such as code repositories, continuous integration/continuous deployment software or container orchestrators allows developers and IT to determine exogenous factors that could have caused a problem. Examples include a code check-in that generates a new version of a particular microservice, one or more microservices being moved to a different physical machine, or resource starvation caused by other workloads on a microservice cluster or individual server.
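As a rough illustration of in-situ monitoring, the sketch below wraps a service operation so that every call logs a JSON record stating whether it failed and by how much it exceeded its baseline. It uses only the Python standard library. The names `emit`, `instrumented`, the logger name `telemetry` and the record fields are all hypothetical choices for this example, not part of any particular framework.

```python
import json
import logging
import time
from functools import wraps

# Hypothetical logger name; a real service would route this to its log shipper.
logger = logging.getLogger("telemetry")


def emit(record_type, **fields):
    """Serialize one telemetry record as a JSON line and log it."""
    record = {"type": record_type, "ts": time.time(), **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


def instrumented(service, operation, baseline_ms):
    """Decorator: time a call and report how far it exceeded its baseline."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                emit(
                    "metric",
                    service=service,
                    operation=operation,
                    status=status,
                    duration_ms=round(elapsed_ms, 3),
                    # 0.0 when the call stayed within its baseline.
                    over_baseline_ms=round(max(0.0, elapsed_ms - baseline_ms), 3),
                )
        return wrapper
    return decorator


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    @instrumented("orders", "lookup_order", baseline_ms=50)
    def lookup_order(order_id):
        time.sleep(0.01)  # stand-in for real work
        return {"order_id": order_id}

    lookup_order(42)
    # External events can reuse the same record format.
    emit("event", source="ci", detail="new build of orders service deployed")
```

Because metrics and external events share one record format, they can be shipped to the same repository and correlated by timestamp.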
Given the number of microservices a cloud-native application is likely to use and the resulting wide variety of data sources and types, the data is most useful when aggregated into a single repository. A unified data pool increases the effectiveness of log analysis software such as Loggly, Splunk, Sumo Logic or any of the growing ecosystem of services using the open source ELK stack of Elasticsearch (search), Logstash (log management) and Kibana (data visualization). For example, Elasticsearch can use the “fuzzy” operator to find terms that are similar to, but not exactly like, the search terms, and proximity searches to find words or strings within a certain distance of each other. Creative use of powerful search operators can help troubleshooters identify and correlate events that aren’t necessarily in perfect temporal order in log files.
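To make the fuzzy and proximity searches concrete, here is a minimal sketch that builds the two request bodies using Elasticsearch's Query DSL: a `match` query with `fuzziness` and a `match_phrase` query with `slop`. The helper names and the `message` field are assumptions for illustration; the bodies would be sent to an index's `_search` endpoint with whatever client is in use. The same ideas are available in query-string syntax through the `~` operator.

```python
import json


def fuzzy_query(field, term, fuzziness="AUTO"):
    """Match `term` while tolerating small spelling differences."""
    return {"query": {"match": {field: {"query": term, "fuzziness": fuzziness}}}}


def proximity_query(field, phrase, slop):
    """Match the words of `phrase` when they occur within `slop` positions."""
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


if __name__ == "__main__":
    # "message" is an assumed field name for the log text.
    print(json.dumps(fuzzy_query("message", "timeout"), indent=2))
    print(json.dumps(proximity_query("message", "payment failed", 3), indent=2))
```

A fuzzy match helps when services spell the same error slightly differently, while a proximity match finds related words that are separated by variable text such as IDs.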
Other tools and methods
There are several other techniques and automation tools that can be valuable in isolating and solving microservices performance problems, including:
- Application performance management (APM) software is designed to analyze the end-to-end performance of distributed systems and provide response time and throughput metrics from the customer’s external perspective. Some APM software can perform automatic microservice topology discovery to provide a graphical view of microservice interactions. Such topology visualization and discovery is very helpful in understanding the flow of activity and events in complex microservice-based applications. Other advanced APM capabilities include distributed transaction tracing with the ability to drill into individual microservices to see internal state and events.
- Network analysis and management software provides packet tracing at multiple levels of the network stack, from the link and IP layers up to application protocols. For rudimentary analysis, a command line tool like tcpdump is included with or readily available for most Unix-like operating systems. However, Wireshark is a much more powerful open source program that provides a GUI and supports customized plug-ins for analyzing new protocols.
- Synthetic transaction monitoring is a technique adapted from web transaction monitoring in which developers build test scripts to simulate typical user interactions and stress different parts of the microservice functional chain. Scripts can be regularly scheduled and can trigger alerts when performance baselines are exceeded, detecting problems before end users experience them. Synthetic transactions can trace the flow through a microservice application and show microservice interactions by embedding a correlation tag or ID into each test.
- Build metric and logging APIs into each microservice to facilitate debugging by having a consistent interface to every service.
- Build an incident response checklist, taking a cue from airline pilots, who don’t leave important tasks to chance or ad hoc procedures. Specify the sequence of steps developers and IT staff should take with each performance exception and alert. A good sample troubleshooting checklist comes from Netflix, which runs one of the most complex sets of microservice-based applications around.
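The synthetic transaction technique described above can be sketched in a few lines. This example, written under stated assumptions, runs a scripted chain of steps, attaches one correlation ID to every call and raises an alert for any step that exceeds its baseline. The function `run_synthetic_transaction`, the `X-Correlation-ID` header name and the fake steps are hypothetical; in practice each step would be an HTTP or RPC call to a real microservice.

```python
import time
import uuid


def run_synthetic_transaction(steps, baseline_ms, clock=time.perf_counter):
    """Run each (name, callable) step with a shared correlation ID.

    Returns the correlation ID, per-step timings in ms, and alert messages.
    """
    correlation_id = str(uuid.uuid4())
    # Assumed header name; services would log it so the flow can be traced.
    headers = {"X-Correlation-ID": correlation_id}
    timings, alerts = {}, []
    for name, call in steps:
        start = clock()
        call(headers)
        elapsed_ms = (clock() - start) * 1000.0
        timings[name] = elapsed_ms
        limit = baseline_ms.get(name)
        if limit is not None and elapsed_ms > limit:
            alerts.append(
                f"{name} took {elapsed_ms:.1f} ms "
                f"(baseline {limit} ms) [{correlation_id}]"
            )
    return correlation_id, timings, alerts


if __name__ == "__main__":
    # Stand-ins for real service calls.
    def fake_cart(headers):
        time.sleep(0.001)

    def fake_payment(headers):
        time.sleep(0.02)

    cid, timings, alerts = run_synthetic_transaction(
        [("cart", fake_cart), ("payment", fake_payment)],
        baseline_ms={"cart": 50, "payment": 10},
    )
    print(cid, timings)
    for alert in alerts:
        print("ALERT:", alert)
```

Running such a script on a schedule, and searching the aggregated logs for the correlation ID when an alert fires, shows which microservices the test touched and where the time went.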
Troubleshooting microservice-based software requires approaches and tools different from those used for conventional monolithic applications. However, by aggressively logging, collecting and analyzing performance and event data, and by taking a systematic approach to problem solving, adding microservices to your enterprise architecture need not lead to a never-ending maze of troubleshooting confusion.