DevOps & machine learning improve app performance!
Source – devopsonline.co.uk
Machine learning needs big data
When trying to understand and optimise application performance, more data is better. Whether testing an upcoming release or troubleshooting a reported problem, teams and tools often use data samples and averages to create a baseline of normal behaviour and isolate the outliers. However, data samples at one-minute or five-minute intervals in a sub-second world not only miss many of the outliers, they often misrepresent them. Big data, collecting system and resource data every second from all apps, all devices, all transactions, is essential for training machine learning algorithms. Otherwise, they are learning the simulation, not the real world.
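The sampling problem above can be sketched numerically. In this minimal illustration (all numbers are invented for the example), a single three-second stall during an otherwise healthy minute all but disappears in a one-minute average:

```python
# Hypothetical per-second response times (ms) for one minute of traffic,
# with a single brief stall that coarse sampling hides.
response_times = [100] * 60   # baseline: 100 ms every second
response_times[30] = 3000     # one outlier: a 3-second stall

minute_average = sum(response_times) / len(response_times)
worst_second = max(response_times)

print(f"one-minute average: {minute_average:.0f} ms")  # ~148 ms: looks healthy
print(f"worst second:       {worst_second} ms")        # 3000 ms: the real outlier
```

The averaged sample reports roughly 148 ms, comfortably "normal", while the per-second data preserves the 3000 ms outlier that a user actually experienced.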
Machines must be taught, again and again
Teaching is an iterative process: isolate a set of transactions, find what they have in common, remove the transactions sharing that characteristic, and evaluate the remainder to see whether the issue is still present. This enables the tool to catch unanticipated issues, which are then incorporated into its list of automatic insights. The bigger and richer the data, including its metadata, the more accurately the system can pinpoint patterns and causes.
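The iterative loop described above can be sketched in a few lines. This is a simplified illustration, not any vendor's algorithm; the transactions, tags, and the 1000 ms slowness threshold are all invented for the example:

```python
from collections import Counter

# Hypothetical transactions: each has a set of characteristics and a duration (ms).
transactions = [
    {"tags": {"db-call", "eu-west"}, "ms": 3200},
    {"tags": {"db-call", "us-east"}, "ms": 2900},
    {"tags": {"cache-hit", "eu-west"}, "ms": 90},
    {"tags": {"cache-hit", "us-east"}, "ms": 110},
    {"tags": {"db-call", "us-east"}, "ms": 3100},
]
SLOW_MS = 1000  # assumed threshold for "slow"

def isolate_causes(transactions):
    """Repeat: find the characteristic most common among slow transactions,
    remove every transaction carrying it, and re-evaluate the remainder."""
    causes = []
    remaining = list(transactions)
    while any(t["ms"] > SLOW_MS for t in remaining):
        slow = [t for t in remaining if t["ms"] > SLOW_MS]
        tag, _ = Counter(tag for t in slow for tag in t["tags"]).most_common(1)[0]
        causes.append(tag)
        remaining = [t for t in remaining if tag not in t["tags"]]
    return causes

print(isolate_causes(transactions))  # → ['db-call']
```

Here the loop converges after one pass: removing the `db-call` transactions leaves no slow transactions, so that characteristic is flagged as the common factor.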
APM big data and machine learning example
For example, the chart below is a plot of detailed response-time data points collected from 10,000 individual transactions. Humans see a mostly ‘normal’ distribution curve, with a sizeable collection of outliers. Trying to sort these into logical sets would take the DevOps or QA team weeks or months, long after the related issues have impacted the business.
Machine learning can quickly group the outliers into sets, excluding transactions that have no exceptions and identifying areas for further investigation, such as web service timeouts, database timeouts, authorisation timeouts, and initialisation failures. Exploring further, machine learning iteratively groups and regroups sets, looking at all interdependencies, isolating commonalities, and providing details of methods, database calls, servers, or other characteristics that are consistently associated with underperforming transactions. Performing these tasks much faster than humans, the computer learns the most likely characteristics to isolate based on statistical results instead of guesses and anecdotes.
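The first grouping step, sorting outliers into sets by their failure mode while excluding clean transactions, can be shown with a toy example (the transaction records and exception names below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical outlier transactions, each tagged with the exception it raised
# (None means the transaction completed without an exception).
outliers = [
    {"id": 1, "exception": "web-service-timeout"},
    {"id": 2, "exception": "db-timeout"},
    {"id": 3, "exception": None},
    {"id": 4, "exception": "web-service-timeout"},
    {"id": 5, "exception": "auth-timeout"},
]

# Group by exception type, excluding transactions with no exception.
groups = defaultdict(list)
for t in outliers:
    if t["exception"] is not None:
        groups[t["exception"]].append(t["id"])

for exc, ids in sorted(groups.items()):
    print(exc, ids)
```

Each resulting group (web service timeouts, database timeouts, authorisation timeouts) becomes an area for further investigation, while transaction 3, which raised no exception, is set aside.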
Help for DevOps teams
With a big data foundation, machine learning systems can synchronise multiple data sources based on their metadata, linking server statistics, log file entries, and transactions. Armed with this info, DevOps teams can quickly identify and prioritise the users, transactions, and application components with the biggest impact on the business. Historical data enables rapid comparisons of performance before and after releases, providing insight on the impact of development changes and the effectiveness of testing coverage. Machine learning techniques can also turn complex interdependencies into graphical visualisations that humans find much faster and easier to process than reams of tabular data.
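Synchronising data sources on shared metadata can be sketched as a simple join. In this minimal illustration the linking key is a timestamp; the stats, log lines, and transactions are all hypothetical:

```python
# Hypothetical data sources keyed by a shared timestamp (the linking metadata).
server_stats = {1000: {"cpu": 95}, 1001: {"cpu": 40}}
log_entries  = {1000: "ERROR connection pool exhausted", 1001: "OK"}
transactions = [{"ts": 1000, "name": "checkout", "ms": 4200},
                {"ts": 1001, "name": "search",   "ms": 120}]

# Join each transaction with the server stats and log entry at the same instant.
linked = [
    {**t, "cpu": server_stats[t["ts"]]["cpu"], "log": log_entries[t["ts"]]}
    for t in transactions
]

# Prioritise by business impact; here, the slowest transaction.
slowest = max(linked, key=lambda t: t["ms"])
print(slowest)
```

The join immediately pairs the slow `checkout` transaction with the 95% CPU reading and the pool-exhaustion log line from the same second, which is exactly the cross-source context a DevOps team needs to prioritise it.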
For example, when a series of anomalous transactions is identified, detailed data enables the machine to compare key metrics and examine individual users, transactions with similar errors, or transactions running through a specific geography. Linking logs to the transaction data enables DevOps teams to drill quickly into the offending method, server, or network.
In the chart above, machine learning separates consistent and inconsistent characteristics, isolating Method C as the single biggest cause of slowdowns. An important caveat is that each slow transaction may be slow for a different reason; machine learning automates the early stages of data analysis and discovery, enabling the team to focus its resources on resolving the most impactful issues.
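The "consistent vs inconsistent" separation that singles out Method C can be sketched with basic set logic. The transactions below are invented to mirror the chart's scenario, and the 1000 ms threshold is an assumption:

```python
# Hypothetical data: the methods each transaction executed, with its duration (ms).
transactions = [
    {"methods": {"A", "B", "C"}, "ms": 4000},
    {"methods": {"A", "C"},      "ms": 3500},
    {"methods": {"B", "C"},      "ms": 3800},
    {"methods": {"A", "B"},      "ms": 150},
    {"methods": {"A"},           "ms": 120},
]
SLOW_MS = 1000

slow = [t for t in transactions if t["ms"] > SLOW_MS]
fast = [t for t in transactions if t["ms"] <= SLOW_MS]

# A characteristic is consistently associated with slowness if it appears in
# every slow transaction but in no fast one.
in_all_slow = set.intersection(*(t["methods"] for t in slow))
in_any_fast = set.union(*(t["methods"] for t in fast))
consistent = in_all_slow - in_any_fast

print(consistent)  # → {'C'}
```

Methods A and B are inconsistent, appearing in both fast and slow transactions, so they are discarded; Method C is the only characteristic present in every slow transaction and absent from every fast one.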
Machine learning: making sense of APM big data
This is just the beginning of the machine-learning era for APM. However, existing machine learning techniques are already delivering significant value to DevOps teams by analysing APM big data and surfacing the probable causes of performance issues. This rapid identification of issues improves performance reliability, reduces bug rates when used in development environments, and increases user satisfaction with their online experiences.
Data quality matters. Because big data is the foundation of machine learning, it is important to collect all relevant metrics as well as the associated metadata before applying machine learning, and not rely on sampling algorithms that may miss issues in dynamic application environments.