IT incident response ditches root cause analysis process
Source – techtarget.com
NEW YORK — IT incident response must change to keep up with DevOps. The root cause analysis process embraced by many enterprise sysadmins should be the first thing to go.
In the world of monolithic legacy applications, root cause analysis (identifying the specific line of code, switch port or hard drive that set off a domino effect to cause an outage) is the first step during an IT incident response. But as apps evolve into microservices distributed over a complex network infrastructure, IT pros at the Velocity Conference 2017 here this week said this approach to problem-solving no longer works.
“Focus on mitigation and not root cause” as a crisis begins, said Kristopher Beevers, founder and CEO of hosted DNS provider NS1, in a Velocity presentation. “Identify and troubleshoot the service impact and worry about doing a diagnosis later.”
Down with root cause, up with service impact
A key sentiment heard here this week was that IT pros must develop a better sense of the services and functions that are priorities for the business and end users.
If the lists of top articles at the top of the Financial Times homepage are broken, for example, that is what matters to Sarah Wells, principal engineer for the financial industry newspaper, based in London.
“We’ve turned off alerts for service-level errors, status codes and response times,” Wells said. “We start [our analysis] at the top of the stack.”
Splitting responsibility for the infrastructure in this way reduces IT pros’ anxiety during incident response, McBride said. Individual team members don’t worry about everything behind each API that their service communicates with, but they can easily observe the behavior of the API their service calls.
This also gives IT teams a common point of observation and control over systems that span languages and runtimes, and they can apply remediations globally during IT incident response, he said. Moreover, it reduces the importance of a root cause analysis process to find the particular element of the infrastructure that malfunctions behind an API.
The post-postmortem era and the case for incident review
Root cause analysis often discourages IT from finding long-term solutions post-incident and improving troubleshooting processes, said some DevOps engineers here.
“It’s common for a human to be blamed as the root cause of a problem” during this process, said Baron Schwartz, co-founder and CEO of VividCortex, a database monitoring SaaS provider. “Then companies end up firing people instead of improving things, and ‘cover your [butt]’ becomes the priority instead of solving problems.”
For many companies, a more effective IT incident review approach remains a work in progress, but they’ve already begun to move away from the root cause analysis process.
“You often don’t arrive at a single thing, and it’s more important to understand the context, contributing factors, and how to mitigate and remediate the issue,” said Peter Nealon, a solutions architect at Runkeeper, a mobile running app owned by Japanese athletic equipment retailer ASICS.
Some IT pros think the term postmortem to describe the final retrospective phase is passé, and they favor the term incident review.
“We’re learning how to improve versus figuring out what died and why,” Nealon said. “We aren’t aiming for root cause analysis, but to reduce our mean time to remediation.”
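Mean time to remediation is commonly computed as the average gap between when an incident is detected and when its impact is remediated. The following is a minimal sketch of that arithmetic, not a description of Runkeeper's tooling; the function name and the sample timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_remediation(incidents):
    """Average the time between detection and remediation.

    `incidents` is a list of (detected, remediated) datetime pairs.
    This input shape is an assumption made for this sketch.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum(
        (remediated - detected for detected, remediated in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: one 30-minute and one 90-minute incident.
incidents = [
    (datetime(2017, 6, 1, 9, 0), datetime(2017, 6, 1, 9, 30)),
    (datetime(2017, 6, 8, 14, 0), datetime(2017, 6, 8, 15, 30)),
]
mttr = mean_time_to_remediation(incidents)
print(mttr)  # 1:00:00
```

Tracking this number over successive incident reviews is one way a team could tell whether its response process is improving, regardless of whether a single root cause was ever identified.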
Nealon’s team will soon institute an IT incident response process that uses an incident commander to direct troubleshooting, he said. Ideally, developers on the application team will handle troubleshooting tasks, and Runkeeper’s site reliability engineers will serve as subject-matter experts on the underlying platform.
“Our assumption will be that it’s most likely an app issue and not an infrastructure or platform issue,” Nealon said.