IT incident response ditches root cause analysis process

NEW YORK — IT incident response must change to keep up with DevOps. The root cause analysis process embraced by many enterprise sysadmins should be the first thing to go.

In the world of monolithic legacy applications, root cause analysis (identifying the specific line of code, switch port or hard drive that set off a domino effect to cause an outage) is the first step during an IT incident response. But as apps evolve into microservices distributed over a complex network infrastructure, IT pros at the Velocity Conference 2017 here this week said this approach to problem-solving no longer works.

“Focus on mitigation and not root cause” as a crisis begins, said Kristopher Beevers, founder and CEO of hosted DNS provider NS1, in a Velocity presentation. “Identify and troubleshoot the service impact and worry about doing a diagnosis later.”

Down with root cause, up with service impact

A key sentiment heard here this week was that IT pros must develop a better sense of the services and functions that are priorities for the business and end users.

If the lists of top articles on the Financial Times homepage are broken, for example, that’s what matters to Sarah Wells, principal engineer for the London-based financial newspaper.

“We’ve turned off alerts for service-level errors, status codes and response times,” Wells said. “We start [our analysis] at the top of the stack.”

At scale, other members of the IT team become customers of microservices offered via APIs, said Mark McBride, founder of Turbine Labs, a decision support analytics software maker in San Francisco. McBride is a former developer and services engineer for Twitter, Nest Labs and Google.

Splitting responsibility for the infrastructure in this way reduces IT pros’ anxiety during incident response, McBride said. Individual team members don’t worry about everything behind each API that their service communicates with, but they can easily observe the behavior of the API their service calls.

This also gives IT teams a common point of observation and control over systems that span languages and runtimes, and they can apply remediations globally during IT incident response, he said. Moreover, it reduces the importance of a root cause analysis process to find the particular element of the infrastructure that malfunctions behind an API.
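One way to picture observing behavior at the API boundary is a thin wrapper around outbound calls that records whether each call succeeded and how long it took. The following is a minimal sketch, not anything McBride or Turbine Labs described: the `ApiObserver` class, the `fetch_profile` function and the "profiles" API name are all hypothetical and exist only for illustration.

```python
import time
from collections import defaultdict


class ApiObserver:
    """Hypothetical helper: records outcome and latency of each API call."""

    def __init__(self):
        # API name -> list of (succeeded, elapsed_seconds) records
        self.calls = defaultdict(list)

    def call(self, api_name, func, *args, **kwargs):
        """Invoke func, record the result, and re-raise any exception."""
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.calls[api_name].append((False, time.perf_counter() - start))
            raise
        self.calls[api_name].append((True, time.perf_counter() - start))
        return result

    def error_rate(self, api_name):
        """Fraction of recorded calls to api_name that failed."""
        records = self.calls[api_name]
        if not records:
            return 0.0
        failures = sum(1 for ok, _ in records if not ok)
        return failures / len(records)


def fetch_profile(user_id):
    """Stand-in for a remote API; fails for negative IDs (invented behavior)."""
    if user_id < 0:
        raise ValueError("unknown user")
    return {"id": user_id}


observer = ApiObserver()
for uid in (1, 2, -1, 3):
    try:
        observer.call("profiles", fetch_profile, uid)
    except ValueError:
        pass  # the caller decides how to mitigate; the observer only records

print(observer.error_rate("profiles"))  # 0.25
```

The caller never looks behind the API here. It sees only the error rate and latency at the point of the call, which is the kind of shared observation point the paragraph above describes.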

The post-postmortem era and the case for incident review

Root cause analysis often discourages IT from finding long-term solutions post-incident and improving troubleshooting processes, said some DevOps engineers here.

“It’s common for a human to be blamed as the root cause of a problem” during this process, said Baron Schwartz, co-founder and CEO of VividCortex, a database monitoring SaaS provider. “Then companies end up firing people instead of improving things, and ‘cover your [butt]’ becomes the priority instead of solving problems.”

For many companies, a more effective IT incident review approach remains a work in progress, but they’ve already begun to move away from the root cause analysis process.

“You often don’t arrive at a single thing, and it’s more important to understand the context, contributing factors, and how to mitigate and remediate the issue,” said Peter Nealon, a solutions architect at Runkeeper, a mobile running app owned by Japanese athletic equipment retailer ASICS.

Some IT pros think the term postmortem to describe the final retrospective phase is passé, and they favor the term incident review.

“We’re learning how to improve versus figuring out what died and why,” Nealon said. “We aren’t aiming for root cause analysis, but to reduce our mean time to remediation.”
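Mean time to remediation is commonly computed as the average interval between when an incident is detected and when it is remediated. The article does not say how Runkeeper measures it, so the sketch below assumes that definition, and the incident timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_remediation(incidents):
    """Average time from detection to remediation.

    `incidents` is a list of (detected_at, remediated_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: 45, 15 and 60 minutes to remediate.
incidents = [
    (datetime(2017, 6, 1, 9, 0), datetime(2017, 6, 1, 9, 45)),
    (datetime(2017, 6, 8, 14, 0), datetime(2017, 6, 8, 14, 15)),
    (datetime(2017, 6, 20, 2, 30), datetime(2017, 6, 20, 3, 30)),
]

mttr = mean_time_to_remediation(incidents)
print(mttr)  # 0:40:00
```

Tracking this number across incident reviews shows whether mitigation is getting faster, without requiring any single root cause to be named.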

Nealon’s team will soon institute an IT incident response process that uses an incident commander to direct troubleshooting, he said. Ideally, developers on the application team will handle troubleshooting tasks, and Runkeeper’s site reliability engineers will serve as subject-matter experts on the underlying platform.

“Our assumption will be that it’s most likely an app issue and not an infrastructure or platform issue,” Nealon said.
