AIOps and the Kubernetes Revolution

Source: datanami.com

Kubernetes has emerged as the de facto standard for container orchestration in the new cloud architecture. There’s no doubt about it. But hidden from view in the race to K8s adoption is a new form of complexity that threatens to overwhelm human operators.

The way Kubernetes allows users to create, grow, move, and kill entire containerized Linux environments with a few button clicks is revolutionary. The software, which grew out of Google’s experience with its internal Borg cluster manager, makes it stupidly simple to run containerized applications across hundreds or thousands of computers, without tying those applications to specific physical machines.

Compared to the first generation of virtualization technologies from VMware, this is nothing short of amazing. It’s no wonder that VMware just announced a wholesale adoption of Kubernetes in its flagship vSphere virtualization software last week at VMworld. Kubernetes is clearly the future for enterprise-level, scale-out computing.

As Kubernetes spreads across the IT world, organizations will use it for a range of different applications. From Web servers to training AI models, Kubernetes is the underlying stratum upon which the next generation of applications is being built. This has big repercussions for big data, as Kubernetes and S3-compatible object storage systems are widely seen as the replacement for Hadoop (which is based on Google technology that is approaching the 20-year mark).

Despite the large amount of hype around Kubernetes, it’s not all honeydew and sunshine. The technology may simplify deployment of distributed apps in cloud, on-prem, and hybrid environments, but it also introduces its own set of complications. Bernard Golden, the head of cloud strategy for Capital One, touched on them in a 2018 Medium piece:

“While Kubernetes-based applications may be easy to run, Kubernetes itself is no picnic to operate,” Golden wrote. “It is an early stage product which is serving as the foundation of an ecosystem of supplementary products (e.g., Istio) all of which must be installed, configured, and managed correctly to make applications run properly.”

Some of this has to do with Kubernetes being relatively young. But it’s also an unwanted side effect of the very thing that makes Kubernetes so great – the ease of spinning up vast ephemeral cloud environments for scaling applications to the moon. The problem is that the traditional ways IT shops have monitored enterprise applications aren’t going to work with the new Kubernetes paradigm.

Just as VMware’s technology complicated systems monitoring with its new abstraction layer, Kubernetes and the microservices revolution will bring their own share of difficulties to the “Ops” side of the DevOps continuum. They are also opening up business opportunities for companies with machine learning expertise, which are rolling out new monitoring tools that are infused with AI.

“If your organization is currently making the transition to the cloud, this is a great time to also switch to a more container-based architecture,” Dominic Wellington, the director of strategic architecture at AIOps vendor Moogsoft, wrote in an October 2018 blog post. “However, without up-to-date monitoring systems capable of overseeing huge data streams, it will be impossible to maintain the steady uptime and efficiency that enterprise systems require. Even then, your legacy monitoring tools may lack the capability to penetrate and make sense of the data arising within and between containers.”

Kubernetes monitoring has become one of the more popular use cases for Anodot, a machine learning software vendor whose specialty is analyzing time-series data. According to Yuval Dror, Anodot’s head of DevOps and site reliability engineering (SRE), it has become almost impossible to troubleshoot problems in Kubernetes environments using traditional tools.

“When the system is so huge, and you have so many instances that are up, and you dynamically move and add instances on a daily basis, it’s becoming very ridiculous to even try to find issues manually,” Dror says. “It doesn’t make sense these days to observe everything, to track all kinds of metrics in the system. It’s getting very hard. It’s almost impossible to find issues” using existing tools and dashboards.

Instead of visually tracking application metrics like CPU and memory usage on big bright dashboards, operations folks are beginning to rely on machine learning algorithms to detect anomalies in the Kubernetes environment. “It works very well,” Dror tells Datanami. “It’s my personal feeling and I think the company’s vision, that eventually this will be the only way to monitor huge clusters.”

Anodot uses webhooks to grab application data from Kubernetes. Over time, as more data is fed into the machine learning model, it learns what constitutes a normal environment on different timescales, including season-to-season changes. When something goes wrong with the underlying hardware, the operating system, or the application, it will show up in the metrics that Anodot is tracking, and the software will send an alert.
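
To make the idea of learning "normal" on a seasonal timescale concrete, here is a minimal Python sketch. It is not Anodot's algorithm, which the article does not describe; it swaps in a simple per-slot mean and standard deviation baseline as a stand-in. The function names, the 24-sample daily period, the three-sigma threshold, and the CPU figures are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(history, period):
    """Group samples by their position in the seasonal cycle.

    Returns {slot: (mean, standard deviation)}. Each slot needs at
    least two samples, because stdev() requires two data points.
    """
    slots = defaultdict(list)
    for i, value in enumerate(history):
        slots[i % period].append(value)
    return {slot: (mean(vals), stdev(vals)) for slot, vals in slots.items()}


def is_anomaly(baseline, period, index, value, threshold=3.0, min_std=1e-9):
    """Flag a sample that deviates more than `threshold` standard
    deviations from what is normal for its slot in the cycle."""
    mu, sigma = baseline[index % period]
    return abs(value - mu) > threshold * max(sigma, min_std)


# Hypothetical hourly CPU usage (percent) over three days:
# busy from 09:00 to 17:00, quiet otherwise, with slight daily drift.
PERIOD = 24
history = []
for day in range(3):
    for hour in range(PERIOD):
        base = 70.0 if 9 <= hour < 17 else 20.0
        history.append(base + (day - 1))

baseline = build_baseline(history, PERIOD)

# Day 4, 03:00 (sample index 75): 70% CPU is unusual at night.
night_spike = is_anomaly(baseline, PERIOD, 75, 70.0)
# Day 4, 12:00 (sample index 84): the same 70% is normal at midday.
midday_load = is_anomaly(baseline, PERIOD, 84, 70.0)
print(night_spike, midday_load)
```

The point of the sketch is that the same reading can be an anomaly or not depending on when it occurs, which is what a fixed dashboard threshold misses. A production system would presumably handle multiple overlapping seasonalities, trends, and far noisier data than this toy baseline does.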

This technique is one step removed from the root cause of a problem, but the folks at Anodot argue that this sort of correlative approach is the best way to detect problems in increasingly complex and virtualized clusters.

Anodot has several customers using its software for monitoring Kubernetes environments, including a large telco. The company is also working on building new rules-based management capabilities into the software that will automatically take action in response to certain conditions detected in the Kubernetes environment.

In Dror’s view, it’s just a matter of time before this sort of machine learning-based approach to monitoring and management of containerized environments goes mainstream.

“People are moving in this direction because you get better performance, better quality, and all this automation is already built in,” he says. “You don’t have to develop it yourself. Eventually you save a lot of money, because you need less people to manage the cluster.”