5 Ways to Streamline Kubernetes Alerting

Source: it.toolbox.com

The vast majority – more than 80 percent – of companies using containers now also use container orchestration software, and 32.5 percent of those companies are using Kubernetes (K8s).

The growing adoption of K8s is only hindered by the fact that not everyone knows how to use it. A recent survey found that around 50 percent of users believe that they’re not skilled enough to use K8s – and that the platform is too complex for them to grasp.

In order to finish its mission of conquest, Kubernetes needs something to make it simpler and more stable to use – to smooth out the rough edges, in other words. One area that could use help? Alerting.

Right now, there’s no good way to set alerts automatically – when you try it, you either get flooded with false positives or miss genuine anomalies. What’s more, the alerts don’t tend to point at the source of actual problems. Here are some helpful tips on how to streamline the alerting process and make K8s easier to use.

1. Detect Application Issues by Tracking the API Gateway for Microservices

One thing to notice about K8s is that it’s highly abstracted from the physical layer. As such, metrics like CPU, load, and memory are hard to act on directly. Small-scale issues with these metrics will be handled by the K8s auto-healing mechanism, which will scale out and restart pods as needed.

Instead, worry about APIs, which are a much more fundamental yardstick for K8s and its pods. Increasing call errors, latency and request rates all point towards degradation in a component. In addition, these metrics tend towards different baselines for each pod, cluster, region, and data center — making AI the only practical way to keep track of them.

How do you track your baseline? Look to your ingress controllers, which will produce service-level metrics and allow for granular alerting at any level of any API. AI analytics will be able to look at this data and alert on anomalies that would be hidden by static thresholds or a changing baseline.
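The idea of a per-service baseline can be illustrated with a small sketch. It uses a plain rolling mean and standard deviation as a stand-in for the AI analytics described above, which will be considerably more sophisticated. The class name, the window size, the threshold and the sample latencies are all hypothetical, and the values are assumed to have already been collected from an ingress controller.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flags values that deviate sharply from a rolling per-series baseline."""

    def __init__(self, window=30, threshold=3.0, min_samples=10):
        self.window = window            # how many recent values form the baseline
        self.threshold = threshold      # deviation, in standard deviations, that counts as anomalous
        self.min_samples = min_samples  # stay silent until the baseline has enough history
        self.history = {}               # series key -> deque of recent values

    def observe(self, key, value):
        """Record a value and return True if it is anomalous for this series."""
        values = self.history.setdefault(key, deque(maxlen=self.window))
        anomalous = False
        if len(values) >= self.min_samples:
            mu = mean(values)
            sigma = stdev(values)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # Perfectly flat baseline: any change at all is a deviation.
                anomalous = value != mu
        values.append(value)
        return anomalous


# Hypothetical latency samples (ms) for one service in one region.
detector = RollingBaseline()
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 101, 450]
flags = [detector.observe(("checkout", "eu-west"), v) for v in latencies_ms]
```

Because each (service, region) key keeps its own history, a latency that is normal for one API can still be flagged for another, which is the point of tracking baselines rather than a single static threshold.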

2. Stop Measuring Individual Containers

Part of the appeal of a K8s system is that it can create, destroy and scale containers largely on its own. As a result, there’s almost no point in measuring the resource consumption of an individual container or pod – the baseline changes quickly and then the container is gone. Outlier detection is the one exception here: comparing each container against its peers avoids false positives when a KPI shifts across the entire cluster.

For the most part, you want to monitor the entire set of containers. In this context, cAdvisor is a useful tool that can provide resource consumption metrics such as CPU, memory, and network usage. By using the data from cAdvisor in combination with AI analytics, you’ll be able to alert on the most worrisome resource usage – no more crawling out of bed every time a CPU peaks.
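As a rough illustration of comparing containers against their peers instead of against a fixed threshold, the sketch below applies a modified z-score (median and median absolute deviation) to one metric across a set of pods. This is one common outlier technique, not necessarily the one any particular monitoring product uses. The function name, the threshold and the pod names are hypothetical, and the CPU figures are assumed to have been collected from cAdvisor beforehand.

```python
from statistics import median


def find_outliers(usage_by_pod, threshold=3.5):
    """Return pods whose usage deviates strongly from the rest of the group."""
    values = list(usage_by_pod.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that one bad pod cannot distort.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # all pods effectively identical, nothing to compare against
    return [
        pod
        for pod, value in usage_by_pod.items()
        if 0.6745 * abs(value - med) / mad > threshold
    ]


# Hypothetical CPU usage (cores) for five replicas of the same service.
normal_load = {"web-1": 0.41, "web-2": 0.39, "web-3": 0.40, "web-4": 0.42, "web-5": 0.95}
# The same replicas after a cluster-wide traffic increase: everyone moved together.
shifted_load = {"web-1": 0.81, "web-2": 0.79, "web-3": 0.80, "web-4": 0.82, "web-5": 0.80}

suspects = find_outliers(normal_load)
after_shift = find_outliers(shifted_load)
```

In the first data set only web-5 stands apart from its peers. In the second, usage has doubled everywhere but no single pod is flagged, so a cluster-wide change does not page anyone on a per-container basis.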

3. Understand Whether Hiccups are Actually Trends

Your containers will fail every so often even under normal conditions – but are these failures symptoms of an underlying cause?

All K8s pods have a status field which shows where they are within the pod lifecycle. A pod can be Pending, Running, Succeeded, Failed or Unknown (the last typically means that the control plane cannot communicate with the node the pod is running on). If a pod has failed, you can query the reason, which will be a short code such as OOMKilled, indicating that a container exceeded its memory limit.

Using your monitoring tool, you can begin to aggregate the status of your pods and the reasons that they fail over time. Once a variety of reasons start to accumulate or to repeat regularly, it’s time to address systemic issues.
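A minimal sketch of that aggregation might look like the following. It assumes the failure events have already been exported from a monitoring tool as (timestamp, reason) pairs. The function names, the three-hour cutoff and the sample events are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def failures_by_hour(events):
    """Group (timestamp, reason) failure events into per-hour reason counts."""
    buckets = defaultdict(Counter)
    for timestamp, reason in events:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour][reason] += 1
    return buckets


def recurring_reasons(buckets, min_hours=3):
    """Return reasons that show up in at least `min_hours` distinct hours."""
    hours_seen = Counter()
    for counts in buckets.values():
        hours_seen.update(counts.keys())  # each reason counts once per hour
    return sorted(reason for reason, n in hours_seen.items() if n >= min_hours)


# Hypothetical failure events for one deployment.
events = [
    (datetime(2024, 1, 1, 1, 5), "OOMKilled"),
    (datetime(2024, 1, 1, 2, 10), "OOMKilled"),
    (datetime(2024, 1, 1, 2, 20), "Error"),
    (datetime(2024, 1, 1, 2, 40), "OOMKilled"),
    (datetime(2024, 1, 1, 3, 15), "OOMKilled"),
]

buckets = failures_by_hour(events)
trends = recurring_reasons(buckets)
```

Here the single Error is treated as a hiccup, while OOMKilled recurs across three separate hours and is surfaced as a trend worth investigating.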

4. High Disk Usage Can Never Be Ignored

Most K8s administrators already set an alert when disk usage reaches 75-80 percent. This is a no-brainer: high disk usage (HDU) can affect every K8s cluster — usually representing an application issue — and will always require your personal attention. This is especially true if you’ve containerized your database — HDU issues in StatefulSets can be catastrophic.

If everyone is already alerting on HDU, why are we mentioning it in this article? It’s simply because using AI analytics can give you more time to “gird your loins”, as it were. Getting early warning on meaningful changes in disk usage can allow more time for you to fix issues with your application before unexpected downtime ensues.
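To make the "early warning" idea concrete, here is a simple sketch that fits a straight line to recent disk usage samples and estimates how long remains until the volume fills. A least-squares trend is a deliberately basic substitute for the AI analytics discussed here. The function name and the sample readings are hypothetical.

```python
def hours_until_full(samples, capacity_pct=100.0):
    """Estimate hours from the last sample until usage reaches capacity.

    `samples` is a list of (hours_elapsed, used_pct) pairs.
    Returns None when there is no upward trend to extrapolate.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None  # all samples share one timestamp
    # Ordinary least-squares slope: percentage points of disk consumed per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None  # usage is flat or shrinking
    intercept = mean_u - slope * mean_t
    time_full = (capacity_pct - intercept) / slope
    return max(time_full - samples[-1][0], 0.0)


# Hypothetical hourly readings: usage is only 56 percent, well below a 75-80 percent alert.
readings = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
remaining = hours_until_full(readings)
```

With these readings the volume is growing by two percentage points an hour, so the estimate comes out at 22 hours, long before a static 75-80 percent threshold would have fired.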

5. You Can Never Not Monitor Kube-System

As we mentioned in the introduction, many people find K8s complicated. They’re right! The kube-system components themselves can throw off some complicated errors. Issues might occur due to DNS bottlenecks, network overload and massive etcd failures.

Once a cluster fails, getting it back online is a project, to say the least. Best to avoid it. To do this, identify issues before they happen. In particular, track the load average, memory and disk size. In addition to resource usage – see tip #2 – you need to monitor patterns that are specific to the kube-system itself.

Fortunately, you can scrape metrics directly from the kube-system components using their “/metrics” endpoints. Here’s what you should track:

DNS KPIs:

Total DNS Requests – no matter what flavor of DNS you’re using, a sudden change in this metric may indicate a resource issue, scaling limits or an application bug.
DNS Request Time – latency is your enemy. Not only will it aggravate users on the front end, but it can also break an application on the back end.

etcd KPIs:

Failed Proposals – etcd is fairly robust and can recover from multiple container failures with only a small outage. Too many failures will cause unrecoverable downtime, however. Alerting on failed proposals will help you understand what to worry about.
Snapshot Duration – snapshots represent the process of etcd backing itself up. If snapshots take too long, then you may have a problem on your disk.
Message Failures – an increase in message failures suggests network issues that have the potential to topple the cluster itself.
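The sketch below shows one way the scraped text could be turned into numbers to alert on. It parses two scrapes in the Prometheus text format and computes how much a counter grew between them. The metric names used in the sample (etcd_server_proposals_failed_total and coredns_dns_requests_total) are assumptions that depend on the etcd and CoreDNS versions in use, the sample values are made up, and the parser is deliberately minimal: it assumes no trailing timestamps on sample lines.

```python
def parse_metrics(text):
    """Parse Prometheus text-format output into {series: value}.

    Assumes each sample line is 'name{labels} value' with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if not series:
            continue
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # not a numeric sample, ignore it
    return samples


def counter_increase(previous, current, name):
    """Sum the growth of a counter across all of its label sets between two scrapes."""
    total = 0.0
    for series, value in current.items():
        if series == name or series.startswith(name + "{"):
            before = previous.get(series, 0.0)
            # A counter that went backwards was reset, e.g. by a restart.
            total += (value - before) if value >= before else value
    return total


# Hypothetical scrapes taken one interval apart.
scrape_1 = """
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 4
coredns_dns_requests_total{server="dns://:53",type="A"} 1000
"""
scrape_2 = """
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 9
coredns_dns_requests_total{server="dns://:53",type="A"} 1600
"""

before, after = parse_metrics(scrape_1), parse_metrics(scrape_2)
failed_proposals = counter_increase(before, after, "etcd_server_proposals_failed_total")
dns_requests = counter_increase(before, after, "coredns_dns_requests_total")
```

The per-interval increases, rather than the raw counter totals, are what you would feed into alerting, since the totals only ever go up.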

Conclusion

Kubernetes is a fantastic tool and framework, and absolutely worth the time you spend working on it. Its resilience, plus its ability to remediate minor issues automatically, makes it an ideal platform for a revolutionized application architecture.

By adopting automated AI monitoring, you help to make this platform even more stable – and much easier to use and maintain. It’s our hope that AI monitoring will become one of the default integrations with the K8s platform, vastly improving the health of complex implementations.