Why operations metrics and telemetry matter in DevOps too

Source – techtarget.com

I am not a faddist or a Kool-Aid drinker. I’m a skeptic by career. But monitoring and observability are “hot right now,” and validly so. We have to think scientifically about how to create systems that are more reachable, more observable and able to tell us more under scrutiny. Charity Majors, among others, talks about this, and I highly recommend her work.

But it’s time to elevate operations metrics and monitoring and make them first-class citizens of the systems development lifecycle (SDLC) by moving them to the left.

Shift left testing
We’ve talked about “shift left testing” and the promise of a new “Testing Manifesto” before at DevOps Agenda.

The underlying reasoning is that if you test in parallel with development, you shorten your feedback loop and get feedback faster. By knowing and trusting your feedback earlier, you can make required changes and fixes that much earlier. It’s an efficiency approach aimed at reducing waste.

But what about the rest of the SDLC and continuous delivery (CD) activities? What does this mean for security and operations?

Reduce your feedback loop
If you’ve cut the feedback loop between development and testing, how do you think operations is doing, still tacked onto the end?

Before we started doing Agile and DevOps, we had monolithic projects with monolithic phases. Development would take months and testing would take months. Then reviews and approvals would take months, and finally, after six months if you were lucky or as long as two years, you would go live in a real production system on real infrastructure. That was the first time your real production monitoring actually started.

Consider that feedback loop.

That is one reason more of us are now doing Agile and DevOps: to reduce the size of units of work and the length of feedback loops, which in turn enables us to deliver value to our customers faster. There are more reasons, but this is a primary one.

Yet sometimes, even as we automate all our tests, deployments and infrastructure and emulate a near-perfect DevOps utopia, we have not treated operations metrics and the monitoring of production systems (also called “telemetry” in Gene Kim’s The DevOps Handbook) as equal to the rest.

Sometimes operations folks, whether due to others’ choices or their own, have to slap on production telemetry at the very end, as the system hits production. In these cases, essential system behavior has not been defined, implemented or tested early. The system hasn’t been tested for validity or feasibility, and it hasn’t been baselined early. And sometimes this results in continuous blips or even significant production outages and service-level agreement (SLA) violations.

Telemetry and code in tandem
Consider defining your essential operations metrics, which fall under the general banner of nonfunctional requirements, as early as your functional requirements. Include in your specification what the system will do and how it will do it, and allow your developers to build the telemetry in as early as the functional code and your automated testing.

Whether you deploy first to a test environment or straight to production in continuous deployment, you want your telemetry to work early and meaningfully. You should add telemetry as you add more code. Do it in tandem and reap the rewards of measuring and understanding your operations metrics from the start.
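To make “in tandem” concrete, here is a minimal Python sketch of a function that ships with its own telemetry from the first commit. Everything in it is an assumption made for illustration: the `Telemetry` class is a hypothetical in-memory stand-in for whatever metrics client you actually use, and `process_order` and the metric names are invented.

```python
import time
from collections import defaultdict


class Telemetry:
    """Hypothetical in-memory metrics registry (a stand-in for a real client)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, value=1):
        # Count an event, such as a processed or failed order.
        self.counters[name] += value

    def observe(self, name, seconds):
        # Record one duration sample for a named operation.
        self.timings[name].append(seconds)


telemetry = Telemetry()


def process_order(order, metrics=telemetry):
    """Illustrative business function, instrumented as it is written."""
    start = time.perf_counter()
    try:
        if order.get("quantity", 0) <= 0:
            raise ValueError("quantity must be positive")
        total = order["quantity"] * order["unit_price"]
        metrics.incr("orders.processed")
        return total
    except Exception:
        # Failures are counted too, so error rates are visible from day one.
        metrics.incr("orders.failed")
        raise
    finally:
        # Latency is recorded for successes and failures alike.
        metrics.observe("orders.latency_seconds", time.perf_counter() - start)
```

Because the registry is passed in as a parameter, the same automated tests that check the order total can also assert that the counters and timings were emitted, which is one way to test telemetry as early as the functional code.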

System administrators will tell you how difficult it is to deal with meaningless, false or noisy alarms, or even false negatives, where no alarm fires when one should. So start testing and fine-tuning your operations metrics early, tweak and learn from them continuously, and get your machine humming.
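One way to start that fine-tuning early is to replay a candidate alert threshold against labeled history from a test environment and count how often it would have been wrong. The sketch below is only an illustration: the latency samples, the incident labels and the candidate thresholds are all made up, and `evaluate_threshold` is a hypothetical helper, not part of any monitoring tool.

```python
def evaluate_threshold(samples, threshold):
    """Count wrong alerting decisions for a simple 'value > threshold' rule.

    samples is a list of (value, was_real_incident) pairs.
    Returns (false_positives, false_negatives).
    """
    # Alarm fired, but nothing was actually wrong: noise.
    false_positives = sum(
        1 for value, incident in samples if value > threshold and not incident
    )
    # Something was wrong, but no alarm fired: a missed incident.
    false_negatives = sum(
        1 for value, incident in samples if value <= threshold and incident
    )
    return false_positives, false_negatives


# Invented response times in milliseconds, labeled after the fact.
history = [
    (120, False), (180, False), (250, False),
    (400, True), (650, True), (900, True),
]

for threshold in (150, 300, 700):
    fp, fn = evaluate_threshold(history, threshold)
    print(f"threshold={threshold}ms false_positives={fp} false_negatives={fn}")
```

With this invented data, the lowest threshold is noisy and the highest one misses real incidents. Real baselines are rarely this clean, which is the argument for collecting them well before production.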

This is never an absolute or a finite destination in DevOps, but if you can shift operations left, in line with development and testing, you’ve saved yourself significant waste and rework.