The six pillars of cloud cost optimisation – and how to get them to work for you

Let me start by painting the picture: You’re the CFO, or manager of a department, group, or team, and you’re ultimately responsible for any and all financial costs incurred by your team/group/department. Or maybe you’re in IT and you’ve been told to keep a handle on the costs generated by application use and code development resources.

Your company has moved some or all of your projects and apps to the public cloud, and since things seem to be running pretty smoothly from a production standpoint, most of the company is feeling pretty good about the transition.

Except you.

The promise of moving to the cloud to cut costs hasn’t matriculated and attempting to figure out the monthly bill from your cloud provider has you shaking your head.

From reserved instances and on-demand costs, to the “unblended” and “blended” rates, attempting to even make sense of the bill has you no closer to understanding where you can optimise your spend.

It’s not even just the pricing structure that requires an entire department of accountants to make sense of, the breakdown of the services themselves is just as mind boggling. In fact, there are at least 500,000 SKUs and price combinations in AWS alone! In addition, your team likely has no limitation on who can spin up any specific resource at any time, intrinsically compounding the problem—especially when staff leave them running, the proverbial meter racking up the $$ in the background.

Addressing this complex and ever-moving problem is not, in fact, a simple matter, and requires a comprehensive and intimate approach that starts with understanding the variety of opportunities available for cost and performance optimisation. This where our six pillars of cloud optimisation come in.

Reserved instances (RIs)
AWS Reserved Instances, Azure Reserved VM Instances, and Google Cloud Committed Use Discounts take the ephemeral out of cloud resources, allowing you to estimate up front what you’re going to use. This also entitles you to steep discounts for pre-planning, which ends up as a great financial incentive.

Most cloud cost optimisations, erroneously, begin and end here—providing you and your organisation with a less than optimal solution. Resources to estimate RI purchases are available through cloud providers directly and through third party optimisation tools. For example, CloudHealth by VMware provides a clear picture into where to purchase RI’s based on your current cloud use over a number of months and will help you manage your RI lifecycle over time.

Two of the major factors to consider here are Risk Tolerance and Centralised RI Management portfolios.

Risk tolerance refers to identifying how much you’re willing to spend up front in order to increase the possibility of future gains or recovered profits. For example, can your organisation take a risk and cover 70% of your workloads with RIs? Or do you worry about consumption, and will therefore want to limit that to around 20-30%? Also, how long, in years, are you able to project ahead? One year is the least risky, sure, but three years, while also a larger financial commitment, comes with larger cost savings

Centralised RI management portfolios allow for deeper RI coverage across organisational units, resulting in even greater savings opportunities. For instance, a single application team might have a limited pool of cash in which to purchase RIs. Alternatively, a centralised, whole organisation approach would cover all departments and teams for all workloads, based on corporate goals. This approach, of course, also requires ongoing communication with the separate groups to understand current and future resources needed to create and execute a successful RI management program
Once you identify your risk tolerance and centralise your approach to RI’s you can take advantage of this optimisation option. Though, an RI-only optimisation strategy is short-sighted. It only allows you to take advantage of pricing options that your cloud vendor offers. It is important to overlay RI purchases with the five other optimisation pillars to achieve the most effective optimisation.

One of the benefits of the cloud is the ability to spin up (and down) resources as you need them. However, the downside of this instant technology is that there is very little incentive for individual team members to terminate these processes when they are finished with them. Auto-Parking refers to scheduling resources to shut down during the off hours—an especially useful tool for development and test environments. Identifying your idle resources via a robust tagging strategy is the first step; this allows you to pinpoint resources that can be parked more efficiently. The second step involves automating the spin-up/spin-down process. Tools like ParkMyCloud, AWS Instance Scheduler, Azure Automation, and Google Cloud Scheduler can help you manage the entire auto-parking process.

Ah, right-sizing, the best way to ensure you’re using exactly what you need and not too little or too much. It seems like a no-brainer to just “enable right-sizing” immediately when you start using a cloud environment. However, without the ability to analyse resource consumption or enable chargebacks, right-sizing becomes a meaningless concept. Performance and capacity requirements for cloud applications often change over time, and this inevitably results in underused and idle resources.

Many cloud providers share best practices in right-sizing, though they spend more time explaining the right-sizing options that exist prior to a cloud migration. This is unfortunate as right-sizing is an ongoing activity that requires implementing policies and guardrails to reduce overprovisioning, tagging resources to enable department level chargebacks, and properly monitoring CPU, Memory and I/O, in order to be truly effective.

Right-sizing must also take into account auto-parked resources and RIs available. Do you see a trend here with the optimisation pillars?
Instance types, VM-series and “Instance Families” all describe methods by which cloud providers package up their instances according to the hardware used. Each instance/series/family offers different varieties of compute, memory, and storage parameters. Instance types within their set groupings are often retired as a unit when the hardware required to keep them running is replaced by newer technology. Cloud pricing changes directly in relationship to this changing of the guard, as newer systems replace the old. This is called Family Refresh.

Up-to-date knowledge of the instance types/families being used within your organisation is a vital component to estimating when your costs will fluctuate. Truth be told, though, with over 500,000 SKU and price combinations for any single cloud provider, that task seems downright impossible.

Some tools exist, however, that can help monitor/estimate Family Refresh, though they often don’t take into account the overlap that occurs with RIs—or upon application of any of the other pillars of optimisation. As a result, for many organisations, Family Refresh is the manual, laborious task it sounds like. Thankfully, we’ve found ways to automate the suggestions through our optimisation service offering.

Related to the issue of instances running long past their usefulness, waste is prevalent in cloud. Waste may seem like an abstract concept when it comes to virtual resources, but each wasted unit in this case = $$ spent for no purpose. And, when there is no limit to the amount of resources you can use, there is also no incentive to individuals using the resources to self-regulate their unused/under-utilised instances. Some examples of waste in the cloud include:

AWS RDSs or Azure SQL DBs without a connection
Unutilised AWS EC2s
Azure VMs that were spun up for training or testing
Dated snapshots that are holding storage space that will never be useful
Idle load balancers
Unattached volumes
Identifying waste takes time and accurate reporting. It is a great reason to invest the time and energy in developing a proper tagging strategy, however, since waste will be instantly traceable to the organisational unit that incurred it, and therefore, easily marked for review and/or removal. We’ve often seen companies buy RIs before they eliminate waste, which, without fail, causes them to overspend in cloud – for at least a year.

Storage in the cloud is a great way to reduce on-premises hardware spend. That said, though, because it is so effortless to use, cloud storage can, in a very short matter of time, expand exponentially, making it nearly impossible to predict accurate cloud spend. Cloud storage is usually charged by four characteristics:

Size – how much storage do you need?
Data transfer (bandwidth) – how often does your data need to move from one location to another?
Retrieval time – how quickly do you need to access your data?
Retrieval requests – how often do you need to access your data?
There are a variety of options for different use cases including using more file storage, databases, data backup and/or data archives. Having a solid data lifecycle policy will help you estimate these numbers, and ensure you are both right-sizing and using your storage quantity and bandwidth to its greatest potential at all times.

So, you see, each of these six pillars of optimisation houses many moving parts, and what with public cloud providers constantly modifying their service offerings and pricing, it seems wrangling in your wayward cloud is unlikely. Plus, optimising only one of the pillars without considering the others offers little to no improvement, and can, in fact, unintentionally cost you more money over time. An efficacious optimisation process must take all pillars and the way they overlap into account, institute the right policies and guardrails to ensure cloud sprawl doesn’t continue, and implement the right tools to allow your team regularly to make informed decisions.

The good news is that the future is bright. Once you have assessed your current environment, taken the pillars into account, made the changes required to optimise your cloud, and found a method by which to make this process continuous, you can investigate optimisation through application refactoring, ephemeral instances, spot instances and serverless architecture.