5 ways to lose data on Kubernetes—and how to avoid them
Data is the lifeblood of modern applications. And if your critical databases and stateful applications are migrating to Kubernetes, the sorrows of accidental data loss aren't far behind.
For database people, "data" means any persistent state that the database depends on to function properly. Think tables in MySQL, znodes in ZooKeeper, or column storage in a data warehouse. If the bits are gone, applications that depend on them stop working.
“Loss” means the data is gone or at least unreachable. Gone forever is, of course, the worst case. But databases can lose data in other ways. You could lose some of your data, such as the records for all of your e-commerce sales in the last 24 hours. Or your data might not really be lost, just inaccessible for a period of time. Imagine not being able to access your bank account to withdraw money. When thinking about data loss, consider all scenarios and prevent them to the extent possible.
So what can you do to protect your data? As longtime database engineers and authors of a Kubernetes operator for a popular database, our team has explored the ways you can lose data on Kubernetes, and how to prevent them.
Here are the five most common ways your organization could lose data on Kubernetes, and how to make sure it doesn’t happen to you.
1. The single-copy catastrophe
As anyone who has lost a laptop knows, the time-honored way to lose data is to have just one copy. This isn’t specific to Kubernetes, of course, but Kubernetes makes it easy to set up databases that you can lose later. Starting a MySQL database is as simple as the following command:
helm install prod-mysql stable/mysql
On most Kubernetes implementations, you’ll see a running MySQL database with a single persistent volume (PV) in a few seconds. But this easy setup hides a dangerous fragility. File system corruption, hardware failures, accidental data deletion, or administrative error can all eventually erase your data.
It’s a catastrophe waiting to happen.
Databases have addressed this problem for decades with backups. The best backups store data from a specific point in time plus a log of changes since your last backup so you don’t lose your most recent data.
Backups take time to restore and don’t work well for large volumes of data, however, so modern databases often depend on replicas to prevent data loss. Instead of a single database, you might have one or more live replicas that share changes automatically. If one replica fails or is damaged, your applications can quickly switch to another. This is a powerful way to avoid even temporary data loss.
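If your storage driver supports it, Kubernetes can take point-in-time copies natively through the CSI snapshot API. The sketch below assumes your cluster has the snapshot CRDs installed, a snapshot-capable CSI driver, and a VolumeSnapshotClass; the class and PVC names here are hypothetical.

```yaml
# Point-in-time snapshot of a database PVC.
# Assumes: CSI snapshot CRDs installed, a VolumeSnapshotClass named
# "csi-snapclass", and a PVC named "data-prod-mysql-0" (both illustrative).
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: prod-mysql-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-prod-mysql-0
```

Keep in mind that a volume snapshot usually lives on the same storage system as the original volume, so it complements, rather than replaces, a real backup stored elsewhere.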
The first step to preventing data loss on Kubernetes, therefore, is to use backups, replicas, or both, as suits your database. Find and use good tools. Beyond good procedures, active paranoia is important. This is true no matter where your databases run.
2. Blast-radius blues
Having copies opens the door to another kind of data loss: losing your original data and your copies together. One of my favorite terms from high-availability engineering applies here: the blast radius. Say you have live replicas that use network-attached storage from a single storage array. If that array fails, all of your replicas disappear as well, because they fall within its blast radius.
The obvious cure is to separate your copies by distance. One common approach: Put your replicas in separate data centers several miles apart. Kubernetes supports such “availability zones” well. Managed Kubernetes offerings such as Amazon EKS and open-source tools such as kops can easily set up hosts across availability zones. Use them in every production deployment.
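Spreading hosts across zones only helps if the volumes follow the pods. A StorageClass with `volumeBindingMode: WaitForFirstConsumer` delays volume creation until a pod is scheduled, so each replica's PV is provisioned in that replica's zone rather than all landing in one. This sketch assumes the AWS EBS CSI driver; the class name is illustrative.

```yaml
# StorageClass sketch: WaitForFirstConsumer ties each PV's zone to the
# zone where its consuming pod is scheduled.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zoned-ssd              # illustrative name
provisioner: ebs.csi.aws.com   # assumes the AWS EBS CSI driver
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```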
While availability zones are a good step to protect copies, that’s not always enough. The best level of protection against blast-radius problems is to spread data across geographic regions. If you need very high assurance, spread your replicas across regions with fully separate Kubernetes clusters. At the very least, ensure that your backups are outside the blast radius of your primary region.
3. Affinity afflictions
Separate availability zones are great, but only if you use them. Fortunately, Kubernetes offers two powerful features that can help you distribute data across zones. Affinity “pins” databases (including data) to specific locations. Anti-affinity pushes databases away from locations. But you must define rules carefully or they will not place data where you expect.
Affinity settings are very flexible and offer multiple ways to achieve proper placements. The best way is to let users explicitly assign replicas to specific availability zones. This is the easiest pattern to understand.
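In practice, explicit zone assignment is a node-affinity rule keyed on the standard topology label. The pod-spec fragment below is a minimal sketch; the zone value is illustrative and would differ per replica.

```yaml
# Pod-spec fragment: pin this replica to a single availability zone
# using the standard topology label. Zone value is illustrative.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a
```

Giving each replica its own rule like this makes placement obvious at a glance, which is exactly why it's the easiest pattern to reason about.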
The Kubernetes community has embraced the operator pattern to make it easier to handle complex configurations such as affinity and anti-affinity rules. Operators are also becoming a popular way to set up databases on Kubernetes with best-practice configuration out of the box.
If there’s an operator available for your database, check it out. Otherwise, look for other tooling to set up the database so you don’t have to start from scratch. As always, stay paranoid. Test any tools you use carefully to ensure they handle failures. It’s easy to set up test scenarios quickly using Kubernetes, so lean on its capabilities.
4. The persistent volume that isn’t
Kubernetes now has a wide variety of storage providers to help you deploy data on a variety of storage types. That's great for databases, which depend on the right sort of storage to function correctly. But the variety also opens the door to mistakes that can cause severe data loss.
One such problem we’ve seen many times is simply failing to allocate persistent storage in the first place. Kubernetes pods will come up quite happily even if you fail to allocate a PV. Processes in the pod run normally, but you’ll lose data if the pod restarts.
Another issue surrounds the use of local volumes versus network-attached storage. Using a local volume can give you access to high-performance non-volatile memory express (NVMe) solid-state disk (SSD), but exposes you to data loss if the pod can’t be scheduled on the original host. In that case you’ll need to restore data from backup or from another replica. Make a conscious decision based on the trade-offs.
Regardless of how you allocate storage, be sure to test. Ensure that you’ve allocated PVs with the expected type and size. And check pods internally to ensure that the file system mount points match the expected size of the persistent volume. This is another area where a Kubernetes operator can help by allowing you to use a well-tested method to allocate storage.
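Those checks are a couple of commands against a running cluster. These are command fragments, not a script; the release, pod, and mount-path names are illustrative.

```shell
# Confirm the PVC exists, is Bound, and has the expected size and class.
kubectl get pvc data-prod-mysql-0 -o wide

# Check inside the pod that the data directory is actually a mounted
# volume of the expected size, not ephemeral container storage.
kubectl exec prod-mysql-0 -- df -h /var/lib/mysql
```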
5. The fat fingers of fate
There’s one final, disturbing way to lose a lot of data quickly: Utilities like Helm set up databases in seconds—and remove them just as fast. Deleting that MySQL database is as simple as creating it:
helm delete prod-mysql
To keep PVs from disappearing, start by enabling “Retain” as the reclaim policy on your production data. That way your PVs won’t disappear if you accidentally delete your database deployment. But that’s not enough for databases that have many nodes. In that case you’ll also need to add tags to your PVs to help find and reattach them properly when recreating the cluster.
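For dynamically provisioned volumes, the reclaim policy comes from the StorageClass. The sketch below sets it at the class level; the class name is illustrative, and the provisioner assumes the AWS EBS CSI driver.

```yaml
# StorageClass sketch: "Retain" means PVs created from this class
# survive deletion of their PVC, so an accidental `helm delete`
# can't take the underlying data with it.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retain-ssd             # illustrative name
provisioner: ebs.csi.aws.com   # assumes the AWS EBS CSI driver
reclaimPolicy: Retain
```

For PVs that already exist, you can flip the policy in place with `kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'`.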
Such “fat finger” errors are yet another great example of how automation through Kubernetes operators or other deployment tools is essential to protect data. Tooling can build in guardrails against errors and help you recover more easily when problems do occur.
Gain from Kubernetes without the data loss pain
Kubernetes is a great environment for distributed applications. It has a great future managing data. Yet as your databases and other stateful applications move to Kubernetes, you face the real risk of data loss. This stems from causes native to databases as well as issues specific to Kubernetes.
Fortunately, you can manage the problem by applying a thoughtful combination of sound database management, tools such as Kubernetes operators that enforce best practices, and plain old paranoia.
Don’t forget to leverage the strengths of Kubernetes itself to prevent data loss. The foundation is there. The rest is up to you.
Want to know more? During my KubeCon + CloudNativeCon Europe conference session, “Five Great Ways to Lose Data on Kubernetes (And How to Avoid Them),” I’ll go into more depth, with detailed technical examples of how to avoid data loss on Kubernetes.