Efficient MLOps in a Kubernetes Environment
MLOps addresses the specific needs of data science and ML engineering teams without disrupting your Kubernetes operations.
If your organization has already started getting into machine learning, you will certainly relate to the following. If your organization is taking its first steps into data science, the following will illustrate what is about to be dropped on you.
If none of the above strikes a chord, this might interest you nonetheless, because AI is the new frontier and it won’t be long until you need to address these challenges, too.
Data scientists are … well … scientists. If you ask them, their focus should be on developing the science: the neural networks and the models behind the AI predictors they are tasked to build. They all have their preferred ways of working, and they may need special environments for their tasks. Of course, your colleagues the data scientists would love to develop their code on their laptops. Unfortunately for them, this is not possible. Their machines either lack memory or storage, or they need a GPU, or several GPUs, for added horsepower. After all, AI workloads can be humongous compute hogs. Thus, they have to work on remote machines. Their code will run for several hours to several days, and since it takes that long, they obviously want to try different configurations, parameters, and so on.
Here comes DevOps to save the day! Allocate a Kubernetes cluster for them and let them run their containers.
If only it were that easy …
The crux lies in the fact that AI development and deployment are not like standard software. Before delving into the differences, let’s remember that K8s clusters were built for production. As such, there are several big no-nos that we are all familiar with. Specifically, on a K8s cluster we should never:
Place all pods in the same namespace.
Manually manage too many different types of pods and workloads.
Assume we can control the order of the K8s scheduler.
Leave lots of stale (used once) containers hanging around.
Allow node auto-scaling to kick in automatically based on pending jobs.
Basically, what I’m trying to say is that the first rule of DevOps is: do not allow R&D teams to access your K8s cluster. This is your cluster (read: production cluster), not theirs.
However, with deep learning (DL) and sometimes with machine learning (ML), heavy workloads must run from relatively early in the development process, and that need continues from there. To top that, unlike traditional software, wherein a tested and usually stable version is deployed and replicated on the K8s cluster, in ML/DL the need is to run multiple different experiments, sometimes in the hundreds and thousands, concurrently. Those are, by definition, not production-grade, tested, stable pieces of software.
In other words, unless we want to constantly set up permissions and resources for the continuously changing needs of the data science teams, we have to provide an interface for them to use the resources we allocated for them.
Finally, since we are the experts on K8s and orchestration in general, the data science team will immediately come to us to support them in Dockerizing their code and environment.
When things are simple, everything works, but things get out of hand very quickly. Because ML/DL code is unstable and packaging sometimes takes time, we will need to make it part of the CI. This means that the data science team will have to maintain requirements/YAML files. As these experiments are mostly ephemeral, we will end up building lots and lots of Docker images that will be used only once. It is not uncommon to see clusters with tens of thousands of Docker images or more on them.
The long and short of this is that we need someone or something that easily Dockerizes the data science team’s endless environment setups. Continuously.
Let’s decompose the requirements list into its different ingredients:
For DevOps, resource access means permissions/security, reliability, location, etc.
For a data scientist, resource access means which resource type to use and its availability. That’s it. Even a resource type that could be defined at low granularity is usually overkill from the development team’s perspective. What they care about is whether it is a CPU or a GPU machine and the number of cores. Three- or four-level settings for these resources (e.g. CPU, 1xGPU, 2xGPU, 8xGPU) would probably be enough for most AI teams.
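A coarse tier scheme like this can be mapped onto Kubernetes resource requests behind the scenes, so data scientists pick a tier name and DevOps controls what it actually means. Here is a minimal sketch; the tier names and quantities are illustrative assumptions, not values from any real product (the `nvidia.com/gpu` key is the standard Kubernetes extended resource for NVIDIA GPUs).

```python
# Illustrative mapping from coarse tiers to Kubernetes resource requests.
# All numbers here are made-up examples; tune them to your own hardware.
RESOURCE_TIERS = {
    "cpu":   {"cpu": "8",  "memory": "32Gi"},
    "1xgpu": {"cpu": "8",  "memory": "64Gi",  "nvidia.com/gpu": "1"},
    "2xgpu": {"cpu": "16", "memory": "128Gi", "nvidia.com/gpu": "2"},
    "8xgpu": {"cpu": "64", "memory": "512Gi", "nvidia.com/gpu": "8"},
}

def pod_resources(tier: str) -> dict:
    """Return the `resources` block for a pod spec, given a tier name."""
    if tier not in RESOURCE_TIERS:
        raise ValueError(f"unknown resource tier: {tier!r}")
    return {"requests": RESOURCE_TIERS[tier]}
```

The data scientist only ever sees the four tier names; the translation into requests, node selectors, and quotas stays a DevOps concern.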
Environment packaging is usually thought of as containerizing your codebase. This is a reasonable assumption coming from a DevOps perspective.
For AI data science teams, maintaining a Dockerfile, a requirements.txt and updated Conda YAMLs is possible. However, it’s a distraction from their core work, it takes time, and it is easy to leave behind old setups because they are not used in the coding environment itself. The development environment is constantly changing, so the key is to extract the information without the need to manually keep updating it. Easily replicating environments from local machines to remote execution pods is also needed.
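To make "extract the information automatically" concrete, here is one way it could work, sketched with only the Python standard library: snapshot the packages installed in the data scientist’s current interpreter at submission time, instead of trusting a hand-maintained requirements file. The helper name is a hypothetical placeholder, not a real MLOps API.

```python
# A sketch of automatic environment capture: record the installed packages
# of the running interpreter so a remote pod can replicate them, rather than
# hand-maintaining a requirements.txt that drifts out of date.
from importlib import metadata

def capture_requirements() -> list:
    """Snapshot installed distributions as pinned 'name==version' lines."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )
```

A submission tool could write this snapshot into the job spec or into a generated Dockerfile, so the remote container matches the laptop without anyone editing a file by hand.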
For DevOps, monitoring usually means hardware monitoring, CPU usage, RAM usage, etc.
For data science teams, monitoring is about model performance, speed, accuracy, etc. AI teams need to be able to monitor their applications (processes/experiments, whatever we call them) with their own metrics and with an easy-to-use interface.
Unfortunately, no standard exists for this kind of monitoring, and oftentimes adding more use case-specific metrics gives the data science team a huge advantage in terms of understanding the black box. This, too, is a constantly changing environment that needs to allow for customization by the data scientists.
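Since no standard exists, the interface can be as simple as letting each experiment report arbitrary named metric series. The sketch below shows the shape of such an interface; the class and method names are illustrative, not from any particular MLOps product.

```python
# A minimal sketch of an experiment-metrics interface: the data scientist
# reports whatever metrics matter to them (accuracy, loss, throughput),
# and the platform stores them per experiment without knowing their meaning.
from collections import defaultdict

class ExperimentLogger:
    def __init__(self, experiment_id: str):
        self.experiment_id = experiment_id
        self.series = defaultdict(list)      # metric name -> [(step, value)]

    def report(self, metric: str, step: int, value: float) -> None:
        self.series[metric].append((step, value))

    def last(self, metric: str) -> float:
        """Most recent value of a metric, e.g. for dashboards or alerts."""
        return self.series[metric][-1][1]

log = ExperimentLogger("exp-042")
log.report("accuracy", step=1, value=0.71)
log.report("accuracy", step=2, value=0.78)
```

Because the metric names are free-form, the data scientists can add use case-specific metrics whenever they need them, with no DevOps involvement.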
For DevOps, K8s job scheduling is akin to resource allocation: a job needs resources that it will use for an unlimited amount of time, and if we do not have enough resources, we need to scale.
For data science teams, job scheduling is actually an HPC challenge. Almost by definition, there will never be enough resources to execute all the jobs (read: experiments) at the same time. On the other hand, jobs are short-lived (at least relative to the lifespan of servers), so the question becomes which job to execute first, and on which set of resources.
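The "which job first" question can be sketched as a simple priority queue ordered by priority and submission order. Real MLOps schedulers layer fairness, quotas, and resource matching on top, so treat this only as an illustration of the queueing model; the job names are made up.

```python
# A sketch of HPC-style job ordering: pending experiments wait in a queue,
# and the next one to run is chosen by (priority, submission order).
import heapq
import itertools

class JobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()    # FIFO tie-breaker within a priority

    def submit(self, name: str, priority: int = 10) -> None:
        # Lower number means higher priority.
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_job(self) -> str:
        _, _, name = heapq.heappop(self._heap)
        return name

q = JobQueue()
q.submit("sweep-lr-0.001")
q.submit("urgent-retrain", priority=1)
```

Here `urgent-retrain` jumps ahead of the earlier hyperparameter sweep, which is exactly the control data scientists want over their own queue.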
MLOps to the Rescue
Kubernetes is a great tool for DevOps to manage the organization’s hardware clusters, but it is the wrong tool to manage AI workloads for data science and ML engineering teams.
So, is there a better way?
To address the specific needs of data science and ML engineering teams, a new breed of tools has been designed: MLOps or AIOps.
MLOps is an ideal tool for the data science team; in a K8s-centric organization it will interface with or run on top of K8s. From a K8s perspective, it should be just another service to spin up, in multiple identical copies with a replicated setup on different resources, just like we would do with any other application running on our K8s.
These tools offer the data science team and ML engineers a very different interface to the K8s nodes. Ideally, the AI team should see their allocated hardware resources as nodes on their own “cluster,” where they can launch their jobs (experiments) directly into a dedicated set of queues without interfering with any of the K8s schedulers.
The MLOps solution provides the glue between the queues managed by the data science team and K8s: resource allocation provisioned by Kubernetes and job allocation and execution performed by the dedicated MLOps solution.
This solution should be able to pick a job from the dedicated ML/DL team queue (with priorities, job ordering and a few more things), then set up the environment for the job, either inside the container or as a sibling container, and monitor the job (including terminating it if the need arises).
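The pick–execute–monitor loop just described can be sketched in a few lines. This is a bare-bones illustration, not a real agent: the queue object and the job’s `cmd` field are hypothetical placeholders, and a real agent would also handle environment setup, logging, and retries.

```python
# A sketch of the agent loop: pop the next job from the team's queue,
# launch its command, watch it, and terminate it if it overruns a deadline.
import subprocess
import time

def run_next(queue, max_seconds: float = 3600.0) -> int:
    """Execute one queued job; return its exit code, or -1 if it timed out."""
    job = queue.pop()                        # e.g. {"cmd": ["python", ...]}
    proc = subprocess.Popen(job["cmd"])
    deadline = time.monotonic() + max_seconds
    while proc.poll() is None:
        if time.monotonic() > deadline:
            proc.terminate()                 # kill runaway experiments
            proc.wait()
            return -1
        time.sleep(0.2)
    return proc.returncode
```

Everything above the `subprocess` call belongs to the data science team’s world (queues, priorities); everything below it is the kind of supervision DevOps would otherwise be paged for.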
The DevOps team is only needed when new hardware resources need to be allocated or resource allocation needs to be changed. For everything else, the users (read: AI team) self-service through the MLOps solution, purposely built for their needs, while letting the DevOps team manage the entire operation.
Doesn’t that sound better than being on call 24/7 for provisioning and then having to clean everything up afterward?