The Basics of CI/CD for Data Science and Machine Learning
Continuous integration and continuous deployment are IT practices that encourage testing code often. Learn how these practices also shape data-driven initiatives.
The basics of how machine learning and data science should work often feel less than basic. Machine learning practitioners, from programmers to scientists, are learning how to apply advanced statistics and mathematics within the context of software programming. The result is complexity: selecting a good machine learning model can conflict with the constraints management faces, be they objectives, deadlines, or limited resources for executing a decision based on the model.
Fortunately, a pair of developer practices, continuous integration and continuous deployment (CI/CD), are giving managers ways to lead machine learning and data science initiatives early in the development process, making truly beneficial model-based decisions possible.
Let’s look at the definition of CI/CD to understand how the paired processes impact machine learning.
Continuous integration is a practice that ensures that code and any related resources are checked into a shared repository at regular intervals. These check-ins are then verified by automated builds, which helps highlight problems early in the development cycle.
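As a minimal sketch of what "verified by automated builds" can mean for a data project, the Python below pairs a small transformation step with the checks a build could run on every check-in. The `normalize` function and the `run_checks` helper are hypothetical names invented for illustration; a real project would more likely use a test framework and its CI service's own configuration.

```python
def normalize(values):
    """Scale a list of numbers to the 0-1 range (hypothetical pipeline step)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: avoid dividing by zero and return all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def run_checks():
    """Return a list of failure messages; an empty list means the build passes."""
    failures = []
    if normalize([10, 20, 30]) != [0.0, 0.5, 1.0]:
        failures.append("normalize: unexpected scaling")
    if normalize([5, 5, 5]) != [0.0, 0.0, 0.0]:
        failures.append("normalize: constant input not handled")
    return failures


if __name__ == "__main__":
    problems = run_checks()
    for message in problems:
        print(message)
    # A non-zero exit code is how most build systems detect a failed check.
    raise SystemExit(1 if problems else 0)
```

Run on every check-in, a script like this surfaces a broken transformation before it reaches the data a model is trained on.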
Continuous deployment is a practice in which software updates are automatically built, tested, and made ready for release. With developers and database teams working collaboratively and in parallel, continuous deployment paves the way for stable and consistent versions of software.
CI/CD is valuable because today’s business strategies rely on how ongoing software management affects the development of products and services. The agility needed to deliver functional software has pushed software design toward microservice architectures. Microservices are a set of development techniques that arrange an application as a collection of loosely coupled services. Maintaining microservices permits software releases to be deployed frequently, even multiple times a day, without interrupting other parts of the software. The advantage to a business model is the ability to provide seamless updates.
The seamless updates of microservices can also complement data-related changes, such as software updates that bring an application and its associated data into line with privacy compliance requirements. This update capability allows data science and machine learning processes to be incorporated into CI/CD phases at the right time.
As a consequence, CI/CD-influenced projects have the opportunity to minimize technical debt, the tendency to overfocus on code syntax without considering the long-term consequences for program maintenance and the impact on the business model. For example, a team could develop an app but not examine the steps needed to update the environment in which the app operates. Technical debt is the enemy of organizations that have multiple deployment environments (e.g., development, testing, production). It is also the enemy of data-driven initiatives, since data deployment environments raise concerns similar to those in software development, such as API documentation (in this case, for data resources) and differing data types. Getting an overview of the data mining and transformation steps needed can become complex very quickly.
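One way to keep the data-type concern above visible is to check incoming records against an agreed schema as part of the automated build. The sketch below is illustrative only: the field names in `EXPECTED_SCHEMA` and the `schema_errors` helper are assumptions made for this example, not part of any particular tool.

```python
# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "customer_id": int,
    "signup_date": str,
    "monthly_spend": float,
}


def schema_errors(record, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable problems found in one record."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            errors.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    return errors
```

A record that arrives with `customer_id` as text, for instance, is reported immediately instead of surfacing later as a puzzling model result.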
So where within a development process can managers contribute to CI/CD to help simplify the complexity? One great opportunity lies in evaluating test processes such as user acceptance testing (UAT), a test phase that evaluates user needs, business requirements, and software functionality. Managers can help the test team set the evaluation parameters for business requirements, leading to a robust methodology for continuously improving those parameters. A project manager is usually assigned to work with developers on this effort. UAT can be effective in reducing development time and expenses, while CI/CD can show those who manage the data how a model’s output could affect the customer experience with a service or product.
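The evaluation parameters described above can be written down in a form that a build can check automatically. The following Python is a minimal sketch under stated assumptions: the metric names, the threshold values, and the `Criterion` and `evaluate` names are all hypothetical, standing in for whatever a manager and test team actually agree on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One business requirement expressed as a measurable threshold."""
    name: str
    threshold: float
    higher_is_better: bool = True


# Hypothetical acceptance criteria agreed between managers and the test team.
CRITERIA = [
    Criterion("precision", 0.90),
    Criterion("latency_ms", 200.0, higher_is_better=False),
]


def evaluate(metrics, criteria=CRITERIA):
    """Return the names of the criteria that a candidate model fails."""
    failed = []
    for criterion in criteria:
        value = metrics.get(criterion.name)
        if value is None:
            # A metric that was never measured counts as a failure.
            failed.append(criterion.name)
            continue
        if criterion.higher_is_better:
            passed = value >= criterion.threshold
        else:
            passed = value <= criterion.threshold
        if not passed:
            failed.append(criterion.name)
    return failed
```

Keeping the thresholds in one reviewed place gives managers a concrete artifact to revise as business requirements change, and gives the pipeline a clear pass or fail signal.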
Experts indicate that other opportunities for managers to apply CI/CD practices are emerging. Ben Lorica, chief data scientist at O’Reilly Media, noted in his O’Reilly Strata conference keynote that tools specialized for machine learning will layer onto existing analytics. The trend will allow teams to expand their capabilities incrementally and experiment with other architectures. Recent announcements by Microsoft Azure, Amazon Web Services, and Google, for example, emphasize faster model training, better workflow management, and greater security for project deployment.
Evaluating the programming languages used for those projects can aid in selecting complementary IDEs and identifying recurring needs among teams. If a team used R to develop models, for example, a version control system would be needed to keep packages and dependencies updated and to maintain a documented history of the changes that drive decisions among the responsible teams.
All of these considerations can enhance how well a CI/CD workflow complements the time machine learning algorithms take to train on the data and return results for inspection.
Turning data into a valuable business decision is not simple. But as data transformations increasingly occur in applications and software-managed devices, managers are experimenting with software management techniques like CI/CD to keep complex machine learning models in step with good data management basics.