Artificial intelligence requires trusted data, and a healthy DataOps ecosystem
Lately, we’ve seen many “x-Ops” management practices appear on the scene, all derived from DevOps, which seeks to coordinate the output of developers and operations teams into a smooth, consistent and rapid flow of software releases. Another emerging practice, DataOps, seeks to achieve a similarly smooth, consistent and rapid flow of data through enterprises. Like many things these days, DataOps is spilling over from the large internet companies, which process petabytes and exabytes of information on a daily basis.
Photo: Joe McKendrick
Such an uninhibited data flow is increasingly vital to enterprises seeking to become more data-driven and scale artificial intelligence and machine learning to the point where these technologies can have strategic impact.
Awareness of DataOps is high. A recent survey of 300 companies by 451 Research finds 72 percent have active DataOps efforts underway, and the remaining 28 percent plan to launch such efforts over the coming year. A majority, 86 percent, are increasing their spending on DataOps projects over the next 12 months. Most of this spending will go to analytics, self-service data access, data virtualization, and data preparation efforts.
In the report, 451 Research analyst Matt Aslett defines DataOps as “The alignment of people, processes and technology to enable more agile and automated approaches to data management.”
The catch is “most enterprises are unprepared, often because of behavioral norms — like territorial data hoarding — and because they lag in their technical capabilities — often stuck with cumbersome extract, transform, and load (ETL) and master data management (MDM) systems,” according to Andy Palmer and a team of co-authors in their latest report, Getting DataOps Right, published by O’Reilly. Across most enterprises, data is siloed, disconnected, and generally inaccessible. There is also an abundance of data that is completely undiscovered, of which decision-makers are not even aware.
Here are some of Palmer’s recommendations for building and shaping a well-functioning DataOps ecosystem:
Keep it open: “The ecosystem in DataOps should resemble DevOps ecosystems in which there are many best-of-breed free and open source software and proprietary tools that are expected to interoperate via APIs.” This also includes carefully evaluating and selecting from the raft of tools that have been developed by the large internet companies.
Automate it all: The collection, ingestion, organizing, storage and surfacing of massive amounts of data as close to real time as possible has become almost impossible for humans to manage. Let the machines do it, Palmer urges. Areas ripe for automation include “operations, repeatability, automated testing, and release of data.” Look to the ways DevOps is facilitating the automation of the software build, test, and release process, he points out.
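To make “automated testing … of data” concrete, here is a minimal sketch of a check that could run before a batch of records is released downstream. The field names (`id`, `amount`) and the specific rules are assumptions chosen for illustration; they do not come from Palmer’s report.

```python
def validate(rows):
    """Return a list of problems; an empty list means the batch passes."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Rule 1 (hypothetical): every record needs a unique, non-null id.
        if row.get("id") is None:
            problems.append(f"row {i}: missing id")
        elif row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        else:
            seen_ids.add(row["id"])
        # Rule 2 (hypothetical): amount must be numeric.
        if not isinstance(row.get("amount"), (int, float)):
            problems.append(f"row {i}: amount is not numeric")
    return problems


good = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 3}]
bad = [{"id": 1, "amount": 10.5}, {"id": 1, "amount": "n/a"}]

good_problems = validate(good)
bad_problems = validate(bad)
```

A check like this can be wired into a scheduled pipeline so that a failing batch is held back automatically, much as a failing unit test blocks a software release.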
Process data in both batch and streaming modes: While DataOps is about real-time delivery of data, there’s still a place, and a reason, for batch mode as well. “The success of Kafka and similar design patterns has validated that a healthy next-generation data ecosystem includes the ability to simultaneously process data from source to consumption in both batch and streaming modes,” Palmer points out.
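One common way to support both modes is to keep the transformation logic in a single place and wrap it twice. The sketch below uses plain Python rather than Kafka or any real streaming framework, and the function and field names are hypothetical.

```python
from typing import Iterable, Iterator


def clean(record: dict) -> dict:
    """Hypothetical transformation shared by both processing modes."""
    return {"id": record["id"], "amount": float(record["amount"])}


def run_batch(records: list) -> list:
    """Batch mode: process a complete, bounded dataset in one pass."""
    return [clean(r) for r in records]


def run_stream(source: Iterable[dict]) -> Iterator[dict]:
    """Streaming mode: emit each record as soon as it arrives."""
    for record in source:
        yield clean(record)


raw = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3"}]

batch_result = run_batch(raw)
# iter(raw) stands in for an unbounded source such as a message queue.
stream_result = list(run_stream(iter(raw)))
```

Because both paths call the same `clean` function, the batch and streaming outputs stay consistent with each other.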
Track data lineage: Trust in the data is the single most important element in a data-driven enterprise, which may simply cease to function without it. That’s why well-thought-out data governance and a metadata (data about data) layer are important. “A focus on data lineage and processing tracking across the data ecosystem results in reproducibility going up and confidence in data increasing,” says Palmer.
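As a rough illustration of what lineage metadata can look like, the sketch below records a content hash of the input and output of each processing step. The record layout and the helper names (`fingerprint`, `run_step`) are assumptions made for this example, not a standard or a specific tool’s format.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Content hash of a dataset, so any change to the data is detectable."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, rows, lineage):
    """Apply one processing step and append a lineage record for it."""
    result = func(rows)
    lineage.append({
        "step": name,
        "input_hash": fingerprint(rows),
        "output_hash": fingerprint(result),
        "ran_at": datetime.now(timezone.utc).isoformat(),
    })
    return result


lineage = []
raw = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": -3.0}]

positive = run_step(
    "drop_negative", lambda rs: [r for r in rs if r["amount"] >= 0], raw, lineage
)
doubled = run_step(
    "double_amount",
    lambda rs: [{**r, "amount": r["amount"] * 2} for r in rs],
    positive,
    lineage,
)
```

Each step’s `output_hash` matches the next step’s `input_hash`, so the chain from source to result can be verified and reproduced later.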
Have layered interfaces: Everyone touches data in different ways. “Some power users need to access data in its raw form, whereas others just want to get responses to inquiries that are well formulated,” Palmer says. That’s why a layered set of services and design patterns is required for different user personas. Palmer says there are three approaches to meeting these multilayered requirements:
“Data access services that are ‘View’ abstractions over the data and are essentially SQL or SQL-like interfaces. This is the power-user level that data scientists prefer.
“Messaging services that provide the foundation for stateful data interchange, event processing, and data interchange orchestration.
“REST services built on or wrapped around APIs providing the ultimate flexible direct access to and interchange of data.”
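The first of these layers, a “View” abstraction over raw data, can be sketched with Python’s built-in `sqlite3` module. The table, view, and column names are hypothetical, and the messaging and REST layers are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw layer: the unprocessed records that power users may query directly.
conn.execute("CREATE TABLE raw_orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "east", 5.0), (3, "west", 7.5)],
)

# View layer: a well-formulated answer for users who only need totals.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM raw_orders GROUP BY region"
)

raw_rows = conn.execute("SELECT * FROM raw_orders ORDER BY id").fetchall()
summary = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```

Both audiences read from the same underlying table, so the summary never drifts from the raw data it is derived from.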
Business leaders are increasingly leaning on their technology leaders and teams to transform their organizations into data-driven digital entities that can react to events and opportunities almost instantaneously. The best way to accomplish this, especially with the meager budgets and limited support that tend to accompany this mandate, is to align the way data flows from source to storage.