Michael Berthold on End-to-End Data Science Using KNIME Analytics Platform

Source: infoq.com

Michael Berthold, CEO and co-founder of the open source data analytics platform KNIME, gave the keynote presentation at the KNIME Fall Summit 2019 conference. He spoke about the end-to-end data science cycle, which is divided into two categories: Create and Productionize. The Create category includes the “Gather & Wrangle” and “Model & Visualize” phases, whereas the Productionize category consists of the “Deploy & Manage” and “Consume & Optimize” phases.

In the Gather & Wrangle phase, KNIME software provides connectors, transformation nodes, and big data extensions for integrating technologies such as Amazon Redshift, H2, Hive, Impala, and Apache Spark into data science solutions. KNIME now also supports extended cloud file system connectivity.

The Model & Visualize phase includes integrations of libraries and tools for data pipeline management. KNIME supports application-specific libraries for use cases in life science, text processing, and image and time-series analysis. There are also specialized implementations of data science frameworks like H2O and XGBoost. Users can also write custom code, if they need to, in languages like R and Python.

The Deploy & Manage phase focuses on deploying data science models to end users and servers, in the form of analytics applications and web services. Users can also use the KNIME Model Process Factory to retrain models on demand. The factory supports automated model initialization as well as monitoring and alerting capabilities.

KNIME also provides production support and governance for data science applications and services, covering versioning, backwards compatibility, compliance, and best practices.

Finally, the Consume & Optimize phase supports the consumption of ML workflows through direct deployment, either as a workflow exposed as a service or as an analytics application.

Berthold also talked about three levels of data science automation. In first-generation automation, clean data comes into the data pipeline, some pre-canned interactions are applied, and a model comes out of the process. Second-generation automation adds customized interactions in the Model & Visualize or Deploy & Manage phases, along with overall interactive use and continuous deployment. The third generation mixes and matches the earlier approaches: it provides automated solutions for the Gather & Wrangle and Consume & Optimize phases, while the Model & Visualize steps include customized interactions and the Deploy & Manage steps include guided interaction.
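
The third-generation mix described above can be written down as a simple mapping from phase to interaction style. The Python sketch below is purely illustrative: the `Interaction` enum, the `THIRD_GENERATION` dictionary, and the `phases_needing_people` helper are hypothetical names chosen for this example and are not part of any KNIME API.

```python
from enum import Enum


class Interaction(Enum):
    """Interaction styles mentioned in the talk summary (names are hypothetical)."""
    AUTOMATED = "automated"
    GUIDED = "guided"
    CUSTOMIZED = "customized"


# Third-generation automation as summarized above: phase -> interaction style.
THIRD_GENERATION = {
    "Gather & Wrangle": Interaction.AUTOMATED,
    "Model & Visualize": Interaction.CUSTOMIZED,
    "Deploy & Manage": Interaction.GUIDED,
    "Consume & Optimize": Interaction.AUTOMATED,
}


def phases_needing_people(plan):
    """Return the phases where a person still interacts (guided or customized)."""
    return [phase for phase, mode in plan.items()
            if mode is not Interaction.AUTOMATED]


if __name__ == "__main__":
    for phase in phases_needing_people(THIRD_GENERATION):
        print(phase, "->", THIRD_GENERATION[phase].value)
```

Read this way, only the two middle phases of the cycle still involve a person, which matches the summary's description of the third generation.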

Berthold concluded the presentation with a discussion on data science abstraction using components for automated or guided interaction for ML pipeline phases like feature selection & engineering, interpretation, model management and most importantly, model deployment to production.

Data science adoption should be determined based on the business requirements and the needs of the users. This adoption can fall into one or more of the following four categories:

Standard problems without business impact where simple data science tools can do the job.
Applications which require no in-house domain or data science expertise can be solved using simple ML models.
Productionizing data science is critical for value creation, which requires the continuous incorporation of new technologies and user feedback.
Data science innovation deeply embedded in the business, which involves exploring the latest ML/AI trends and injecting the corporation’s own R&D into data science.

KNIME also has a new product called Cloud Cognitive Services. If you would like to learn more about the tools, check out KNIME Analytics Platform, KNIME Hub, and KNIME Server, a commercial product that includes KNIME Web Portal and Data Science as a Service.