Cloudera gets data science working out of the box

David Reed, director of research and editor-in-chief, DataIQ

“There is confusion around the term data science - it’s understood to mean different things.” According to Sean Owen, director of data science, Cloudera, part of the problem is the technology being used, especially the gap between the environment in which experimental models get built and where they will ultimately be deployed. As he told DataIQ in an interview, “it ranges from people who are software engineers through to PhDs who are great at modelling, but are not good at software at all.”

The result is a procedural barrier between the innovation team and their customers. “What typically happens is that analysts create models, then hand them over in a Word document to be implemented in the production environment. That works, but it should be more joined up,” explained Owen.

Cloudera has just put on general release a solution which it hopes will become the industry-standard bridge between those worlds. Data Science Workbench is a self-service tool for data scientists which allows them to use Python, R and Scala directly in the web browser, as well as providing access to libraries and frameworks, all within a Hadoop environment that is customisable and also compliant.

A key dimension is the workbench’s ability to leverage deep learning libraries on CPU architecture without any additional hardware considerations or separate environments. It allows data science pipelines to be created natively in Spark and integrated with deep learning libraries, such as BigDL, and other Spark/Hadoop components.

“Everything data scientists have to do around scale or machine learning, like deep learning data, they don’t want to run on a laptop or workstation which doesn’t have enough compute power. This is a big step for Hadoop which has not had a natively-integrated web-based environment to do that in,” said Owen. “Scale and production are something Hadoop can now offer to data scientists.”

Because it allows users to develop models using tools and techniques that are familiar, but without the limitations which isolated tech stacks have previously imposed, the open source analytics vendor hopes data science will expand its adoption base beyond industries and functions where it is already mature, such as risk and underwriting within insurance or in banking.

Data scientists needed somewhere to run large-scale projects - a “there” as Owen describes it - which would not restrict the scale of experiments. But, as he noted, “there was no ‘there’ there.” The workbench now allows them just to log in and start building.

An integrated solution that works “out of the box” fixes a problem for many practitioners within the Hadoop ecosystem. “They are not restricted to running just tests. Now they can write, edit and deploy code.” But Owen stresses that the toolkit is intended for exploratory analytics without scale limitations, not as a unified environment for development and production.

Owen himself is a software engineer who came to Hadoop as a developer framework, moved on to machine learning, and then recognised the problems of integrating data science tools with traditional analytical software like SAS and R. “Even a couple of years ago Hadoop was not a friendly place for conventional statisticians,” he said.

In his current role, he faces out towards clients, rather than being involved with software development, helping them to build effective experiments and overcome technology hurdles. That is not a natural space for a practitioner to find themselves in, he admits. “It can be challenging for data scientists because the analysts in those organisations are domain experts. We are experts with these tools and have a bit of domain knowledge which helps us to get them on the right track.”