Introducing Dataiku’s DSS on Microsoft Azure HDInsight to make data science easier

This post introduces Dataiku's Data Science Studio on Azure HDInsight to make data science easier.

We are pleased to announce the expansion of the HDInsight Application Platform to include Dataiku.

Azure HDInsight is a fully managed cloud Apache Hadoop and Spark offering that lets customers run reliable open source analytics with an industry-leading SLA. Dataiku develops Data Science Studio (DSS), a collaborative data science platform that enables companies to build and deliver their analytical solutions more efficiently.

This combined offering of DSS on HDInsight enables customers to easily use data science to build big data solutions and run them at enterprise grade and scale.

Microsoft Azure HDInsight – Reliable Open Source Analytics at Enterprise grade & scale

HDInsight is the only fully managed cloud Hadoop offering that provides optimized open source analytical clusters for Spark, Hive, Interactive Hive, MapReduce, HBase, Storm, Kafka, and R Server, backed by a 99.9% SLA. Each of these big data technologies is easily deployable as a managed cluster with enterprise-level security and monitoring.

The ecosystem of big data applications has grown with the goal of helping customers solve their big data and analytical problems faster. Today, customers often find it challenging to discover these productivity applications, and then struggle to install and configure them.

To address this gap, the HDInsight Application Platform provides an experience unique to HDInsight: Independent Software Vendors (ISVs) can offer their applications directly to customers, and customers can easily discover, install, and use these applications built for the big data ecosystem.
As part of this integration, Dataiku is bringing DSS to make collaborative data science much easier.

Dataiku Data Science Studio (DSS) – Prototype, deploy and run at scale

Dataiku provides Data Science Studio, a collaborative data science platform that enables professionals (data scientists, data engineers, and others) to work together on building analytical solutions. DSS has an easy-to-use, team-based interface for data scientists and beginner analysts. A user can use DSS to implement a complete analytical solution, from data ingestion (all data types, sizes, and formats) through data preparation, data processing, training and applying machine learning models, and visualization, to operationalizing the solution.

DSS on HDInsight – Data science at enterprise grade & scale

A customer can install DSS on HDInsight Hadoop or Spark clusters, either on existing clusters that are already running or while creating new ones. DSS 4.0 also added support for Azure Blob Storage as a connector for reading data.
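To illustrate how data in Azure Blob Storage is commonly addressed from an HDInsight cluster, the sketch below builds a Hadoop-style wasbs:// URI. The helper name wasb_uri and the container, account, and path values are hypothetical. This shows only the general addressing scheme used by Hadoop and Spark on HDInsight, not the configuration of the DSS connector itself, which this post does not describe.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build a wasb[s]:// URI for a blob path in an Azure Storage account.

    Assumption: the standard Hadoop addressing form
    wasb[s]://<container>@<account>.blob.core.windows.net/<path>.
    """
    scheme = "wasbs" if secure else "wasb"
    # Strip leading slashes so the URI has exactly one slash after the host.
    clean_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


# Hypothetical container, account, and path names, for illustration only.
example_uri = wasb_uri("datasets", "examplestorage", "/raw/events.csv")
print(example_uri)
```

A URI of this form is what a Spark or Hive job on the cluster would typically be given as an input location when the data lives in Blob Storage.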

When DSS is installed on HDInsight, users get the benefits of Hadoop or Spark on HDInsight. Users build projects in DSS, and those projects can generate MapReduce or Spark jobs, which makes DSS a great complement to your HDInsight cluster. These jobs are executed as regular MapReduce or Spark jobs, so they get all the benefits of running on an enterprise-grade platform. Because the jobs run on HDInsight, customers can scale the cluster on demand and therefore run DSS at scale.
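For readers less familiar with the MapReduce model mentioned above, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using a word count as the example. It is not the code DSS generates (this post does not show that), and the function names and sample data are hypothetical. On a cluster, each phase would be distributed across worker nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on local data."""
    return reduce_phase(shuffle(map_phase(lines)))


# Hypothetical sample input, for illustration only.
sample = ["spark on hdinsight", "hadoop on hdinsight"]
counts = word_count(sample)
print(counts)
```

Because each phase depends only on its input pairs, the same logic can be spread over more nodes as the cluster is scaled up.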

Getting started with DSS on HDInsight

Here is a quick walkthrough of installing and getting started with DSS on HDInsight. The following screenshot shows a Spark cluster in the Microsoft Azure portal. A user can click the Applications tile to see the list of applications installed.

HDI Application

A user can select DSS, agree to the terms, and install it, which is the simplicity of a one-click deployment experience. DSS is then installed on the edge node, which is part of the cluster.

After DSS is installed, a customer can launch it using the “WEBPAGE” link, which opens the DSS product. A user must first authenticate with the cluster user credentials and can then log in with their DSS credentials.

Launch DSS

The following screenshot shows what a typical data science project’s landing page would look like in DSS. This shows both the summary of the project, as well as the timeline of the changes made to the project.

DSS Project

Resources

Following are some resources for learning more about this integration, along with tutorials and videos.

Summary

We are pleased to announce the expansion of the HDInsight Application Platform to include Dataiku’s Data Science Studio. By deploying DSS on HDInsight, customers can easily build analytical solutions and run them at enterprise grade and scale.