Avoid Big Data pitfalls with Azure HDInsight and these partner solutions

According to a Gartner 2017 prediction, “60 percent of big data projects will fail to go beyond piloting and experimentation, these projects will be abandoned”.

Whether you have worked on an analytical project or are starting one, it is a challenge on any cloud. You need to juggle the intricacies of cloud provider services, open source frameworks and the apps in the ecosystem. Apache Hadoop and Spark are vibrant open source ecosystems that have enabled enterprises to digitally transform their businesses using data. According to Matt Turck, VC at FirstMark, it has been an exciting but complex year in the data world. “The data tech ecosystem has continued to fire on all cylinders. If nothing else, data is probably even more front and center in 2018, in both business and personal conversations”.

However, with great power comes greater responsibility from the ecosystem. A successful project takes a lot more than just using open source or a managed platform. You have to deal with:

  • The complexity of combining all the open source frameworks.
  • Architecting a data lake to get insights for data engineers, data scientists and BI users.
  • Meeting enterprise requirements for security, access control, data sovereignty and governance.
  • Handling business continuity and disaster recovery.
  • Choosing and vetting apps from the ecosystem and running this entire solution reliably.

As you can imagine, this is a daunting task. Projects often fail at deployment time, customers are unable to get insights from deployed systems due to lack of expertise, or the projects don’t meet the requirements of enterprise IT and are shut down.

So how do you escape this Gartner 2017 statistic?

Power of Azure, Open Source and partners

The big data and analytical application lifecycle spans a number of steps, including ingestion, prep, storage, processing, analysis and visualization. All of these steps need to meet enterprise requirements around governance, access control, monitoring, security and more. Stitching together an application that covers all of this is a complicated task. As explained above, this complexity is the biggest reason big data applications fail.

With Azure HDInsight Application Platform we have solved this problem by working closely with a selected set of ISVs to certify their solutions with Azure HDInsight and other analytical services, so customers can deploy them with a single click. Applications are deployed natively with the cluster, so they can meet the existing enterprise setup around network security and access control policies. You can try these applications before you decide to purchase them, which gives you the flexibility to evaluate partners before signing up for a deeper engagement.

Microsoft has always cared deeply about partners and customers, and this platform natively integrates partner solutions with Microsoft Azure foundational services such as monitoring, security, authentication and encryption. This allows customers to seamlessly use the entire spectrum of Azure services, open source frameworks and partner solutions to get insights from data faster.

Peter Scott, SVP of Business Development at WANdisco, describes the value: “Azure HDInsight Application is an excellent choice for partners to get the highest ROI for their analytics investment. The one-click deploy experience has made our product more discoverable, and removes the guesswork and friction around discovering, installing and integrating with existing enterprise environments. Combine this with WANdisco Fusion and a LiveData capability, customers are perfectly positioned to realize data resiliency and guaranteed data accessibility on their hybrid cloud journey”.

Thus, you get the best of all three worlds: the strength of the Azure platform, the flexibility of a managed OSS platform with Azure HDInsight, and the best of our partner ecosystem. Together they make BI professionals, data stewards and engineers more productive while providing the governance and protection demanded by your IT administrators.

Let’s dive into each of these challenging areas and see how partners are helping solve them.

Pick a task, pick a partner

Make ingesting data easier

Any analytical solution starts with ingesting data from various sources. Big Data is all about volume, variety and velocity. As streaming and batch data change, maintaining SLAs around data quality is a challenge.

StreamSets provides a full-featured integrated development environment (IDE) that lets you design, test, deploy, and manage any-to-any ingest pipelines that mesh stream and batch data and include a variety of in-stream transformations, all without having to write custom code. Try StreamSets on HDInsight.


The Striim™ platform makes it easy to integrate, analyze, and visualize streaming data across cloud, Big Data, and IoT devices, helping you make smart and timely operational decisions. Try Striim on HDInsight.

Simplify data prep

Extract-transform-load (ETL) processes are time consuming. The challenge with data prep lies in combining data from various sources, exploring the quality of the data, merging schemas, removing bad data and so on. Due to lack of expertise in open source frameworks, customers end up spending time writing lots of scripts for ETL. Here’s how partners help simplify data prep.


Paxata's Adaptive Information Platform enables any user to gain insights from their data faster. Business users can combine unstructured and structured data from various sources, prepare data and analyze the data. Try Paxata on HDInsight.


Trifacta is a data wrangling solution for big data allowing you to easily transform and enrich raw, complex data into clean and structured formats for the purpose of exploratory analytics. Try Trifacta on HDInsight.

Use analytics & AI to transform your business

Next in the big data application lifecycle, a data scientist can apply machine learning or deep learning. This involves collaborating in a team and using the industry’s most popular open source libraries. Due to the complexity of the ecosystem, installing and configuring the available toolsets is challenging for novice users.


Dataiku provides Data Science Studio (DSS), a collaborative data science platform that enables professionals (data scientists, data engineers, etc.) to work together on building analytical solutions. DSS has an easy-to-use, team-based interface for data scientists and beginner analysts. Try Dataiku on HDInsight.


H2O's AI platform is an open source machine learning platform that works with Spark 2.0+, sparklyr, and PySpark. H2O Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. Try H2O.ai on HDInsight.


KNIME Analytics Platform is the leading open solution for data-driven innovation, designed for discovering the potential hidden in data, mining for fresh insights, or predicting new futures. Organizations can take their collaboration, productivity and performance to the next level with a robust range of commercial extensions to its open source platform. Try KNIME on HDInsight.

Serve up new business insights

BI over a data lake is hard because traditional tools don’t work with unstructured or streaming data. To complete the BI journey, customers must move the data from the data lake to a relational store. For Big Data customers who have petabytes of data, operationalizing this data movement is challenging.

AtScale is a BI on Hadoop software solution. It allows business users to enjoy multi-dimensional and ad-hoc analysis capabilities on Hadoop data without any data movement or client drivers, at OLAP speed and directly from standard BI tools like Microsoft Excel, Power BI and Tableau. Try AtScale on HDInsight.

Kyligence’s flagship product, Kyligence Enterprise, powered by Apache Kylin, brings instant insights on massive datasets to business users and data analysts. With cutting-edge machine learning technology and intelligent data modeling functionality, it greatly improves the productivity of big data analytics. Try Kyligence on HDInsight.

Stay safe with robust data governance

Data cataloguing and governance is a key ask from enterprises. It allows them to define access control, monitor patterns for data security, and let users easily discover data sets across the organization. However, building such a system is challenging because of data silos across an organization.


Waterline Data provides a data catalog solution that both enables a governed data lake and provides an automated data catalog as a shared service across multiple data prep, data discovery, and exploratory analytics tools. As a result, Waterline Data gives the business the agility and speed to find the best-suited data quickly without manual exploration and coding, staying above the waterline of the data lake, while enabling IT to provide a governed data layer that stays ahead of the need for self-service access to data.

Tune performance for best ROI

The inherent complexity of big data systems, disparate set of tools for monitoring, and lack of expertise in optimizing these open source frameworks create significant challenges for end-users who are responsible for guaranteeing SLAs.


Unravel provides comprehensive application performance management (APM). The application helps customers analyze, optimize, and troubleshoot application performance issues and meet SLAs in a seamless, easy-to-use, and frictionless manner. Some customers report up to 200 percent more jobs at 50 percent lower cost using Unravel’s tuning capability on HDInsight. Getting started guide with Unravel on HDInsight.

Hybrid Big Data and globally replicated Data Lake

Connecting on-premises big data applications to the cloud has always been a challenge. Customers have to think about constantly changing data and metadata (schemas of tables in Hive, authorization policies in Ranger or Sentry, etc.) for their applications. Connecting on-premises systems to the cloud without any downtime is challenging and increases the time to market for customers.


WANdisco provides live replication of selected data and metadata at scale between multiple Big Data and cloud environments. With guaranteed data consistency and continuous availability, HDInsight customers can easily set up hybrid environments and a disaster recovery solution. Try WANdisco on HDInsight. How to replicate data for hybrid and disaster recovery solutions.

Customer or partner, there’s no better time to start!

If you are a customer, there is no better time to include a partner in the architecture and build out of your solution. They facilitate and accelerate your path to production. You will reap savings through their knowledge, experience and specialization.

Azure HDInsight Application Platform is one of the fastest growing partner ecosystems in Azure. Partner contribution increased by over 200 percent last year. Customers are discovering partner solutions organically, and the platform is driving more awareness of the ecosystem among customers. There is strong momentum of customers moving their Hadoop/Spark solutions to the cloud, and with the price cut of more than 50 percent on HDInsight there is more opportunity than ever to innovate.

If you are a potential partner and think your company could help complete Microsoft’s analytical offerings, there has never been a better time to partner with Microsoft and help customers be successful in their analytical journey. Come join us! Contact bigdatapartners@microsoft.com