
New security, performance and ISV solutions build on Azure HDInsight’s leadership to make Hadoop enterprise-ready for the cloud

This week in New York, thousands of people are at Strata + Hadoop World to explore the technology and business of big data and advanced analytics. As part of our participation in the conference, we are pleased to announce new capabilities in Azure HDInsight, Microsoft’s managed Hadoop and Spark cloud service. These capabilities build on our leadership to make Hadoop enterprise-ready in the cloud and easy for your users: the most security capabilities of any cloud Hadoop solution, big data query speeds that approach data warehousing performance, and new notebook experiences for data scientists, all on the latest Hortonworks Data Platform 2.5 and Spark 2.0.

The highest levels of security in a managed Cloud Hadoop solution

To support the adoption of Hadoop in the cloud, Microsoft understands that enterprises need peace of mind that the solution will help protect sensitive corporate data and intellectual property. With the new security features of Azure HDInsight, we provide you with the highest levels of security for authentication, authorization, auditing and encryption available in the cloud for Hadoop.

Authentication and identity management in a few clicks

Azure HDInsight is the first big data service to seamlessly integrate Azure Active Directory and Azure Active Directory Domain Services for enterprise-grade authentication and identity management. Integration takes a few clicks, making it easy to secure your Hadoop clusters. It also lets you leverage Azure Active Directory, which currently supports 1.3 billion daily authentications across 600 million user accounts. You can build sophisticated access control policies around users or security groups, supported by features such as multifactor authentication.

Authorization with central security policy administration and auditing

Azure HDInsight is the first managed cloud Hadoop service to include Apache Ranger, which provides a central policy and management portal where administrators can author and maintain fine-grained access control policies over Hadoop data access, components and services. In addition, you can now analyze detailed audit records in the familiar Apache Ranger user interface.
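
To make the idea of central, fine-grained policies with auditing concrete, here is a minimal sketch in Python. It is a toy model only and does not use or reproduce Apache Ranger's actual API or policy format; the `Policy` and `PolicyStore` names, the glob-style resource patterns, and the sample users and groups are all hypothetical.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Policy:
    """One allow rule: who may perform which access types on which resource."""
    resource: str                   # glob pattern, e.g. "sales.orders.*" (hypothetical naming)
    access_types: frozenset         # e.g. frozenset({"select"})
    users: frozenset = frozenset()
    groups: frozenset = frozenset()


@dataclass
class PolicyStore:
    """Central store that answers access checks and records every decision."""
    policies: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, user_groups, resource, access_type):
        # Access is granted if any policy matches the resource and access type
        # and names either the user or one of the user's groups.
        allowed = any(
            fnmatchcase(resource, p.resource)
            and access_type in p.access_types
            and (user in p.users or bool(p.groups & set(user_groups)))
            for p in self.policies
        )
        # Every decision, allowed or denied, is appended to the audit trail.
        self.audit_log.append((user, resource, access_type, allowed))
        return allowed


store = PolicyStore([
    Policy("sales.orders.*", frozenset({"select"}), groups=frozenset({"analysts"})),
])
print(store.is_allowed("alice", ["analysts"], "sales.orders.amount", "select"))
print(store.is_allowed("alice", ["analysts"], "sales.orders.amount", "update"))
```

The point of the sketch is the shape of the design: policies live in one place, checks are expressed in terms of users, groups, resources and access types, and each check leaves an audit record that an administrator can review later.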

Encryption for data protection

Data processed by Azure HDInsight is stored in Azure Data Lake Store or Azure Storage, both of which offer server-side encryption as an option to secure data at rest. The encryption works transparently with HDInsight, with no extra configuration needed. For Azure Data Lake Store, enterprises can rely on service-managed encryption keys or manage their own keys in Azure Key Vault. Azure Key Vault protects your keys using hardware security modules and gives you the ability to revoke access to the keys at any time.

These advanced security capabilities will be available as a public preview in October.

HDInsight now at data warehousing speeds with the latest Hive using LLAP

Microsoft has been involved from the beginning in making Hive run faster, with our contributions to Project Stinger and Tez that sped up Hive query performance 100x. We are now pleased to be the first cloud Hadoop solution to onboard LLAP (Live Long and Process) from the Stinger.Next initiative, which promises sub-second querying on big data and is up to 25x faster than existing Hive.

LLAP keeps compressed data in memory while retaining the ability to scale elastically within a Hadoop cluster. It also brings many enhancements to the Hive execution engine, such as smarter map joins, better map join vectorization, a fully vectorized pipeline, and a smarter cost-based optimizer. In addition to these LLAP enhancements, the latest version of Hive also has faster type conversions, dynamic partitioning optimizations and vectorization support for text files. Collectively, these enhancements have brought a speed improvement of up to 25x when comparing LLAP to Hive on Tez, opening up new scenarios for interactive BI and reporting on top of big data.

In addition, Microsoft has partnered with Simba to deliver an ODBC driver for Azure HDInsight that can be used with world-class BI tools like Power BI, Tableau and QlikView. Together, this allows business analysts to gain insights over big data using their tool of choice. 

Figure 1: Hive on Tez vs. Hive with LLAP. Hortonworks TPC-DS benchmark on 15 queries using the hive-testbench repository.

Microsoft continues commitment to Spark with a fully managed, SLA-backed Spark 2.0 offering

Spark 2.0 is a major release that overhauls the core query engine with “Project Tungsten,” which upgrades Spark with capabilities of a modern compiler to perform cache-efficient vectorized computations. This has enabled up to 10x faster performance with Spark 2.0 on an already-fast platform. In addition to faster performance, Spark 2.0 also has broader support of the SQL syntax, an improved streaming engine that makes it easier to build real-time solutions, improvements to the Machine Learning pipelines, and more algorithms supported in SparkR. Finally, in response to customer demand, Microsoft and Hortonworks included 100+ fixes for Spark 2.0, improving its stability for production deployments.

With the latest release of Apache HBase for HDInsight, we are also introducing a Spark-HBase connector, letting you use the performance and power of Spark SQL to query HBase. This lets you perform advanced analytics on top of all the data available in your NoSQL database.
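
As a rough illustration of what querying HBase through Spark SQL involves, the sketch below builds the kind of JSON catalog that maps HBase column families and qualifiers onto Spark SQL columns. It assumes the connector follows the catalog layout of the open-source Hortonworks Spark-HBase Connector (shc), which this post does not confirm; the `build_hbase_catalog` helper, the `Contacts` table, and all column names are hypothetical.

```python
import json


def build_hbase_catalog(table, row_key, columns, namespace="default"):
    """Return a JSON catalog mapping HBase cells to Spark SQL columns.

    `columns` maps a Spark column name to a
    (column_family, qualifier, type) tuple.
    """
    mapping = {
        # Assumption: the row key is exposed as a column under the
        # pseudo column family "rowkey", as in the shc catalog layout.
        row_key: {"cf": "rowkey", "col": row_key, "type": "string"},
    }
    for name, (family, qualifier, sql_type) in columns.items():
        mapping[name] = {"cf": family, "col": qualifier, "type": sql_type}
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": row_key,
        "columns": mapping,
    })


catalog = build_hbase_catalog(
    table="Contacts",  # hypothetical HBase table
    row_key="key",
    columns={
        "office_address": ("Office", "Address", "string"),
        "personal_phone": ("Personal", "Phone", "string"),
    },
)
print(catalog)

# Assumed usage on a cluster with the connector on the Spark classpath
# (not runnable here, shown only to indicate where the catalog goes):
#   df = (spark.read.options(catalog=catalog)
#         .format("org.apache.spark.sql.execution.datasources.hbase")
#         .load())
#   df.createOrReplaceTempView("contacts")
#   spark.sql("SELECT office_address FROM contacts WHERE key = '1000'")
```

Once a table is described this way, it can be registered as a view and queried with ordinary SQL, which is what lets analytics written for Spark SQL run over data that lives in HBase.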

Both the latest Hortonworks Data Platform 2.5 and Spark 2.0 will be available in Azure HDInsight later today. Hive with LLAP is a new cluster type, available as a public preview.

New data science experiences with Zeppelin notebook support

Our goal with big data is to make it accessible for everybody. With Spark for HDInsight, we have designed productivity experiences for the different audiences that use Spark, including the data engineer working on ETL jobs with IntelliJ support, the data scientists performing experimentation with R Server and Jupyter notebook support, and the business analysts creating dashboards with Power BI, Tableau, SAP Lumira and Qlik support.

As part of HDInsight’s support for Hortonworks Data Platform 2.5, we now provide out-of-the-box support for Zeppelin notebooks, available later today. This gives data scientists even more options to create narratives that combine code, statistical equations and visualizations to tell a story about the data.

The easiest way to spin up third-party ISV applications with HDInsight

In the broader ecosystem for Hadoop, there is a thriving market of independent software vendors (ISVs) that provide value-added solutions for data preparation, visualization, advanced security and streaming. In the past, these applications would sit outside the cluster, which required spinning up separate virtual machines and limited connectivity to the Hadoop cluster. Azure HDInsight introduced a way for ISVs such as Datameer to run their applications directly on HDInsight clusters, letting customers spin up Hadoop and Spark clusters that are pre-integrated and pre-tuned with the ISV application out-of-the-box.

“Azure HDInsight Application Platform is the most robust and stable framework we've seen to quickly configure and test Datameer deployments in the cloud,” says Stefan Groschupf, Datameer CEO. “We had all the flexibility to iteratively test different deployment options for our solution as well as marketing collateral within the same portal. It is by far the easiest and fastest way to take your cloud-based solution to market. As a partner, HDInsight application platform has allowed us to connect with customers easily and reduce the time for customers to try Datameer on HDInsight.”

Today, we are excited to announce that new partners Cask and StreamSets join the Azure HDInsight ISV program. Cask provides a self-service, extendable open source framework to visually develop, run, automate and operate data pipelines. StreamSets Dataflow Performance Manager provides a single pane of glass for management of big data flows, so enterprises can map and measure all their data in motion.

This week the big data world is focused on Strata + Hadoop World, a great event for the industry and community. It’s exciting to consider the new ideas and innovations happening around the world every day with data. Here at Microsoft, we’re thrilled to be part of it and to fuel that innovation with data solutions that give customers simple but powerful capabilities, using their choice of tools and platforms in the cloud.