
Today we are sharing an update to the Azure HDInsight integration with Azure Data Lake Storage Gen2. This integration enables HDInsight customers to securely run analytics on data stored in Azure Data Lake Storage Gen2 using popular open source frameworks such as Apache Spark, Hive, MapReduce, Kafka, Storm, and HBase.

Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is the only data lake designed specifically for enterprises to run large scale analytics workloads in the cloud. It unifies the core capabilities from the first generation of Azure Data Lake with a Hadoop compatible file system endpoint now directly integrated into Azure Blob Storage. This enhancement combines the scale and cost benefits of object storage with the reliability and performance typically associated only with on-premises file systems. This new file system includes a full hierarchical namespace that makes files and folders first class citizens, translating to faster, more reliable analytics job execution.

Azure Data Lake Storage Gen2 also includes limitless storage, ensuring capacity to meet the needs of even the largest, most complex workloads. In addition, Azure Data Lake Storage Gen2 integrates natively with Azure Active Directory and supports POSIX-compliant ACLs to enable granular permission assignments on files and folders.

Key benefits

Hadoop compatible access

Azure Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The ABFS driver is available within all Apache Hadoop environments. File systems are well understood by developers and users alike. There is no need to learn a new storage paradigm when you move to the cloud as the file system interface exposed by Azure Data Lake Storage Gen2 is the same paradigm used by computers, large and small.
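As a rough illustration of what Hadoop-compatible access looks like in practice, the sketch below builds the kind of URI the ABFS driver accepts. The `abfs[s]://<file system>@<account>.dfs.core.windows.net/<path>` layout is the driver's URI scheme; the helper name `abfs_uri` and the account and file system names are hypothetical and used only for this example.

```python
def abfs_uri(account: str, filesystem: str, path: str = "", secure: bool = True) -> str:
    """Build an ABFS URI for a path inside an ADLS Gen2 file system.

    `account` and `filesystem` are placeholders supplied by the caller;
    this helper is illustrative and not part of any Azure SDK.
    """
    scheme = "abfss" if secure else "abfs"  # abfss is ABFS over TLS
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical account and file system names.
uri = abfs_uri("contosolake", "analytics", "/raw/events/2019/01")
print(uri)
```

A framework such as Spark or Hive can then be pointed at a URI of this form in the same way it would be pointed at an `hdfs://` path.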

Role based access control

The security model for Azure Data Lake Storage Gen2 supports ACL and POSIX permissions.

These storage ACL capabilities, along with fine-grained access control via Apache Ranger in HDInsight for applications such as Spark, Kafka, Hive, and HBase, make it convenient to open up your data lake to your entire organization with appropriate security controls and auditing in place.

SSL only access

With this update, Azure Data Lake Storage Gen2 accounts can be accessed only over HTTPS, ensuring that all communication between HDInsight and storage is encrypted.

Global availability

Azure Data Lake Storage Gen 2 and HDInsight are available across the globe, offering the scale needed to bring big data applications closer to users around the world, preserving data residency, and offering comprehensive compliance and resiliency options for customers.

Atomic directory manipulation

Object stores approximate a directory hierarchy by adopting a convention of embedding slashes (/) in the object name to denote path segments. While this convention works for organizing objects, the convention provides no assistance for actions like moving, renaming, or deleting directories. Without real directories, applications must process potentially millions of individual blobs to achieve directory-level tasks. By contrast, the hierarchical namespace processes these tasks by updating a single entry (the parent directory).

This dramatic optimization is especially significant for many big data analytics frameworks. Tools like Hive and Spark often write output to temporary locations and then rename the location at the conclusion of the job. Without the hierarchical namespace, this rename can often take longer than the analytics process itself. Lower job latency equals lower total cost of ownership (TCO) for analytics workloads.
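The difference can be sketched with a toy model. The code below is not how Azure Storage is implemented; it only counts operations for a directory rename in a flat key space versus a tree-shaped namespace, assuming a hypothetical job that wrote 10,000 part files to a temporary location. All function and variable names are invented for this example.

```python
def rename_flat(blobs: dict, src: str, dst: str) -> int:
    """Rename a 'directory' in a flat object store; returns operations performed."""
    ops = 0
    # Every blob whose name starts with the prefix must be rewritten individually.
    for name in [n for n in blobs if n.startswith(src + "/")]:
        blobs[dst + name[len(src):]] = blobs.pop(name)
        ops += 1
    return ops


def rename_hierarchical(root: dict, parent: list, src: str, dst: str) -> int:
    """Rename a directory in a tree namespace; only the parent's entry changes."""
    node = root
    for part in parent:
        node = node[part]
    node[dst] = node.pop(src)
    return 1


# Toy data: a job wrote 10,000 part files under a temporary location.
flat = {f"out/_temporary/part-{i:05d}": b"" for i in range(10_000)}
tree = {"out": {"_temporary": {f"part-{i:05d}": b"" for i in range(10_000)}}}

flat_ops = rename_flat(flat, "out/_temporary", "out/final")
tree_ops = rename_hierarchical(tree, ["out"], "_temporary", "final")
print(flat_ops, tree_ops)  # 10000 1
```

In the flat model the cost grows with the number of files under the prefix, while in the tree model it stays constant, which is why the commit-by-rename pattern benefits so much from a hierarchical namespace.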

Scale

HDInsight and Azure Data Lake Storage Gen2 bring new levels of scale for big data workloads. Customers can run workloads with throughput in the hundreds of Gb/sec against petabytes of storage without needing to shard the data across multiple storage accounts.

Encryption at rest

Encryption in Azure Data Lake Storage Gen2 helps you protect your data, implement enterprise security policies, and meet regulatory compliance requirements. Azure Data Lake Storage Gen 2 supports encryption of data both at rest and in transit.

Network firewall

Integrated network firewall capabilities allow you to define rules restricting access only to requests originating from specified networks or HDInsight clusters in a specific VNET.
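Conceptually, a network rule of this kind is an allow-list check on the source address of each request. The sketch below illustrates that idea only; in practice the rules are configured on the storage account rather than written in code, and the CIDR ranges and the `is_allowed` helper here are hypothetical.

```python
import ipaddress

# Hypothetical allow-list: one range standing in for a virtual network subnet,
# one standing in for an approved external address range.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("10.1.0.0/16", "203.0.113.0/24")
]


def is_allowed(source_ip: str) -> bool:
    """Return True if the request's source address falls in an allowed network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


print(is_allowed("10.1.4.7"))      # True
print(is_allowed("198.51.100.9"))  # False
```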

How does the integration work?

HDInsight and Azure Data Lake Storage Gen2 integration is based upon user-assigned managed identity. You assign appropriate access to HDInsight with your Azure Data Lake Storage Gen2 accounts. Once configured, your HDInsight cluster is able to use Azure Data Lake Storage Gen2 as its storage.

1. Create an Azure storage account and enable the Data Lake Storage Gen2 preview.

Storage configuration in Data Lake Storage Gen 2

2. Create a user-assigned managed identity.

Creating a user managed identity

3. Assign Storage Blob Data Owner access to the created managed identity on Azure Storage.

Assigning Storage Blob Data Owner access

4. Now you can create the HDInsight cluster. In the storage blade, select the storage account and the associated user-assigned managed identity, then continue with the cluster creation workflow.

Selecting the storage account and associated managed user identity

Getting started

Start using Azure Data Lake Storage Gen2 with Azure HDInsight today.

Feedback

We look forward to your comments and feedback. If there are any feature requests, customer asks, or suggestions, please contact us at askhdinsight@microsoft.com.
