Today we are sharing an update to the Azure HDInsight integration with Azure Data Lake Storage Gen2. This integration enables HDInsight customers to securely run analytics on data stored in Azure Data Lake Storage Gen2 using popular open source frameworks such as Apache Spark, Hive, MapReduce, Kafka, Storm, and HBase.
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 is the only data lake designed specifically for enterprises to run large scale analytics workloads in the cloud. It unifies the core capabilities from the first generation of Azure Data Lake with a Hadoop compatible file system endpoint now directly integrated into Azure Blob Storage. This enhancement combines the scale and cost benefits of object storage with the reliability and performance typically associated only with on-premises file systems. This new file system includes a full hierarchical namespace that makes files and folders first class citizens, translating to faster, more reliable analytics job execution.
Azure Data Lake Storage Gen2 also includes limitless storage, ensuring capacity to meet the needs of even the largest, most complex workloads. In addition, Azure Data Lake Storage Gen2 integrates natively with Azure Active Directory and supports POSIX-compliant ACLs to enable granular permission assignments on files and folders.
Hadoop compatible access
Azure Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The ABFS driver is available within all Apache Hadoop environments. File systems are well understood by developers and users alike. There is no need to learn a new storage paradigm when you move to the cloud as the file system interface exposed by Azure Data Lake Storage Gen2 is the same paradigm used by computers, large and small.
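As a rough illustration of what Hadoop compatible access looks like in practice, the sketch below builds the kind of URI that the ABFS driver resolves. It assumes the commonly documented form abfs[s]://<file_system>@<account>.dfs.core.windows.net/<path>. The helper name abfs_uri, the account name examplelake, and the file system name data are hypothetical and chosen only for this example.

```python
from urllib.parse import quote


def abfs_uri(file_system: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build an ABFS URI (assumed form: abfs[s]://<fs>@<account>.dfs.core.windows.net/<path>)."""
    # "abfss" selects the TLS-secured variant of the scheme.
    scheme = "abfss" if secure else "abfs"
    # Percent-encode unsafe characters but keep "/" as the path separator.
    clean_path = quote(path.lstrip("/"))
    return f"{scheme}://{file_system}@{account}.dfs.core.windows.net/{clean_path}"


# Hypothetical names, for illustration only.
uri = abfs_uri("data", "examplelake", "/raw/events/2019/01")
print(uri)  # abfss://data@examplelake.dfs.core.windows.net/raw/events/2019/01
```

A URI like this can typically be used wherever a Hadoop framework expects a file system path, in the same way an hdfs:// path would be.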
Role based access control
The security model for Azure Data Lake Storage Gen2 supports ACL and POSIX permissions.
These storage ACL capabilities, along with fine-grained access control via Apache Ranger in HDInsight for applications such as Spark, Kafka, Hive, and HBase, make it convenient to open up your data lake to your entire organization with appropriate security controls and auditing in place.
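To make the idea of granular, POSIX-style permissions more concrete, here is a minimal sketch that parses an ACL written in the short text form used by POSIX ACL tools (for example user::rwx,group::r-x,other::---). It assumes this comma-separated format. The AclEntry type, the parse_acl function, and the identity analyst-id are hypothetical names introduced for illustration, not part of any Azure SDK.

```python
from typing import List, NamedTuple


class AclEntry(NamedTuple):
    scope: str      # "access" or "default"
    kind: str       # "user", "group", "mask", or "other"
    qualifier: str  # empty for the owning user/group, otherwise an identity ID
    read: bool
    write: bool
    execute: bool


def parse_acl(spec: str) -> List[AclEntry]:
    """Parse a POSIX-style short-form ACL string into structured entries."""
    entries = []
    for raw in spec.split(","):
        parts = raw.strip().split(":")
        scope = "access"
        # An optional "default:" prefix marks an entry inherited by new children.
        if parts[0] == "default":
            scope = "default"
            parts = parts[1:]
        if len(parts) != 3 or parts[0] not in {"user", "group", "mask", "other"}:
            raise ValueError(f"malformed ACL entry: {raw!r}")
        kind, qualifier, perms = parts
        # Each position must be its flag letter (r, w, x) or "-".
        if len(perms) != 3 or any(c not in (flag, "-") for c, flag in zip(perms, "rwx")):
            raise ValueError(f"malformed permissions: {perms!r}")
        entries.append(
            AclEntry(scope, kind, qualifier, perms[0] == "r", perms[1] == "w", perms[2] == "x")
        )
    return entries


# "analyst-id" is a placeholder identity used only in this example.
acl = parse_acl("user::rwx,user:analyst-id:r-x,group::r-x,other::---")
for entry in acl:
    print(entry)
```

In this example the named user entry grants read and execute but not write, which is the kind of per-identity assignment on a file or folder that the ACL model allows.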
SSL only access
With this update, ADLS Gen2 accounts can be accessed only over HTTPS, ensuring that all communication between HDInsight and storage is encrypted.
Azure Data Lake Storage Gen2 and HDInsight are available across the globe, offering the scale needed to bring big data applications closer to users around the world, preserving data residency, and offering comprehensive compliance and resiliency options for customers.
Atomic directory manipulation
Object stores approximate a directory hierarchy by adopting a convention of embedding slashes (/) in the object name to denote path segments. While this convention works for organizing objects, the convention provides no assistance for actions like moving, renaming, or deleting directories. Without real directories, applications must process potentially millions of individual blobs to achieve directory-level tasks. By contrast, the hierarchical namespace processes these tasks by updating a single entry (the parent directory).
This dramatic optimization is especially significant for many big data analytics frameworks. Tools like Hive and Spark often write output to temporary locations and then rename the location at the conclusion of the job. Without the hierarchical namespace, this rename can often take longer than the analytics process itself. Lower job latency equals lower total cost of ownership (TCO) for analytics workloads.
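The difference between the two models can be shown with a toy in-memory simulation. This is only a sketch of the idea, not how the storage service is implemented: the flat store is modeled as a dictionary keyed by full object name, the hierarchical store as nested dictionaries, and the function names and paths are hypothetical.

```python
from typing import Dict


def rename_flat(objects: Dict[str, bytes], src: str, dst: str) -> int:
    """Flat namespace: every object under the prefix must be rewritten individually."""
    src_prefix = src.rstrip("/") + "/"
    dst_prefix = dst.rstrip("/") + "/"
    moved = 0
    # Snapshot the matching keys first because the dict is mutated in the loop.
    for key in [k for k in objects if k.startswith(src_prefix)]:
        objects[dst_prefix + key[len(src_prefix):]] = objects.pop(key)
        moved += 1
    return moved


def _walk(root: dict, parts: list) -> dict:
    """Follow a list of directory names down from the root."""
    node = root
    for part in parts:
        node = node[part]
    return node


def rename_hierarchical(root: dict, src: str, dst: str) -> int:
    """Hierarchical namespace: re-link one directory entry, whatever the subtree size."""
    *src_parent, src_name = src.strip("/").split("/")
    *dst_parent, dst_name = dst.strip("/").split("/")
    subtree = _walk(root, src_parent).pop(src_name)
    _walk(root, dst_parent)[dst_name] = subtree
    return 1


# A job writes 1,000 part files to a temporary location, then renames it.
flat = {f"output/_temporary/part-{i:05d}": b"" for i in range(1000)}
print(rename_flat(flat, "output/_temporary", "output/final"))  # 1000

tree = {"output": {"_temporary": {f"part-{i:05d}": b"" for i in range(1000)}}}
print(rename_hierarchical(tree, "output/_temporary", "output/final"))  # 1
```

The flat rename costs one operation per object, so it grows with the size of the job output, while the hierarchical rename stays constant.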
HDInsight and Azure Data Lake Storage Gen2 bring new levels of scale for big data workloads. Customers can run workloads that scale to hundreds of Gb/sec of throughput and petabytes of storage without needing to shard the data across multiple storage accounts.
Encryption at rest
Encryption in Azure Data Lake Storage Gen2 helps you protect your data, implement enterprise security policies, and meet regulatory compliance requirements. Azure Data Lake Storage Gen2 supports encryption of data both at rest and in transit.
Integrated network firewall capabilities allow you to define rules restricting access only to requests originating from specified networks or HDInsight clusters in a specific VNET.
How does the integration work?
HDInsight and Azure Data Lake Storage Gen2 integration is based upon user-assigned managed identity. You assign appropriate access to HDInsight with your Azure Data Lake Storage Gen2 accounts. Once configured, your HDInsight cluster is able to use Azure Data Lake Storage Gen2 as its storage.
1. Create an Azure storage account and enable the Data Lake Storage Gen2 preview.
2. Create a user-assigned managed identity.
3. Assign the Storage Blob Data Owner role to the created managed identity on the storage account.
4. Create the HDInsight cluster. In the storage blade, select the storage account and the associated user-assigned managed identity, then proceed with the cluster creation workflow.
We look forward to your comments and feedback. If there are any feature requests, customer asks, or suggestions, please contact us at firstname.lastname@example.org.
- Azure Data Lake Storage Gen2 introduction
- Hierarchical Namespace concept
- Create HDInsight cluster with ADLS Gen2
- Learn more about Azure HDInsight.
- Read the Open Source component guide on HDInsight.
- Review the HDInsight release notes.
- Ask HDInsight questions on MSDN forums.
- Ask HDInsight questions on StackOverflow.